* [RFC 0/7] Peer-direct memory
@ 2016-02-11 16:12 Artemy Kovalyov
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
The following set of patches implements Peer-Direct support
over the RDMA stack.
Peer-Direct technology allows RDMA operations to directly target memory
in external hardware devices, such as GPU cards, SSD based storage,
dedicated ASIC accelerators, etc.
This technology allows RDMA-based applications (over InfiniBand/RoCE) to avoid
unneeded data copying when sharing data between peer hardware devices.
The recently introduced ZONE_DEVICE patch [1] allows devices to be registered as
providers of "device memory" regions, making RDMA operations on them
transparently available. This patch set is intended for scenarios that do not fit
into the ZONE_DEVICE infrastructure, but where a device still wants to expose its
IO regions to RDMA access.
To implement this technology, we defined an API to securely expose the memory
of a hardware device (peer memory) to an RDMA hardware device.
This cover letter describes the API defined for Peer-Direct.
It also details the required implementation for a hardware device to expose
memory buffers over Peer-Direct.
In addition, it describes the flow and the API that the IB core and low level IB hardware
drivers implement to support the technology.
Flow:
-----------------
Each peer memory client should register itself with the IB core (ib_core) module and
provide a set of callbacks to manage its basic memory functionality.
The required functionality includes getting page descriptors based upon a user space
virtual address, DMA mapping these pages, getting the memory page size,
removing the DMA mapping of the pages, and releasing the page descriptors.
Those callbacks are quite similar to the kernel API used to pin normal host memory
and expose it to the hardware.
Detailed description of the API is included later in this cover letter.
The peer direct controller, implemented as part of the IB core services, provides registry
and brokering services between peer memory providers and low level IB hardware drivers.
This makes the usage of peer-direct almost completely transparent to the individual hardware drivers.
The only change required in the low level IB hardware drivers is support for an interface
for immediate invalidation of registered memory regions.
The IB hardware driver should call ib_umem_get with an extra flag signaling
that the requested memory may reside on a peer memory. When a given
user space virtual memory address is found to belong to a peer memory
client, an ib_umem is built using the callbacks provided by that peer
memory client. If the IB hardware driver supports invalidation
on that ib_umem, it must signal this as part of ib_umem_get; otherwise,
if the peer memory requires invalidation support, the registration will
be rejected.
After getting the ib_umem, if it resides on a peer memory that requires
invalidation support, the low level IB hardware driver must register the
invalidation callback for this ib_umem.
If this callback is called, the driver should ensure that no access to
the memory mapped by the umem will happen once the callback returns.
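As an illustration, here is a minimal, hedged sketch of the low level IB hardware
driver side of this flow. It assumes the APIs introduced later in this series
(ib_umem_get_flags with the IB_UMEM_PEER_ALLOW/IB_UMEM_PEER_INVAL_SUPP flags, and
ib_umem_activate_invalidation_notifier); struct my_mr, my_hw_kill_mr() and
my_reg_peer_mr() are hypothetical driver-side names, not part of the patches:

#include <rdma/ib_verbs.h>
#include <rdma/ib_umem.h>

struct my_mr {				/* hypothetical driver MR bookkeeping */
	struct ib_umem *umem;
};

/* Invalidation callback: once it returns, the HW must no longer access
 * the memory mapped by this umem.
 */
static void my_mr_invalidate(void *cookie, struct ib_umem *umem,
			     unsigned long addr, size_t size)
{
	struct my_mr *mr = cookie;

	my_hw_kill_mr(mr);		/* hypothetical: stop HW access to the MR */
}

static int my_reg_peer_mr(struct ib_pd *pd, u64 start, u64 length,
			  int access, struct my_mr *mr)
{
	struct ib_umem *umem;
	int ret;

	/* Allow the umem to come from a peer client and signal that this
	 * driver can handle immediate invalidation of the registration.
	 */
	umem = ib_umem_get_flags(pd->uobject->context, start, length, access,
				 IB_UMEM_PEER_ALLOW | IB_UMEM_PEER_INVAL_SUPP);
	if (IS_ERR(umem))
		return PTR_ERR(umem);

	mr->umem = umem;
	/* ... program the HW translation tables from umem->sg_head ... */

	if (umem->ib_peer_mem) {
		/* Peer-owned umem: register the invalidation callback. */
		ret = ib_umem_activate_invalidation_notifier(umem,
							     my_mr_invalidate,
							     mr);
		if (ret) {
			ib_umem_release(umem);
			return ret;
		}
	}
	return 0;
}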
===============================================================================
Peer memory API
===============================================================================
Peer client structure:
-------------------------------------------------------------------------------
struct peer_memory_client {
	int (*acquire)(unsigned long addr, size_t size, void **client_context);
	int (*get_pages)(unsigned long addr, size_t size, int write, int force,
			 struct sg_table *sg_head,
			 void *client_context, u64 core_context);
	int (*dma_map)(struct sg_table *sg_head, void *client_context,
		       struct device *dma_device, int dmasync, int *nmap);
	int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
			 struct device *dma_device);
	void (*put_pages)(struct sg_table *sg_head, void *client_context);
	unsigned long (*get_page_size)(void *client_context);
	void (*release)(void *client_context);
};
A detailed description of the above callbacks is provided as part of the first patch,
in the peer_mem.h header file.
-----------------------------------------------------------------------------------
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
int (**invalidate_callback)
(void *reg_handle, u64 core_context));
Description:
Each peer memory client should use this function to register as an available
peer memory client during its initialization. The callbacks provided
as part of the peer_client may be used later on by the IB core when
registering and unregistering its memory. When the invalidation callback
returns, the user of the allocation is guaranteed not to access it.
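For illustration, a minimal registration by a hypothetical peer client could look
as follows; the my_gpu_* callbacks are assumed to be implemented elsewhere in that
driver and are not part of this patch set:

#include <rdma/peer_mem.h>

/* Filled in by IB core at registration time if the client needs
 * invalidation support; used later to revoke pinned pages.
 */
static int (*my_invalidate_cb)(void *reg_handle, u64 core_context);
static void *my_reg_handle;

static struct peer_memory_client my_gpu_client = {
	.acquire	= my_gpu_acquire,
	.get_pages	= my_gpu_get_pages,
	.dma_map	= my_gpu_dma_map,
	.dma_unmap	= my_gpu_dma_unmap,
	.put_pages	= my_gpu_put_pages,
	.get_page_size	= my_gpu_get_page_size,
	.release	= my_gpu_release,
};

static int __init my_gpu_peer_init(void)
{
	/* Passing a non-NULL invalidate_callback pointer tells IB core that
	 * this client requires invalidation support for any memory it owns.
	 */
	my_reg_handle = ib_register_peer_memory_client(&my_gpu_client,
						       &my_invalidate_cb);
	return my_reg_handle ? 0 : -ENOMEM;
}

static void __exit my_gpu_peer_exit(void)
{
	ib_unregister_peer_memory_client(my_reg_handle);
}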
----------------------------------------------------------------------------------
void ib_unregister_peer_memory_client(void *reg_handle);
Description:
On unload, the peer memory client must unregister itself, to prevent
any additional callbacks to the unloaded module.
----------------------------------------------------------------------------------
The structure of the patchset
Patches 1-3:
This set of patches introduces the API and adds the required support to the IB core layer,
allowing peers to be registered and take part in the flow. The first
patch introduces the API, and the next two patches add the infrastructure to manage peer clients
and use their registration callbacks.
Patches 4-5,7:
Those patches add the required functionality for peers to notify IB core that
a specific registration should be invalidated.
Patch 6:
This patch adds a kernel module allowing RDMA transfers with various types of memory,
such as mmapped devices (that create PFN mappings) and mmapped files from DAX
filesystems.
[1] https://lkml.org/lkml/2015/8/25/841
Artemy Kovalyov (7):
IB/core: Introduce peer client interface
IB/core: Get/put peer memory client
IB/core: Umem tunneling peer memory APIs
IB/core: Infrastructure to manage peer core context
IB/core: Invalidation support for peer memory
IB/core: Peer memory client for IO memory
IB/mlx5: Invalidation support for MR over peer memory
drivers/infiniband/Kconfig | 19 ++
drivers/infiniband/core/Makefile | 5 +
drivers/infiniband/core/io_peer_mem.c | 332 ++++++++++++++++++++++++++++++++++
drivers/infiniband/core/peer_mem.c | 304 +++++++++++++++++++++++++++++++
drivers/infiniband/core/umem.c | 138 +++++++++++++-
drivers/infiniband/hw/mlx5/cq.c | 15 +-
drivers/infiniband/hw/mlx5/doorbell.c | 6 +-
drivers/infiniband/hw/mlx5/main.c | 3 +
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 ++
drivers/infiniband/hw/mlx5/mr.c | 166 +++++++++++++----
drivers/infiniband/hw/mlx5/qp.c | 3 +-
drivers/infiniband/hw/mlx5/srq.c | 4 +-
include/rdma/ib_peer_mem.h | 76 ++++++++
include/rdma/ib_umem.h | 61 ++++++-
include/rdma/ib_verbs.h | 5 +
include/rdma/peer_mem.h | 238 ++++++++++++++++++++++++
16 files changed, 1330 insertions(+), 57 deletions(-)
create mode 100644 drivers/infiniband/core/io_peer_mem.c
create mode 100644 drivers/infiniband/core/peer_mem.c
create mode 100644 include/rdma/ib_peer_mem.h
create mode 100644 include/rdma/peer_mem.h
--
1.8.4.3
* [RFC 1/7] IB/core: Introduce peer client interface
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 2/7] IB/core: Get/put peer memory client Artemy Kovalyov
` (6 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Introduces an API between the IB core and peer memory clients (e.g. GPU cards)
to provide access for the HCA to read/write GPU memory.
As a result, it allows RDMA-based applications to use GPU computing power
and the RDMA interconnect at the same time without copying the data between the
P2P devices.
Each peer memory client should register with the IB core. In the registration
request, it should supply callbacks to its basic memory functionality, such
as get/put pages, get_page_size, and dma map/unmap.
The client can optionally require the ability to invalidate memory it
provided, by requesting the details of an invalidation callback.
Upon successful registration, the IB core will provide the client with a unique
registration handle and, in case required by the peer, an invalidate
callback function.
The handle should be used when unregistering the client; the callback
function can be used by the client in later patches to request that
pinned pages be released immediately.
Each peer must be able to recognize whether it's the owner of a specific
virtual address range. If so, further calls for memory
functionality will be tunneled to that peer.
The recognition is done via the 'acquire' call. The call arguments provide
the address and size of the memory requested. Upon recognition, the
acquire call returns a peer direct client specific context. The context
will be provided by the peer direct controller to the peer direct client
callbacks when referring to the specific address range.
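A hedged sketch of such an 'acquire' callback, following the return value
convention described above; my_gpu_find_range() and struct my_gpu_range are
hypothetical helpers of a peer client, not part of this patch:

/* Claim the range only if it falls entirely inside a region that this
 * hypothetical peer client tracks.
 */
static int my_gpu_acquire(unsigned long addr, size_t size,
			  void **client_context)
{
	struct my_gpu_range *range = my_gpu_find_range(addr, size);

	if (!range)
		return 0;		/* not ours; let other peers check */

	*client_context = range;	/* handed back in later callbacks */
	return 1;			/* this peer owns the address range */
}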
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 10 ++
drivers/infiniband/core/Makefile | 1 +
drivers/infiniband/core/peer_mem.c | 82 +++++++++++++
include/rdma/ib_peer_mem.h | 44 +++++++
include/rdma/peer_mem.h | 238 +++++++++++++++++++++++++++++++++++++
5 files changed, 375 insertions(+)
create mode 100644 drivers/infiniband/core/peer_mem.c
create mode 100644 include/rdma/ib_peer_mem.h
create mode 100644 include/rdma/peer_mem.h
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 8a8440c..2837d66 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -64,6 +64,16 @@ config INFINIBAND_ADDR_TRANS_CONFIGFS
This allows the user to config the default GID type that the CM
uses for each device, when initiaing new connections.
+config INFINIBAND_PEER_MEM
+ bool "InfiniBand Peer memory access"
+ depends on INFINIBAND_USER_MEM
+ depends on MMU_NOTIFIER
+ default y
+ ---help---
+ Peer memory access feature allows RDMA operations to directly target
+ memory in external hardware devices, such as GPU cards, SSD based
+ storage, dedicated ASIC accelerators, etc.
+
source "drivers/infiniband/hw/mthca/Kconfig"
source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index f818538..9882d00 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y := packer.o ud_header.o verbs.o cq.o sysfs.o \
roce_gid_mgmt.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_INFINIBAND_PEER_MEM) += peer_mem.o
ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
new file mode 100644
index 0000000..2c26a39
--- /dev/null
+++ b/drivers/infiniband/core/peer_mem.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_peer_mem.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_umem.h>
+
+static DEFINE_MUTEX(peer_memory_mutex);
+static LIST_HEAD(peer_memory_list);
+
+static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
+{
+ return -ENOSYS;
+}
+
+void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
+ int (**invalidate_callback)
+ (void *reg_handle, u64 core_context))
+{
+ struct ib_peer_memory_client *ib_peer_client;
+
+ ib_peer_client = kzalloc(sizeof(*ib_peer_client), GFP_KERNEL);
+ if (!ib_peer_client)
+ return NULL;
+
+ ib_peer_client->peer_mem = peer_client;
+ /* Once peer supplied a non NULL callback it's an indication that
+ * invalidation support is required for any memory owning.
+ */
+ if (invalidate_callback) {
+ *invalidate_callback = ib_invalidate_peer_memory;
+ ib_peer_client->invalidation_required = 1;
+ }
+
+ mutex_lock(&peer_memory_mutex);
+ list_add_tail(&ib_peer_client->core_peer_list, &peer_memory_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_register_peer_memory_client);
+
+void ib_unregister_peer_memory_client(void *reg_handle)
+{
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+
+ mutex_lock(&peer_memory_mutex);
+ list_del(&ib_peer_client->core_peer_list);
+ mutex_unlock(&peer_memory_mutex);
+
+ kfree(ib_peer_client);
+}
+EXPORT_SYMBOL(ib_unregister_peer_memory_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
new file mode 100644
index 0000000..cbe928e
--- /dev/null
+++ b/include/rdma/ib_peer_mem.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(IB_PEER_MEM_H)
+#define IB_PEER_MEM_H
+
+#include <rdma/peer_mem.h>
+
+struct ib_peer_memory_client {
+ const struct peer_memory_client *peer_mem;
+ struct list_head core_peer_list;
+ int invalidation_required;
+};
+
+#endif
diff --git a/include/rdma/peer_mem.h b/include/rdma/peer_mem.h
new file mode 100644
index 0000000..1ec96ea
--- /dev/null
+++ b/include/rdma/peer_mem.h
@@ -0,0 +1,238 @@
+/*
+ * Copyright (c) 2016, Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#if !defined(PEER_MEM_H)
+#define PEER_MEM_H
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/scatterlist.h>
+
+/**
+ * struct peer_memory_client - registration information for peer client.
+ * @acquire: callback function to be used by IB core to detect whether a
+ * virtual address is under the responsibility of a specific
+ * peer client.
+ * @get_pages: callback function to be used by IB core asking the peer client
+ * to pin the physical pages of the given address range and returns
+ * that information. It is equivalent to the kernel API of
+ * get_user_pages(), but targets peer memory.
+ * @dma_map: callback function to be used by IB core asking the peer client
+ * to fill the dma address mapping for a given address range.
+ * @dma_unmap: callback function to be used by IB core asking the peer client
+ * to take relevant actions to unmap the memory.
+ * @put_pages: callback function to be used by IB core asking the peer client
+ * to remove the pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ * @get_page_size: callback function to be used by IB core to query the peer
+ * client for the page size for the given allocation.
+ * @release: callback function to be used by IB core asking peer client to
+ * release all resources associated with previous acquire call.
+ * The call will be performed only for contexts that have been
+ * successfully acquired (i.e. acquire returned a non-zero value).
+ * Additionally, IB core guarantees that there will be no pages
+ * pinned through this context when the callback is called.
+ *
+ * The subsections in this description contain detailed description
+ * of the callback arguments and expected return values for the
+ * callbacks defined in this struct.
+ *
+ * acquire:
+ *
+ * Callback function to be used by IB core to detect
+ * whether a virtual address is under the responsibility
+ * of a specific peer client.
+ *
+ * addr [IN] - virtual address to be checked whether it belongs
+ * to the peer.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * client_context [OUT] - peer opaque data which holds
+ * a peer context for the acquired
+ * address range, will be provided
+ * back to the peer memory in
+ * subsequent calls for that given
+ * memory.
+ *
+ * If the peer takes responsibility for the given address range, further
+ * calls for memory management will be directed to the callbacks
+ * of this peer client.
+ *
+ * Return - 1 in case peer client takes responsibility on that
+ * range, negative value if error
+ * happens during process, 0 otherwise.
+ *
+ * get_pages:
+ *
+ * Callback function to be used by IB core asking the
+ * peer client to pin the physical pages of the given
+ * address range and returns that information. It
+ * is equivalent to the kernel API of get_user_pages(), but
+ * targets peer memory.
+ *
+ * addr [IN] - start virtual address of that given
+ * allocation.
+ *
+ * size [IN] - size of memory area starting at addr.
+ *
+ * write [IN] - indicates whether the pages will be
+ * written to by the caller. Same meaning
+ * as of kernel API get_user_pages, can be
+ * ignored if not relevant.
+ *
+ * force [IN] - indicates whether to force write access
+ * even if user mapping is read only. Same
+ * meaning as of kernel API get_user_pages,
+ * can be ignored if not relevant.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table.
+ * The peer client should allocate a
+ * table big enough to store all of the
+ * required entries. This function should
+ * fill the table with physical addresses
+ * and sizes of the memory segments
+ * composing this memory mapping. The
+ * table allocation can be done using
+ * sg_alloc_table. Filling in the
+ * physical memory addresses and size can
+ * be done using sg_set_page.
+ *
+ * client_context [IN] - peer context for the given allocation, as
+ * received from the acquire call.
+ *
+ * core_context [IN] - IB core context. If the peer client wishes
+ * to invalidate any of the pages pinned
+ * through this API, it must provide this
+ * context as an argument to the invalidate
+ * callback.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_map:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to fill the dma address mapping for a given address range.
+ *
+ * sg_head [IN/OUT] - pointer to head of struct sg_table.
+ * The peer memory should fill the
+ * dma_address and dma_length for each
+ * scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which
+ * requires access to the peer memory.
+ *
+ * dmasync [IN] - flush in-flight DMA when the memory region
+ * is written. Same meaning as with host
+ * memory mapping, can be ignored
+ * if not relevant.
+ *
+ * nmap [OUT] - number of mapped/set entries.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * dma_unmap:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to take relevant actions to unmap the memory.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ * The peer memory should release the
+ * dma_address and dma_length for each
+ * scatter gather entry in the table.
+ *
+ * client_context [IN] - peer context for the allocation mapped.
+ *
+ * dma_device [IN] - the RDMA capable device which requires
+ * access to the peer memory.
+ *
+ * Return - 0 success, otherwise errno error code.
+ *
+ * put_pages:
+ *
+ * Callback function to be used by IB core asking the peer client
+ * to remove the pinning from the given memory.
+ * It's the peer-direct equivalent of the kernel API put_page.
+ *
+ * sg_head [IN] - pointer to head of struct sg_table.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * get_page_size:
+ *
+ * Callback function to be used by IB core to query the
+ * peer client for the page size for the given
+ * allocation.
+ *
+ * client_context [IN] - peer context for that given allocation.
+ *
+ * Return - Page size in bytes
+ *
+ * release:
+ *
+ * Callback function to be used by IB core asking peer
+ * client to release all resources associated with
+ * previous acquire call. The call will be performed only
+ * for contexts that have been successfully acquired
+ * (i.e. acquire returned a non-zero value).
+ * Additionally, IB core guarantees that there will be no
+ * pages pinned through this context when the callback is
+ * called.
+ *
+ * client_context [IN] - peer context for the given allocation.
+ *
+ **/
+struct peer_memory_client {
+ int (*acquire)(unsigned long addr, size_t size, void **client_context);
+ int (*get_pages)(unsigned long addr, size_t size, int write, int force,
+ struct sg_table *sg_head,
+ void *client_context, u64 core_context);
+ int (*dma_map)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device, int dmasync, int *nmap);
+ int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
+ struct device *dma_device);
+ void (*put_pages)(struct sg_table *sg_head, void *client_context);
+ unsigned long (*get_page_size)(void *client_context);
+ void (*release)(void *client_context);
+};
+
+void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
+ int (**invalidate_callback)
+ (void *reg_handle, u64 core_context));
+void ib_unregister_peer_memory_client(void *reg_handle);
+
+#endif
--
1.8.4.3
* [RFC 2/7] IB/core: Get/put peer memory client
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-11 16:12 ` [RFC 1/7] IB/core: Introduce peer client interface Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 3/7] IB/core: Umem tunneling peer memory APIs Artemy Kovalyov
` (5 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Supplies an API to get/put peer client functionality.
It encapsulates the details of how to acquire/release a peer client from
its callers and lets them obtain the required peer client in case it exists.
The 'get' call iterates over the registered peer clients, looking for an
owner of a given address range by calling each peer's 'acquire' call.
In case an owner is found, the loop is stopped.
The 'put' call does the opposite: it lets the peer release its resources for
that given address range.
A reference counting/completion mechanism is used to prevent a peer
memory client from going down once there are active users for its memory.
In addition:
- an extra device capability named IB_DEVICE_PEER_MEMORY was introduced,
to be used by low level drivers to mark that they support this
functionality.
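A short, hedged sketch of the intended caller pattern inside the IB core (the
umem code in the next patch is the real user), using the four-argument
ib_get_peer_client()/ib_put_peer_client() pair added in this patch;
try_peer_path() and the -ENOENT fallback value are illustrative only:

#include <linux/err.h>
#include <rdma/ib_peer_mem.h>

static int try_peer_path(struct ib_ucontext *context, unsigned long addr,
			 size_t size)
{
	struct ib_peer_memory_client *peer;
	void *peer_ctx;

	peer = ib_get_peer_client(context, addr, size, &peer_ctx);
	if (IS_ERR(peer))
		return PTR_ERR(peer);	/* some client failed during acquire */
	if (!peer)
		return -ENOENT;		/* no peer owns this range */

	/* ... use peer->peer_mem callbacks with peer_ctx here ... */

	ib_put_peer_client(peer, peer_ctx);	/* drops the ref, calls release */
	return 0;
}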
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 55 ++++++++++++++++++++++++++++++++++++++
include/rdma/ib_peer_mem.h | 12 +++++++++
include/rdma/ib_verbs.h | 5 ++++
3 files changed, 72 insertions(+)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 2c26a39..74e4caa 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -42,6 +42,14 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
return -ENOSYS;
}
+static void complete_peer(struct kref *kref)
+{
+ struct ib_peer_memory_client *ib_peer_client =
+ container_of(kref, struct ib_peer_memory_client, ref);
+
+ complete(&ib_peer_client->unload_comp);
+}
+
void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
int (**invalidate_callback)
(void *reg_handle, u64 core_context))
@@ -52,6 +60,8 @@ void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
if (!ib_peer_client)
return NULL;
+ init_completion(&ib_peer_client->unload_comp);
+ kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
/* Once peer supplied a non NULL callback it's an indication that
* invalidation support is required for any memory owning.
@@ -77,6 +87,51 @@ void ib_unregister_peer_memory_client(void *reg_handle)
list_del(&ib_peer_client->core_peer_list);
mutex_unlock(&peer_memory_mutex);
+ kref_put(&ib_peer_client->ref, complete_peer);
+ wait_for_completion(&ib_peer_client->unload_comp);
kfree(ib_peer_client);
}
EXPORT_SYMBOL(ib_unregister_peer_memory_client);
+
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size,
+ void **peer_client_context)
+{
+ struct ib_peer_memory_client *ib_peer_client;
+ int ret;
+
+ mutex_lock(&peer_memory_mutex);
+ list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ ret = ib_peer_client->peer_mem->acquire(addr, size,
+ peer_client_context);
+ if (ret > 0)
+ goto found;
+
+ if (ret < 0) {
+ mutex_unlock(&peer_memory_mutex);
+ return ERR_PTR(ret);
+ }
+ }
+
+ ib_peer_client = NULL;
+
+found:
+ mutex_unlock(&peer_memory_mutex);
+
+ if (ib_peer_client)
+ kref_get(&ib_peer_client->ref);
+
+ return ib_peer_client;
+}
+EXPORT_SYMBOL(ib_get_peer_client);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context)
+{
+ if (ib_peer_client->peer_mem->release)
+ ib_peer_client->peer_mem->release(peer_client_context);
+
+ kref_put(&ib_peer_client->ref, complete_peer);
+}
+EXPORT_SYMBOL(ib_put_peer_client);
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index cbe928e..7f41ce5 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -35,10 +35,22 @@
#include <rdma/peer_mem.h>
+struct ib_ucontext;
+
struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
struct list_head core_peer_list;
int invalidation_required;
+ struct kref ref;
+ struct completion unload_comp;
};
+struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size,
+ void **peer_client_context);
+
+void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
+ void *peer_client_context);
+
#endif
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 284b00c..3917230 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -209,6 +209,11 @@ enum ib_device_cap_flags {
* by hardware.
*/
IB_DEVICE_CROSS_CHANNEL = (1 << 27),
+ /*
+ * Device supports RDMA access to memory registered by
+ * other locally connected devices (e.g. GPU).
+ */
+ IB_DEVICE_PEER_MEMORY = (1 << 28),
IB_DEVICE_MANAGED_FLOW_STEERING = (1 << 29),
IB_DEVICE_SIGNATURE_HANDOVER = (1 << 30),
IB_DEVICE_ON_DEMAND_PAGING = (1 << 31),
--
1.8.4.3
* [RFC 3/7] IB/core: Umem tunneling peer memory APIs
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-11 16:12 ` [RFC 1/7] IB/core: Introduce peer client interface Artemy Kovalyov
2016-02-11 16:12 ` [RFC 2/7] IB/core: Get/put peer memory client Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 4/7] IB/core: Infrastructure to manage peer core context Artemy Kovalyov
` (4 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Builds umem on top of the peer memory client functionality.
It tries to get a peer client for a given address range.
In case one is found, further memory calls are tunneled to that peer client.
ib_umem_get_flags was added as a successor of ib_umem_get to accept additional
flags, for instance an indication of whether this umem
can be owned by a peer client.
The deprecated ib_umem_get is left for backward compatibility.
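For illustration, a driver call site that wants to opt in to peer memory could
move to the new API as sketched below; mydrv_pin_umem() is a hypothetical
wrapper, not part of this patch:

#include <rdma/ib_umem.h>

static struct ib_umem *mydrv_pin_umem(struct ib_ucontext *ctx, u64 start,
				      u64 length, int access)
{
	/* Old, still-working form via the compatibility wrapper:
	 *	return ib_umem_get(ctx, start, length, access, 0);
	 * New form, allowing the range to be owned by a peer client:
	 */
	return ib_umem_get_flags(ctx, start, length, access,
				 IB_UMEM_PEER_ALLOW);
}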
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/umem.c | 100 ++++++++++++++++++++++++++++++++++++++---
include/rdma/ib_umem.h | 34 +++++++++++---
2 files changed, 123 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 38acb3c..2eab34e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -43,6 +43,63 @@
#include "uverbs.h"
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem, unsigned long addr,
+ int dmasync)
+{
+ int ret;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+
+ umem->ib_peer_mem = ib_peer_mem;
+ /*
+ * We always request write permissions to the pages, to force breaking
+ * of any CoW during the registration of the MR. For read-only MRs we
+ * use the "force" flag to indicate that CoW breaking is required but
+ * the registration should not fail if referencing read-only areas.
+ */
+ ret = peer_mem->get_pages(addr, umem->length,
+ 1, !umem->writable,
+ &umem->sg_head,
+ umem->peer_mem_client_context,
+ 0);
+ if (ret)
+ goto out;
+
+ umem->page_size = peer_mem->get_page_size
+ (umem->peer_mem_client_context);
+ ret = peer_mem->dma_map(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device,
+ dmasync,
+ &umem->nmap);
+ if (ret)
+ goto put_pages;
+
+ return umem;
+
+put_pages:
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+out:
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
+ return ERR_PTR(ret);
+}
+
+static void peer_umem_release(struct ib_umem *umem)
+{
+ const struct peer_memory_client *peer_mem =
+ umem->ib_peer_mem->peer_mem;
+
+ peer_mem->dma_unmap(&umem->sg_head,
+ umem->peer_mem_client_context,
+ umem->context->device->dma_device);
+ peer_mem->put_pages(&umem->sg_head,
+ umem->peer_mem_client_context);
+ ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ kfree(umem);
+}
+#endif
static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
@@ -69,7 +126,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
}
/**
- * ib_umem_get - Pin and DMA map userspace memory.
+ * ib_umem_get_flags - Pin and DMA map userspace memory.
*
* If access flags indicate ODP memory, avoid pinning. Instead, stores
* the mm for future page fault handling in conjunction with MMU notifiers.
@@ -78,10 +135,12 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
* @addr: userspace virtual address to start at
* @size: length of region to pin
* @access: IB_ACCESS_xxx flags for memory being pinned
- * @dmasync: flush in-flight DMA when the memory region is written
+ * @flags: IB_UMEM_xxx flags for memory being used
*/
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync)
+struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr,
+ size_t size, int access,
+ unsigned long flags)
{
struct ib_umem *umem;
struct page **page_list;
@@ -96,7 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
struct scatterlist *sg, *sg_list_start;
int need_release = 0;
- if (dmasync)
+ if (flags & IB_UMEM_DMA_SYNC)
dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
if (!size)
@@ -144,6 +203,28 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->odp_data = NULL;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (flags & IB_UMEM_PEER_ALLOW) {
+ struct ib_peer_memory_client *peer_mem_client;
+ struct ib_umem *peer_umem;
+
+ peer_mem_client =
+ ib_get_peer_client(context, addr, size,
+ &umem->peer_mem_client_context);
+ if (IS_ERR(peer_mem_client)) {
+ kfree(umem);
+ return ERR_CAST(peer_mem_client);
+
+ } else if (peer_mem_client) {
+ peer_umem = peer_umem_get(peer_mem_client, umem, addr,
+ flags & IB_UMEM_DMA_SYNC);
+ if (IS_ERR(peer_umem))
+ kfree(umem);
+ return peer_umem;
+ }
+ }
+#endif
+
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb = 1;
@@ -240,7 +321,7 @@ out:
return ret < 0 ? ERR_PTR(ret) : umem;
}
-EXPORT_SYMBOL(ib_umem_get);
+EXPORT_SYMBOL(ib_umem_get_flags);
static void ib_umem_account(struct work_struct *work)
{
@@ -264,6 +345,13 @@ void ib_umem_release(struct ib_umem *umem)
struct task_struct *task;
unsigned long diff;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (umem->ib_peer_mem) {
+ peer_umem_release(umem);
+ return;
+ }
+#endif
+
if (umem->odp_data) {
ib_umem_odp_release(umem);
return;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 2d83cfd..bb760f9 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -36,6 +36,9 @@
#include <linux/list.h>
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+#include <rdma/ib_peer_mem.h>
+#endif
struct ib_ucontext;
struct ib_umem_odp;
@@ -55,6 +58,12 @@ struct ib_umem {
struct sg_table sg_head;
int nmap;
int npages;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ /* peer memory that manages this umem */
+ struct ib_peer_memory_client *ib_peer_mem;
+ /* peer memory private context */
+ void *peer_mem_client_context;
+#endif
};
/* Returns the offset of the umem start relative to the first page. */
@@ -80,10 +89,16 @@ static inline size_t ib_umem_num_pages(struct ib_umem *umem)
return (ib_umem_end(umem) - ib_umem_start(umem)) >> PAGE_SHIFT;
}
+enum ib_peer_mem_flags {
+ IB_UMEM_DMA_SYNC = (1 << 0),
+ IB_UMEM_PEER_ALLOW = (1 << 1),
+};
+
#ifdef CONFIG_INFINIBAND_USER_MEM
+struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, unsigned long flags);
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
- size_t size, int access, int dmasync);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);
int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
@@ -93,11 +108,13 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
#include <linux/err.h>
-static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
- unsigned long addr, size_t size,
- int access, int dmasync) {
+static inline struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access,
+ unsigned long flags) {
return ERR_PTR(-EINVAL);
}
+
static inline void ib_umem_release(struct ib_umem *umem) { }
static inline int ib_umem_page_count(struct ib_umem *umem) { return 0; }
static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
@@ -106,4 +123,11 @@ static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
}
#endif /* CONFIG_INFINIBAND_USER_MEM */
+static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, int dmasync) {
+ return ib_umem_get_flags(context, addr, size, access,
+ dmasync ? IB_UMEM_DMA_SYNC : 0);
+}
+
#endif /* IB_UMEM_H */
--
1.8.4.3
* [RFC 4/7] IB/core: Infrastructure to manage peer core context
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (2 preceding siblings ...)
2016-02-11 16:12 ` [RFC 3/7] IB/core: Umem tunneling peer memory APIs Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 5/7] IB/core: Invalidation support for peer memory Artemy Kovalyov
` (3 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds an infrastructure to manage a core context for a given umem;
it is needed for the invalidation flow.
The core context is supplied to peer clients as opaque data for the given
memory pages represented by a umem.
If the peer client needs to invalidate memory it provided through the
peer memory callbacks, it should call the invalidation callback, supplying
the relevant core context. The IB core will use this context to invalidate
the relevant memory.
To handle cases where invalidation calls are in flight in parallel
with releasing this memory (e.g. by dereg_mr), we must ensure that the context
is valid before accessing it; that is why the core context
pointer cannot be used directly. For that reason we added a lookup table that maps
a ticket id to a core context. The peer client will get/supply the ticket
id, and the core will check whether it exists before accessing its corresponding
context.
The ticket id is provided to the peer memory client, as part of the
get_pages API. The only "remote" party using it is the peer memory
client. It is used in the invalidation flow, to specify which memory
registration should be invalidated. This flow might be called
asynchronously, in parallel to an ongoing dereg_mr operation. As such,
the invalidation flow might be called after the memory registration
has been completely released. Relying on a pointer-based, or IDR-based
ticket value can result in spurious invalidation of unrelated memory
regions. Internally, we carefully lock the data structures and
synchronize as needed when extracting the context from the
ticket. This ensures a proper, synchronized release of the memory
mapping. The ticket mechanism allows us to safely ignore inflight
invalidation calls that arrived too late.
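To illustrate where the ticket comes from on the peer side, here is a hedged
fragment of a hypothetical get_pages() implementation that simply records the
core_context for a possible later invalidation; struct my_gpu_range and the
pinning details are illustrative only:

#include <rdma/peer_mem.h>

struct my_gpu_range {		/* hypothetical per-acquire bookkeeping */
	u64 core_context;	/* ticket received from IB core */
	/* ... device-specific range description ... */
};

static int my_gpu_get_pages(unsigned long addr, size_t size, int write,
			    int force, struct sg_table *sg_head,
			    void *client_context, u64 core_context)
{
	struct my_gpu_range *range = client_context;

	/* Keep the ticket; it is the argument to pass to the invalidate
	 * callback if this memory must be revoked later.
	 */
	range->core_context = core_context;

	/* ... pin the device pages and fill sg_head using sg_alloc_table()
	 * and sg_set_page(), as described in peer_mem.h ...
	 */
	return 0;
}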
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 90 ++++++++++++++++++++++++++++++++++++++
include/rdma/ib_peer_mem.h | 19 ++++++++
include/rdma/ib_umem.h | 8 ++++
3 files changed, 117 insertions(+)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 74e4caa..57afb76 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -42,6 +42,93 @@ static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
return -ENOSYS;
}
+static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
+ void *context,
+ u64 *context_ticket)
+{
+ struct core_ticket *core_ticket;
+
+ core_ticket = kzalloc(sizeof(*core_ticket), GFP_KERNEL);
+ if (!core_ticket)
+ return -ENOMEM;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket->key = ib_peer_client->last_ticket++;
+ core_ticket->context = context;
+ list_add_tail(&core_ticket->ticket_list,
+ &ib_peer_client->core_ticket_list);
+ *context_ticket = core_ticket->key;
+ mutex_unlock(&ib_peer_client->lock);
+
+ return 0;
+}
+
+/* Caller should be holding the peer client lock, specifically,
+ * the caller should hold ib_peer_client->lock
+ */
+static int ib_peer_remove_context(struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket, *safe;
+
+ list_for_each_entry_safe(core_ticket, safe,
+ &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key) {
+ list_del(&core_ticket->ticket_list);
+ kfree(core_ticket);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+/**
+ * ib_peer_create_invalidation_ctx - creates invalidation context for given umem
+ * @ib_peer_mem: peer client to be used
+ * @umem: umem struct belongs to that context
+ * @invalidation_ctx: output context
+ */
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx)
+{
+ int ret;
+ struct invalidation_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ret = ib_peer_insert_context(ib_peer_mem, ctx,
+ &ctx->context_ticket);
+ if (ret) {
+ kfree(ctx);
+ return ret;
+ }
+
+ ctx->umem = umem;
+ umem->invalidation_ctx = ctx;
+ *invalidation_ctx = ctx;
+ return 0;
+}
+
+/**
+ * ib_peer_destroy_invalidation_ctx - destroy a given invalidation context
+ * @ib_peer_mem: peer client to be used
+ * @invalidation_ctx: context to be invalidated
+ */
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *invalidation_ctx)
+{
+ mutex_lock(&ib_peer_mem->lock);
+ ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
+ mutex_unlock(&ib_peer_mem->lock);
+
+ kfree(invalidation_ctx);
+}
+
static void complete_peer(struct kref *kref)
{
struct ib_peer_memory_client *ib_peer_client =
@@ -60,9 +147,12 @@ void *ib_register_peer_memory_client(struct peer_memory_client *peer_client,
if (!ib_peer_client)
return NULL;
+ INIT_LIST_HEAD(&ib_peer_client->core_ticket_list);
+ mutex_init(&ib_peer_client->lock);
init_completion(&ib_peer_client->unload_comp);
kref_init(&ib_peer_client->ref);
ib_peer_client->peer_mem = peer_client;
+ ib_peer_client->last_ticket = 1;
/* Once peer supplied a non NULL callback it's an indication that
* invalidation support is required for any memory owning.
*/
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 7f41ce5..6f3dc84 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -36,6 +36,8 @@
#include <rdma/peer_mem.h>
struct ib_ucontext;
+struct ib_umem;
+struct invalidation_ctx;
struct ib_peer_memory_client {
const struct peer_memory_client *peer_mem;
@@ -43,6 +45,16 @@ struct ib_peer_memory_client {
int invalidation_required;
struct kref ref;
struct completion unload_comp;
+ /* lock is used via the invalidation flow */
+ struct mutex lock;
+ struct list_head core_ticket_list;
+ u64 last_ticket;
+};
+
+struct core_ticket {
+ unsigned long key;
+ void *context;
+ struct list_head ticket_list;
};
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
@@ -53,4 +65,11 @@ struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
void *peer_client_context);
+int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct ib_umem *umem,
+ struct invalidation_ctx **invalidation_ctx);
+
+void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
+ struct invalidation_ctx *ctx);
+
#endif
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index bb760f9..5d0fb41 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -43,6 +43,13 @@
struct ib_ucontext;
struct ib_umem_odp;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+struct invalidation_ctx {
+ struct ib_umem *umem;
+ u64 context_ticket;
+};
+#endif
+
struct ib_umem {
struct ib_ucontext *context;
size_t length;
@@ -61,6 +68,7 @@ struct ib_umem {
#ifdef CONFIG_INFINIBAND_PEER_MEM
/* peer memory that manages this umem */
struct ib_peer_memory_client *ib_peer_mem;
+ struct invalidation_ctx *invalidation_ctx;
/* peer memory private context */
void *peer_mem_client_context;
#endif
--
1.8.4.3
* [RFC 5/7] IB/core: Invalidation support for peer memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (3 preceding siblings ...)
2016-02-11 16:12 ` [RFC 4/7] IB/core: Infrastructure to manage peer core context Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 6/7] IB/core: Peer memory client for IO memory Artemy Kovalyov
` (2 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds the required functionality to invalidate a given peer
memory represented by some core context.
Each umem that was built over peer memory and supports invalidation has
an invalidation context assigned to it with the required management data.
Once the peer calls the invalidation callback, the below actions are
taken:
1) Take the lock on the peer client to sync with an inflight dereg_mr on that
memory.
2) Once the lock is taken, look up the ticket id to find the matching
core context.
3) In case it is found, call the umem invalidation function; otherwise the
call returns.
Some notes:
1) As the peer invalidate callback is defined to be blocking, it must return
only once the pages are not going to be accessed any more (see the sketch
below). For that reason ib_invalidate_peer_memory waits for a completion
event in case there is another inflight call coming as part of dereg_mr.
2) The peer memory API assumes that a lock might be taken by a peer
client to protect its memory operations. Specifically, its invalidate
callback might be called under that lock, which may lead to an AB/BA
deadlock in case the IB core calls the get/put pages APIs with the IB core
peer's lock taken. For that reason, as part of
ib_umem_activate_invalidation_notifier the lock is taken and some inflight
invalidation state is checked before activating the notifier.
3) Once a peer client admits as part of its registration that it may
require invalidation support, it can't be an owner of a memory range
which doesn't support it.
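A hedged sketch of the peer-client side of note (1): the eviction path calls
the invalidate callback obtained at registration time and may reuse the pages
only after it returns. my_invalidate_cb, my_reg_handle, struct my_gpu_range and
my_gpu_reuse_pages() are hypothetical names (see the registration sketch in the
cover letter):

static void my_gpu_evict(struct my_gpu_range *range)
{
	/* Blocks until the HCA has stopped accessing this mapping; a stale
	 * ticket (registration already released) is silently ignored by
	 * IB core and the call returns immediately.
	 */
	my_invalidate_cb(my_reg_handle, range->core_context);

	/* Safe to repurpose the device pages now. */
	my_gpu_reuse_pages(range);
}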
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/peer_mem.c | 85 ++++++++++++++++++++++++++++++++++++--
drivers/infiniband/core/umem.c | 56 +++++++++++++++++++++----
include/rdma/ib_peer_mem.h | 1 +
include/rdma/ib_umem.h | 19 +++++++++
4 files changed, 148 insertions(+), 13 deletions(-)
diff --git a/drivers/infiniband/core/peer_mem.c b/drivers/infiniband/core/peer_mem.c
index 57afb76..f9aaef2 100644
--- a/drivers/infiniband/core/peer_mem.c
+++ b/drivers/infiniband/core/peer_mem.c
@@ -37,9 +37,56 @@
static DEFINE_MUTEX(peer_memory_mutex);
static LIST_HEAD(peer_memory_list);
+/* Caller should be holding the peer client lock, ib_peer_client->lock */
+static struct core_ticket *ib_peer_search_context(
+ struct ib_peer_memory_client *ib_peer_client,
+ u64 key)
+{
+ struct core_ticket *core_ticket;
+
+ list_for_each_entry(core_ticket, &ib_peer_client->core_ticket_list,
+ ticket_list) {
+ if (core_ticket->key == key)
+ return core_ticket;
+ }
+
+ return NULL;
+}
+
static int ib_invalidate_peer_memory(void *reg_handle, u64 core_context)
{
- return -ENOSYS;
+ struct ib_peer_memory_client *ib_peer_client = reg_handle;
+ struct invalidation_ctx *invalidation_ctx;
+ struct core_ticket *core_ticket;
+
+ mutex_lock(&ib_peer_client->lock);
+ core_ticket = ib_peer_search_context(ib_peer_client, core_context);
+ if (!core_ticket) {
+ mutex_unlock(&ib_peer_client->lock);
+ return 0;
+ }
+
+ invalidation_ctx = (struct invalidation_ctx *)core_ticket->context;
+ /* If context is not ready yet, mark it to be invalidated */
+ if (!invalidation_ctx->func) {
+ invalidation_ctx->peer_invalidated = 1;
+ mutex_unlock(&ib_peer_client->lock);
+ return 0;
+ }
+ invalidation_ctx->func(invalidation_ctx->cookie,
+ invalidation_ctx->umem, 0, 0);
+ if (invalidation_ctx->inflight_invalidation) {
+ /* init the completion to wait on
+ * before letting the other thread run
+ */
+ init_completion(&invalidation_ctx->comp);
+ mutex_unlock(&ib_peer_client->lock);
+ wait_for_completion(&invalidation_ctx->comp);
+ }
+
+ kfree(invalidation_ctx);
+
+ return 0;
}
static int ib_peer_insert_context(struct ib_peer_memory_client *ib_peer_client,
@@ -122,11 +169,33 @@ int ib_peer_create_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
void ib_peer_destroy_invalidation_ctx(struct ib_peer_memory_client *ib_peer_mem,
struct invalidation_ctx *invalidation_ctx)
{
- mutex_lock(&ib_peer_mem->lock);
+ int peer_callback;
+ int inflight_invalidation;
+
+ /* If we are under the peer callback, the lock was already taken. */
+ if (!invalidation_ctx->peer_callback)
+ mutex_lock(&ib_peer_mem->lock);
ib_peer_remove_context(ib_peer_mem, invalidation_ctx->context_ticket);
- mutex_unlock(&ib_peer_mem->lock);
+ /* Make sure to check the inflight flag after taking the lock and removing
+ * from the tree. In addition, from that point on use local variables for
+ * peer_callback and inflight_invalidation, as after the complete()
+ * invalidation_ctx can't be accessed any more since it may be freed
+ * by the callback.
+ */
+ peer_callback = invalidation_ctx->peer_callback;
+ inflight_invalidation = invalidation_ctx->inflight_invalidation;
+ if (inflight_invalidation)
+ complete(&invalidation_ctx->comp);
- kfree(invalidation_ctx);
+ /* On peer callback lock is handled externally */
+ if (!peer_callback)
+ mutex_unlock(&ib_peer_mem->lock);
+
+ /* In case under callback context or callback is pending
+ * let it free the invalidation context
+ */
+ if (!peer_callback && !inflight_invalidation)
+ kfree(invalidation_ctx);
}
static void complete_peer(struct kref *kref)
@@ -186,6 +255,7 @@ EXPORT_SYMBOL(ib_unregister_peer_memory_client);
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
unsigned long addr,
size_t size,
+ unsigned long flags,
void **peer_client_context)
{
struct ib_peer_memory_client *ib_peer_client;
@@ -193,6 +263,13 @@ struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
mutex_lock(&peer_memory_mutex);
list_for_each_entry(ib_peer_client, &peer_memory_list, core_peer_list) {
+ /* In case peer requires invalidation it can't own memory
+ * which doesn't support it
+ */
+ if (ib_peer_client->invalidation_required &&
+ (!(flags & IB_UMEM_PEER_INVAL_SUPP)))
+ continue;
+
ret = ib_peer_client->peer_mem->acquire(addr, size,
peer_client_context);
if (ret > 0)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2eab34e..f478f63 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -46,12 +46,19 @@
#ifdef CONFIG_INFINIBAND_PEER_MEM
static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
struct ib_umem *umem, unsigned long addr,
- int dmasync)
+ unsigned long flags)
{
int ret;
const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *ictx = NULL;
umem->ib_peer_mem = ib_peer_mem;
+ if (flags & IB_UMEM_PEER_INVAL_SUPP) {
+ ret = ib_peer_create_invalidation_ctx(ib_peer_mem, umem, &ictx);
+ if (ret)
+ goto end;
+ }
+
/*
* We always request write permissions to the pages, to force breaking
* of any CoW during the registration of the MR. For read-only MRs we
@@ -62,7 +69,7 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
1, !umem->writable,
&umem->sg_head,
umem->peer_mem_client_context,
- 0);
+ ictx ? ictx->context_ticket : 0);
if (ret)
goto out;
@@ -71,7 +78,7 @@ static struct ib_umem *peer_umem_get(struct ib_peer_memory_client *ib_peer_mem,
ret = peer_mem->dma_map(&umem->sg_head,
umem->peer_mem_client_context,
umem->context->device->dma_device,
- dmasync,
+ flags & IB_UMEM_DMA_SYNC,
&umem->nmap);
if (ret)
goto put_pages;
@@ -82,23 +89,54 @@ put_pages:
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
out:
+ if (ictx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, ictx);
+end:
ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
return ERR_PTR(ret);
}
static void peer_umem_release(struct ib_umem *umem)
{
- const struct peer_memory_client *peer_mem =
- umem->ib_peer_mem->peer_mem;
+ struct ib_peer_memory_client *ib_peer_mem = umem->ib_peer_mem;
+ const struct peer_memory_client *peer_mem = ib_peer_mem->peer_mem;
+ struct invalidation_ctx *ictx = umem->invalidation_ctx;
+
+ if (ictx)
+ ib_peer_destroy_invalidation_ctx(ib_peer_mem, ictx);
peer_mem->dma_unmap(&umem->sg_head,
umem->peer_mem_client_context,
umem->context->device->dma_device);
peer_mem->put_pages(&umem->sg_head,
umem->peer_mem_client_context);
- ib_put_peer_client(umem->ib_peer_mem, umem->peer_mem_client_context);
+ ib_put_peer_client(ib_peer_mem, umem->peer_mem_client_context);
kfree(umem);
}
+
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ void (*func)(void *cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size),
+ void *cookie)
+{
+ struct invalidation_ctx *ictx = umem->invalidation_ctx;
+ int ret = 0;
+
+ mutex_lock(&umem->ib_peer_mem->lock);
+ if (ictx->peer_invalidated) {
+ pr_err("ib_umem_activate_invalidation_notifier: pages were invalidated by peer\n");
+ ret = -EINVAL;
+ goto end;
+ }
+ ictx->func = func;
+ ictx->cookie = cookie;
+ /* from that point any pending invalidations can be called */
+end:
+ mutex_unlock(&umem->ib_peer_mem->lock);
+ return ret;
+}
+EXPORT_SYMBOL(ib_umem_activate_invalidation_notifier);
#endif
static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
@@ -209,15 +247,15 @@ struct ib_umem *ib_umem_get_flags(struct ib_ucontext *context,
struct ib_umem *peer_umem;
peer_mem_client =
- ib_get_peer_client(context, addr, size,
+ ib_get_peer_client(context, addr, size, flags,
&umem->peer_mem_client_context);
if (IS_ERR(peer_mem_client)) {
kfree(umem);
return ERR_CAST(peer_mem_client);
} else if (peer_mem_client) {
- peer_umem = peer_umem_get(peer_mem_client, umem, addr,
- flags & IB_UMEM_DMA_SYNC);
+ peer_umem = peer_umem_get(peer_mem_client, umem,
+ addr, flags);
if (IS_ERR(peer_umem))
kfree(umem);
return peer_umem;
diff --git a/include/rdma/ib_peer_mem.h b/include/rdma/ib_peer_mem.h
index 6f3dc84..d2b2d5f 100644
--- a/include/rdma/ib_peer_mem.h
+++ b/include/rdma/ib_peer_mem.h
@@ -60,6 +60,7 @@ struct core_ticket {
struct ib_peer_memory_client *ib_get_peer_client(struct ib_ucontext *context,
unsigned long addr,
size_t size,
+ unsigned long flags,
void **peer_client_context);
void ib_put_peer_client(struct ib_peer_memory_client *ib_peer_client,
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 5d0fb41..002da1e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -42,11 +42,20 @@
struct ib_ucontext;
struct ib_umem_odp;
+struct ib_umem;
#ifdef CONFIG_INFINIBAND_PEER_MEM
struct invalidation_ctx {
struct ib_umem *umem;
u64 context_ticket;
+ void (*func)(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size);
+ void *cookie;
+ int peer_callback;
+ int inflight_invalidation;
+ int peer_invalidated;
+ struct completion comp;
};
#endif
@@ -100,6 +109,7 @@ static inline size_t ib_umem_num_pages(struct ib_umem *umem)
enum ib_peer_mem_flags {
IB_UMEM_DMA_SYNC = (1 << 0),
IB_UMEM_PEER_ALLOW = (1 << 1),
+ IB_UMEM_PEER_INVAL_SUPP = (1 << 2),
};
#ifdef CONFIG_INFINIBAND_USER_MEM
@@ -112,6 +122,14 @@ int ib_umem_page_count(struct ib_umem *umem);
int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offset,
size_t length);
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+int ib_umem_activate_invalidation_notifier(struct ib_umem *umem,
+ void (*func)(void *cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size),
+ void *cookie);
+#endif
+
#else /* CONFIG_INFINIBAND_USER_MEM */
#include <linux/err.h>
@@ -129,6 +147,7 @@ static inline int ib_umem_copy_from(void *dst, struct ib_umem *umem, size_t offs
size_t length) {
return -EINVAL;
}
+
#endif /* CONFIG_INFINIBAND_USER_MEM */
static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
--
1.8.4.3
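For reference, a minimal sketch of how a low-level hardware driver is expected to consume the invalidation API added above (illustrative only, not part of the patch; my_invalidate_umem(), my_cookie and the surrounding function are placeholder names -- the mlx5 changes in patch 7/7 follow the same pattern):

/*
 * Illustrative sketch only: registering a umem that may live on peer
 * memory which requires invalidation support.
 */
static void my_invalidate_umem(void *cookie, struct ib_umem *umem,
                               unsigned long addr, size_t size)
{
        /* Stop all HW access to this umem before returning; once the
         * callback returns, the peer may reclaim the pages. */
}

static int my_reg_umem(struct ib_ucontext *context, unsigned long start,
                       size_t length, int access_flags, void *my_cookie)
{
        struct ib_umem *umem;
        int err;

        umem = ib_umem_get_flags(context, start, length, access_flags,
                                 IB_UMEM_PEER_ALLOW |
                                 IB_UMEM_PEER_INVAL_SUPP);
        if (IS_ERR(umem))
                return PTR_ERR(umem);

        if (umem->ib_peer_mem) {
                /* Arm the callback before the MR becomes visible. */
                err = ib_umem_activate_invalidation_notifier(umem,
                                                             my_invalidate_umem,
                                                             my_cookie);
                if (err) {
                        ib_umem_release(umem);
                        return err;
                }
        }
        /* ... build the HW MR over the umem as usual ... */
        return 0;
}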
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [RFC 6/7] IB/core: Peer memory client for IO memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (4 preceding siblings ...)
2016-02-11 16:12 ` [RFC 5/7] IB/core: Invalidation support for peer memory Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 16:12 ` [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory Artemy Kovalyov
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds a kernel module allowing RDMA transfers with various types of memory,
like mmaped devices (that create PFN mappings) and mmaped files from DAX
filesystems, such as those that reside on NVRAM devices.
The module's original source is available at [1].
[1] https://github.com/sbates130272/io_peer_mem
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 9 +
drivers/infiniband/core/Makefile | 4 +
drivers/infiniband/core/io_peer_mem.c | 332 ++++++++++++++++++++++++++++++++++
3 files changed, 345 insertions(+)
create mode 100644 drivers/infiniband/core/io_peer_mem.c
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 2837d66..90811b2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -74,6 +74,15 @@ config INFINIBAND_PEER_MEM
memory in external hardware devices, such as GPU cards, SSD based
storage, dedicated ASIC accelerators, etc.
+config INFINIBAND_IO_PEER_MEM
+ tristate "InfiniBand Peer memory client for IO memory"
+ depends on INFINIBAND_PEER_MEM
+ default y
+ ---help---
+ Adds kernel module allowing RDMA transfers with various types of memory
+ like mmaped devices (that create PFN mappings) and mmaped files from DAX
+ filesystems, such as those that reside on NVRAM devices.
+
source "drivers/infiniband/hw/mthca/Kconfig"
source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 9882d00..529f1e9 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
$(user_access-y)
+obj-$(CONFIG_INFINIBAND_IO_PEER_MEM) += ib_io_mr.o
ib_core-y := packer.o ud_header.o verbs.o cq.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o \
@@ -36,3 +37,6 @@ ib_umad-y := user_mad.o
ib_ucm-y := ucm.o
ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o
+
+ib_io_mr-y := io_peer_mem.o
+
diff --git a/drivers/infiniband/core/io_peer_mem.c b/drivers/infiniband/core/io_peer_mem.c
new file mode 100644
index 0000000..df7161c
--- /dev/null
+++ b/drivers/infiniband/core/io_peer_mem.c
@@ -0,0 +1,332 @@
+/*
+ * Copyright (c) 2015 PMC-Sierra Inc.
+ * Copyright (c) 2016 Mellanox Technologies, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/peer_mem.h>
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mmu_notifier.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+
+MODULE_AUTHOR("Logan Gunthorpe");
+MODULE_DESCRIPTION("MMAP'd IO memory plug-in");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static void *reg_handle;
+static int (*mem_invalidate_callback)(void *reg_handle, u64 core_context);
+
+struct context {
+ unsigned long addr;
+ size_t size;
+ u64 core_context;
+ struct mmu_notifier mn;
+ struct pid *pid;
+ int active;
+ struct work_struct cleanup_work;
+ struct mutex mmu_mutex;
+};
+
+static void do_invalidate(struct context *ctx)
+{
+ mutex_lock(&ctx->mmu_mutex);
+
+ if (!ctx->active)
+ goto unlock_and_return;
+
+ ctx->active = 0;
+ pr_debug("invalidated addr %lx size %zx\n", ctx->addr, ctx->size);
+ mem_invalidate_callback(reg_handle, ctx->core_context);
+
+unlock_and_return:
+ mutex_unlock(&ctx->mmu_mutex);
+}
+
+static void mmu_release(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ do_invalidate(ctx);
+}
+
+static void mmu_invalidate_range(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ if (start >= (ctx->addr + ctx->size) || ctx->addr >= end)
+ return;
+
+ pr_debug("mmu_invalidate_range %lx-%lx\n", start, end);
+ do_invalidate(ctx);
+}
+
+static void mmu_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address)
+{
+ struct context *ctx = container_of(mn, struct context, mn);
+
+ if (address < ctx->addr || address >= (ctx->addr + ctx->size))
+ return;
+
+ pr_debug("mmu_invalidate_page %lx\n", address);
+ do_invalidate(ctx);
+}
+
+static struct mmu_notifier_ops mmu_notifier_ops = {
+ .release = mmu_release,
+ .invalidate_range = mmu_invalidate_range,
+ .invalidate_page = mmu_invalidate_page,
+};
+
+static void fault_missing_pages(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned long pfn;
+
+ if (!(vma->vm_flags & VM_MIXEDMAP))
+ return;
+
+ for (; start < end; start += PAGE_SIZE) {
+ if (!follow_pfn(vma, start, &pfn))
+ continue;
+
+ handle_mm_fault(current->mm, vma, start, FAULT_FLAG_WRITE);
+ }
+}
+
+static int acquire(unsigned long addr, size_t size, void **context)
+{
+ struct vm_area_struct *vma;
+ struct context *ctx;
+ unsigned long pfn, end;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return 0;
+
+ ctx->addr = addr;
+ ctx->size = size;
+ ctx->active = 0;
+
+ end = addr + size;
+ vma = find_vma(current->mm, addr);
+
+ if (!vma || vma->vm_end < end)
+ goto err;
+
+ pr_debug("vma: %lx %lx %lx %zx\n", addr, vma->vm_end - vma->vm_start,
+ vma->vm_flags, size);
+
+ if (!(vma->vm_flags & VM_WRITE))
+ goto err;
+
+ fault_missing_pages(vma, addr & PAGE_MASK, end);
+
+ if (follow_pfn(vma, addr, &pfn))
+ goto err;
+
+ pr_debug("pfn: %lx\n", pfn << PAGE_SHIFT);
+
+ mutex_init(&ctx->mmu_mutex);
+
+ ctx->mn.ops = &mmu_notifier_ops;
+
+ if (mmu_notifier_register(&ctx->mn, current->mm)) {
+ pr_err("Failed to register mmu_notifier\n");
+ goto err;
+ }
+
+ ctx->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+
+ if (!ctx->pid)
+ goto err;
+
+ pr_debug("acquire %p\n", ctx);
+ *context = ctx;
+ return 1;
+
+err:
+ kfree(ctx);
+ return 0;
+}
+
+static void deferred_cleanup(struct work_struct *work)
+{
+ struct context *ctx = container_of(work, struct context, cleanup_work);
+ struct task_struct *owning_process;
+ struct mm_struct *owning_mm;
+
+ pr_debug("cleanup %p\n", ctx);
+
+ owning_process = get_pid_task(ctx->pid, PIDTYPE_PID);
+ if (owning_process) {
+ owning_mm = get_task_mm(owning_process);
+ if (owning_mm) {
+ mmu_notifier_unregister(&ctx->mn, owning_mm);
+ mmput(owning_mm);
+ }
+ put_task_struct(owning_process);
+ }
+ put_pid(ctx->pid);
+ kfree(ctx);
+}
+
+static void release(void *context)
+{
+ struct context *ctx = context;
+
+ pr_debug("release %p\n", ctx);
+
+ INIT_WORK(&ctx->cleanup_work, deferred_cleanup);
+ schedule_work(&ctx->cleanup_work);
+}
+
+static int get_pages(unsigned long addr, size_t size, int write, int force,
+ struct sg_table *sg_head, void *context,
+ u64 core_context)
+{
+ struct context *ctx = context;
+ int ret;
+
+ ctx->core_context = core_context;
+ ctx->active = 1;
+
+ ret = sg_alloc_table(sg_head, (ctx->size + PAGE_SIZE - 1) / PAGE_SIZE,
+ GFP_KERNEL);
+ return ret;
+}
+
+static void put_pages(struct sg_table *sg_head, void *context)
+{
+ struct context *ctx = context;
+
+ ctx->active = 0;
+ sg_free_table(sg_head);
+}
+
+static int dma_map(struct sg_table *sg_head, void *context,
+ struct device *dma_device, int dmasync,
+ int *nmap)
+{
+ struct scatterlist *sg;
+ struct context *ctx = context;
+ unsigned long pfn;
+ unsigned long addr = ctx->addr;
+ unsigned long size = ctx->size;
+ struct task_struct *owning_process;
+ struct mm_struct *owning_mm;
+ struct vm_area_struct *vma = NULL;
+ int i, ret = 1;
+
+ *nmap = (ctx->size + PAGE_SIZE - 1) / PAGE_SIZE;
+
+ owning_process = get_pid_task(ctx->pid, PIDTYPE_PID);
+ if (!owning_process)
+ return ret;
+
+ owning_mm = get_task_mm(owning_process);
+ if (!owning_mm)
+ goto out;
+
+ vma = find_vma(owning_mm, ctx->addr);
+ if (!vma)
+ goto out;
+
+ for_each_sg(sg_head->sgl, sg, *nmap, i) {
+ sg_set_page(sg, NULL, PAGE_SIZE, 0);
+ ret = follow_pfn(vma, addr, &pfn);
+ if (ret)
+ goto out;
+
+ sg->dma_address = (pfn << PAGE_SHIFT);
+ sg->dma_length = PAGE_SIZE;
+ sg->offset = addr & ~PAGE_MASK;
+
+ pr_debug("sg[%d] %lx %x %d\n", i,
+ (unsigned long)sg->dma_address,
+ sg->dma_length, sg->offset);
+
+ addr += sg->dma_length - sg->offset;
+ size -= sg->dma_length - sg->offset;
+ }
+out:
+ put_task_struct(owning_process);
+
+ return ret;
+}
+
+static int dma_unmap(struct sg_table *sg_head, void *context,
+ struct device *dma_device)
+{
+ return 0;
+}
+
+static unsigned long get_page_size(void *context)
+{
+ return PAGE_SIZE;
+}
+
+static struct peer_memory_client io_mem_client = {
+ .acquire = acquire,
+ .get_pages = get_pages,
+ .dma_map = dma_map,
+ .dma_unmap = dma_unmap,
+ .put_pages = put_pages,
+ .get_page_size = get_page_size,
+ .release = release,
+};
+
+static int __init io_mem_init(void)
+{
+ reg_handle = ib_register_peer_memory_client(&io_mem_client,
+ &mem_invalidate_callback);
+
+ if (!reg_handle)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __exit io_mem_cleanup(void)
+{
+ ib_unregister_peer_memory_client(reg_handle);
+}
+
+module_init(io_mem_init);
+module_exit(io_mem_cleanup);
--
1.8.4.3
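A hypothetical userspace sequence that would exercise this client (assuming the module is loaded and the HCA driver passes IB_UMEM_PEER_ALLOW, as the mlx5 driver does in patch 7/7); the device node, bar_len and the pre-existing protection domain are placeholders:

/* Illustrative only: userspace view of an io_peer_mem registration. */
#include <fcntl.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

static struct ibv_mr *reg_bar_mr(struct ibv_pd *pd, size_t bar_len)
{
        int fd = open("/dev/mydev", O_RDWR);    /* hypothetical device */
        void *bar;

        if (fd < 0)
                return NULL;

        /* A PFN-mapped BAR (or a DAX-mapped file) cannot be pinned by
         * get_user_pages()... */
        bar = mmap(NULL, bar_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED)
                return NULL;

        /* ...so ib_umem_get_flags() in the kernel hands the
         * registration over to the io_mem peer client instead. */
        return ibv_reg_mr(pd, bar, bar_len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
}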
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (5 preceding siblings ...)
2016-02-11 16:12 ` [RFC 6/7] IB/core: Peer memory client for IO memory Artemy Kovalyov
@ 2016-02-11 16:12 ` Artemy Kovalyov
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
7 siblings, 0 replies; 28+ messages in thread
From: Artemy Kovalyov @ 2016-02-11 16:12 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Artemy Kovalyov
Adds the required functionality to work with peer memory
clients which require invalidation support.
It includes:
- moving the module to use ib_umem_get_flags().
- a umem invalidation callback - once called, it should free any HW
resources assigned to that umem, then free the peer resources
corresponding to that umem.
- keeping the MR object related to that umem alive until dereg_mr is
called.
- synchronization between dereg_mr and the invalidation callback.
- advertising the P2P device capability.
Signed-off-by: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/cq.c | 15 ++-
drivers/infiniband/hw/mlx5/doorbell.c | 6 +-
drivers/infiniband/hw/mlx5/main.c | 3 +
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++
drivers/infiniband/hw/mlx5/mr.c | 166 ++++++++++++++++++++++++++--------
drivers/infiniband/hw/mlx5/qp.c | 3 +-
drivers/infiniband/hw/mlx5/srq.c | 4 +-
7 files changed, 163 insertions(+), 46 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 7ddc790..c41913d 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -651,9 +651,11 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
*cqe_size = ucmd.cqe_size;
- cq->buf.umem = ib_umem_get(context, ucmd.buf_addr,
- entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ cq->buf.umem = ib_umem_get_flags(context, ucmd.buf_addr,
+ entries * ucmd.cqe_size,
+ IB_ACCESS_LOCAL_WRITE,
+ IB_UMEM_DMA_SYNC |
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(cq->buf.umem)) {
err = PTR_ERR(cq->buf.umem);
return err;
@@ -991,8 +993,11 @@ static int resize_user(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq,
if (ucmd.reserved0 || ucmd.reserved1)
return -EINVAL;
- umem = ib_umem_get(context, ucmd.buf_addr, entries * ucmd.cqe_size,
- IB_ACCESS_LOCAL_WRITE, 1);
+ umem = ib_umem_get_flags(context, ucmd.buf_addr,
+ entries * ucmd.cqe_size,
+ IB_ACCESS_LOCAL_WRITE,
+ IB_UMEM_DMA_SYNC |
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(umem)) {
err = PTR_ERR(umem);
return err;
diff --git a/drivers/infiniband/hw/mlx5/doorbell.c b/drivers/infiniband/hw/mlx5/doorbell.c
index a0e4e6d..1d76094 100644
--- a/drivers/infiniband/hw/mlx5/doorbell.c
+++ b/drivers/infiniband/hw/mlx5/doorbell.c
@@ -63,8 +63,10 @@ int mlx5_ib_db_map_user(struct mlx5_ib_ucontext *context, unsigned long virt,
page->user_virt = (virt & PAGE_MASK);
page->refcnt = 0;
- page->umem = ib_umem_get(&context->ibucontext, virt & PAGE_MASK,
- PAGE_SIZE, 0, 0);
+ page->umem = ib_umem_get_flags(&context->ibucontext,
+ virt & PAGE_MASK,
+ PAGE_SIZE, 0,
+ IB_UMEM_PEER_ALLOW);
if (IS_ERR(page->umem)) {
err = PTR_ERR(page->umem);
kfree(page);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index a55bf05..f198208 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -475,6 +475,9 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
IB_DEVICE_PORT_ACTIVE_EVENT |
IB_DEVICE_SYS_IMAGE_GUID |
IB_DEVICE_RC_RNR_NAK_GEN;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ props->device_cap_flags |= IB_DEVICE_PEER_MEMORY;
+#endif
if (MLX5_CAP_GEN(mdev, pkv))
props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d475f83..12571e0 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -376,6 +376,13 @@ enum mlx5_ib_mtt_access_flags {
#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+struct mlx5_ib_peer_id {
+ struct completion comp;
+ struct mlx5_ib_mr *mr;
+};
+#endif
+
struct mlx5_ib_mr {
struct ib_mr ibmr;
void *descs;
@@ -395,6 +402,11 @@ struct mlx5_ib_mr {
struct mlx5_core_sig_ctx *sig;
int live;
void *descs_alloc;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ struct mlx5_ib_peer_id *peer_id;
+ atomic_t invalidated;
+ struct completion invalidation_comp;
+#endif
};
struct mlx5_ib_umr_context {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6000f7a..d34199c 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1037,6 +1037,73 @@ err_1:
return ERR_PTR(err);
}
+static int mlx5_ib_invalidate_mr(struct ib_mr *ibmr)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
+ struct mlx5_ib_mr *mr = to_mmr(ibmr);
+ int npages = mr->npages;
+ struct ib_umem *umem = mr->umem;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ if (umem && umem->odp_data) {
+ /* Prevent new page faults from succeeding */
+ mr->live = 0;
+ /* Wait for all running page-fault handlers to finish. */
+ synchronize_srcu(&dev->mr_srcu);
+ /* Destroy all page mappings */
+ mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
+ ib_umem_end(umem));
+ /*
+ * We kill the umem before the MR for ODP,
+ * so that there will not be any invalidations in
+ * flight, looking at the *mr struct.
+ */
+ ib_umem_release(umem);
+ atomic_sub(npages, &dev->mdev->priv.reg_pages);
+
+ /* Avoid double-freeing the umem. */
+ umem = NULL;
+ }
+#endif
+
+ clean_mr(mr);
+
+ if (umem) {
+ ib_umem_release(umem);
+ atomic_sub(npages, &dev->mdev->priv.reg_pages);
+ }
+ return 0;
+}
+
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+static void mlx5_invalidate_umem(void *invalidation_cookie,
+ struct ib_umem *umem,
+ unsigned long addr, size_t size)
+{
+ struct mlx5_ib_mr *mr;
+ struct mlx5_ib_peer_id *peer_id = invalidation_cookie;
+
+ wait_for_completion(&peer_id->comp);
+ if (peer_id->mr == NULL)
+ return;
+
+ mr = peer_id->mr;
+ /* This function is called under client peer lock
+ * so its resources are race protected
+ */
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ umem->invalidation_ctx->inflight_invalidation = 1;
+ return;
+ }
+
+ umem->invalidation_ctx->peer_callback = 1;
+ mlx5_ib_invalidate_mr(&mr->ibmr);
+ complete(&mr->invalidation_comp);
+}
+#endif
+
+
+
struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata)
@@ -1049,16 +1116,38 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
int ncont;
int order;
int err;
-
- mlx5_ib_dbg(dev, "start 0x%llx, virt_addr 0x%llx, length 0x%llx, access_flags 0x%x\n",
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ struct ib_peer_memory_client *ib_peer_mem;
+ struct mlx5_ib_peer_id *mlx5_ib_peer_id = NULL;
+#endif
+ mlx5_ib_dbg(dev, "%llx virt %llx len %llx access_flags %x\n",
start, virt_addr, length, access_flags);
- umem = ib_umem_get(pd->uobject->context, start, length, access_flags,
- 0);
+ umem = ib_umem_get_flags(pd->uobject->context, start, length,
+ access_flags, IB_UMEM_PEER_ALLOW |
+ IB_UMEM_PEER_INVAL_SUPP);
if (IS_ERR(umem)) {
mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
return (void *)umem;
}
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ ib_peer_mem = umem->ib_peer_mem;
+ if (ib_peer_mem) {
+ mlx5_ib_peer_id = kzalloc(sizeof(*mlx5_ib_peer_id), GFP_KERNEL);
+ if (!mlx5_ib_peer_id) {
+ err = -ENOMEM;
+ goto error;
+ }
+ init_completion(&mlx5_ib_peer_id->comp);
+ err = ib_umem_activate_invalidation_notifier
+ (umem,
+ mlx5_invalidate_umem,
+ mlx5_ib_peer_id);
+ if (err)
+ goto error;
+ }
+#endif
+
mlx5_ib_cont_pages(umem, start, &npages, &page_shift, &ncont, &order);
if (!npages) {
mlx5_ib_warn(dev, "avoid zero region\n");
@@ -1098,6 +1187,15 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
atomic_add(npages, &dev->mdev->priv.reg_pages);
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ atomic_set(&mr->invalidated, 0);
+ if (ib_peer_mem) {
+ init_completion(&mr->invalidation_comp);
+ mlx5_ib_peer_id->mr = mr;
+ mr->peer_id = mlx5_ib_peer_id;
+ complete(&mlx5_ib_peer_id->comp);
+ }
+#endif
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
if (umem->odp_data) {
@@ -1127,6 +1225,12 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
return &mr->ibmr;
error:
+#ifdef CONFIG_INFINIBAND_PEER_MEM
+ if (mlx5_ib_peer_id) {
+ complete(&mlx5_ib_peer_id->comp);
+ kfree(mlx5_ib_peer_id);
+ }
+#endif
ib_umem_release(umem);
return ERR_PTR(err);
}
@@ -1245,54 +1349,44 @@ static int clean_mr(struct mlx5_ib_mr *mr)
mlx5_ib_warn(dev, "failed unregister\n");
return err;
}
- free_cached_mr(dev, mr);
}
- if (!umred)
- kfree(mr);
-
return 0;
}
int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
{
+#ifdef CONFIG_INFINIBAND_PEER_MEM
struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
struct mlx5_ib_mr *mr = to_mmr(ibmr);
- int npages = mr->npages;
- struct ib_umem *umem = mr->umem;
+ int ret = 0;
+ int umred = mr->umred;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
- if (umem && umem->odp_data) {
- /* Prevent new page faults from succeeding */
- mr->live = 0;
- /* Wait for all running page-fault handlers to finish. */
- synchronize_srcu(&dev->mr_srcu);
- /* Destroy all page mappings */
- mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
- ib_umem_end(umem));
- /*
- * We kill the umem before the MR for ODP,
- * so that there will not be any invalidations in
- * flight, looking at the *mr struct.
+ if (atomic_inc_return(&mr->invalidated) > 1) {
+ /* In case there is inflight invalidation
+ * call pending for its termination
*/
- ib_umem_release(umem);
- atomic_sub(npages, &dev->mdev->priv.reg_pages);
-
- /* Avoid double-freeing the umem. */
- umem = NULL;
+ wait_for_completion(&mr->invalidation_comp);
+ } else {
+ ret = mlx5_ib_invalidate_mr(ibmr);
+ if (ret)
+ return ret;
}
-#endif
-
- clean_mr(mr);
-
- if (umem) {
- ib_umem_release(umem);
- atomic_sub(npages, &dev->mdev->priv.reg_pages);
+ kfree(mr->peer_id);
+ mr->peer_id = NULL;
+ if (umred) {
+ atomic_set(&mr->invalidated, 0);
+ free_cached_mr(dev, mr);
+ } else {
+ kfree(mr);
}
-
return 0;
+#else
+ return mlx5_ib_invalidate_mr(ibmr);
+#endif
}
+
struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
enum ib_mr_type mr_type,
u32 max_num_sg)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8fb9c27..5e07909 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -611,7 +611,8 @@ static int mlx5_ib_umem_get(struct mlx5_ib_dev *dev,
{
int err;
- *umem = ib_umem_get(pd->uobject->context, addr, size, 0, 0);
+ *umem = ib_umem_get_flags(pd->uobject->context, addr, size,
+ 0, IB_UMEM_PEER_ALLOW);
if (IS_ERR(*umem)) {
mlx5_ib_dbg(dev, "umem_get failed\n");
return PTR_ERR(*umem);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 4659256..23b8ab4 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -115,8 +115,8 @@ static int create_srq_user(struct ib_pd *pd, struct mlx5_ib_srq *srq,
srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
- srq->umem = ib_umem_get(pd->uobject->context, ucmd.buf_addr, buf_size,
- 0, 0);
+ srq->umem = ib_umem_get_flags(pd->uobject->context, ucmd.buf_addr,
+ buf_size, 0, IB_UMEM_PEER_ALLOW);
if (IS_ERR(srq->umem)) {
mlx5_ib_dbg(dev, "failed umem get, size %d\n", buf_size);
err = PTR_ERR(srq->umem);
--
1.8.4.3
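To summarize the synchronization scheme above (a condensed paraphrase of the diff, not new code): whichever path increments mr->invalidated first performs the actual teardown, and the other path waits for or defers to it.

/* Peer invalidation callback (mlx5_invalidate_umem), running under
 * the peer client lock: */
if (atomic_inc_return(&mr->invalidated) > 1) {
        /* dereg_mr won the race; just record the in-flight state. */
        umem->invalidation_ctx->inflight_invalidation = 1;
} else {
        umem->invalidation_ctx->peer_callback = 1;
        mlx5_ib_invalidate_mr(&mr->ibmr);       /* tear down HW + umem */
        complete(&mr->invalidation_comp);
}

/* mlx5_ib_dereg_mr: */
if (atomic_inc_return(&mr->invalidated) > 1)
        wait_for_completion(&mr->invalidation_comp);    /* callback won */
else
        mlx5_ib_invalidate_mr(ibmr);                    /* dereg won */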
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (6 preceding siblings ...)
2016-02-11 16:12 ` [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory Artemy Kovalyov
@ 2016-02-11 19:18 ` Jason Gunthorpe
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-14 14:27 ` Haggai Eran
7 siblings, 2 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2016-02-11 19:18 UTC (permalink / raw)
To: Artemy Kovalyov
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w
On Thu, Feb 11, 2016 at 06:12:50PM +0200, Artemy Kovalyov wrote:
> Recently introduced ZONE_DEVICE patch [1] allows to register devices as
> providers of "device memory" regions, making RDMA operation with them
> transparantly available. This patch is intended for scenarios which not fit
> into ZONE_DEVICE infrastrcture, but device still want to exposure it's
> IO regions to RDMA access.
Most of this stuff (e.g. the whole peer_memory_client thing) has no
business being part of the RDMA stack. We can't ask other drivers to
call IB functions just because their devices have ZONE_DEVICE memory.
Resubmit those parts under the mm subsystem, or another more
appropriate place.
This is the same comment as last time this was submitted.
If you want to make some incremental progress then implement the
existing ZONE_DEVICE API for the IB core and add the invalidate stuff
later, once you've negotiated a common API for that with linux-mm.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-12 20:13 ` Christoph Hellwig
[not found] ` <20160212201328.GA14122-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-02-14 14:05 ` Haggai Eran
1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-02-12 20:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w
On Thu, Feb 11, 2016 at 12:18:38PM -0700, Jason Gunthorpe wrote:
> Most of this stuff (e.g. the whole peer_memory_client thing) has no
> business being part of the RDMA stack. We can't ask other drivers to
> call IB functions just because their devices have ZONE_DEVICE memory.
>
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
>
> This is the same comment as last time this was submitted.
>
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
Agreed on all points. Nevermind that seems to be missing a user of all
the newly added infrastructure, which should come with the series.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160212201328.GA14122-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2016-02-12 20:36 ` Jason Gunthorpe
[not found] ` <20160212203649.GA10540-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-14 14:09 ` Haggai Eran
1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2016-02-12 20:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w
On Fri, Feb 12, 2016 at 12:13:28PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 11, 2016 at 12:18:38PM -0700, Jason Gunthorpe wrote:
> > Most of this stuff (e.g. the whole peer_memory_client thing) has no
> > business being part of the RDMA stack. We can't ask other drivers to
> > call IB functions just because their devices have ZONE_DEVICE memory.
> >
> > Resubmit those parts under the mm subsystem, or another more
> > appropriate place.
> >
> > This is the same comment as last time this was submitted.
> >
> > If you want to make some incremental progress then implement the
> > existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> > later, once you've negotiated a common API for that with linux-mm.
>
> Agreed on all points. Nevermind that seems to be missing a user of all
> the newly added infrastructure, which should come with the series.
Yep!
I've heard from the people working on NVMe cards that this approach
isn't needed, desired, or correct. I'm deeply skeptical that there
would be an in-kernel user.
AFAIK, this specific interface is only needed to support NVIDIA's
proprietary kernel module, which has some oddball design where these
extra callbacks are needed.
But who knows, maybe someone can think up reason why filesystems might
want to do an invalidate as well.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-12 20:13 ` Christoph Hellwig
@ 2016-02-14 14:05 ` Haggai Eran
1 sibling, 0 replies; 28+ messages in thread
From: Haggai Eran @ 2016-02-14 14:05 UTC (permalink / raw)
To: Jason Gunthorpe, Kovalyov Artemy
Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org, Sagi Grimberg
On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
We want the feedback from linux-mm, and they are Cced.
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
So there are a couple of issues we currently have with ZONE_DEVICE.
Perhaps they can be solved and then we could use it directly.
First, I'm not sure it is intended to be used for our purpose.
memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable.
Does this apply also to devm_memremap_pages()? Because the HCA BAR
clearly doesn't fall under this definition.
Second, there's a requirement that ZONE_DEVICE ranges are aligned to a
section boundary, right? We have devices that have 8MB or 32MB BARs,
so they won't work with 128MB sections on x86_64.
Third, I understand there was a desire to place ZONE_DEVICE page structs
in the device itself. This can work for pmem, but obviously won't work
for an I/O device BAR like an HCA.
Regards,
Haggai
[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160212201328.GA14122-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-02-12 20:36 ` Jason Gunthorpe
@ 2016-02-14 14:09 ` Haggai Eran
1 sibling, 0 replies; 28+ messages in thread
From: Haggai Eran @ 2016-02-14 14:09 UTC (permalink / raw)
To: Christoph Hellwig, Jason Gunthorpe
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
sagig-VPRAkNaXOzVWk0Htik3J/w
On 12/02/2016 22:13, Christoph Hellwig wrote:
> Agreed on all points. Nevermind that seems to be missing a user of all
> the newly added infrastructure, which should come with the series.
Well, in this submission we add the ib_io_mr module (patch 6) that is
intended to be a user of this API.
We have prototype user space code using it to expose an HCA as a peer to
another HCA, thus allowing remote operation of one HCA through RDMA
operations coming in from the other HCA.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-14 14:27 ` Haggai Eran
2016-02-16 18:22 ` Jason Gunthorpe
1 sibling, 1 reply; 28+ messages in thread
From: Haggai Eran @ 2016-02-14 14:27 UTC (permalink / raw)
To: Jason Gunthorpe, Kovalyov Artemy
Cc: dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg
[apologies: sending again because linux-mm address was wrong]
On 11/02/2016 21:18, Jason Gunthorpe wrote:
> Resubmit those parts under the mm subsystem, or another more
> appropriate place.
We want the feedback from linux-mm, and they are now Cced.
> If you want to make some incremental progress then implement the
> existing ZONE_DEVICE API for the IB core and add the invalidate stuff
> later, once you've negotiated a common API for that with linux-mm.
So there are a couple of issues we currently have with ZONE_DEVICE.
Perhaps they can be solved and then we could use it directly.
First, I'm not sure it is intended to be used for our purpose.
memremap() has this comment [1]:
> memremap() is "ioremap" for cases where it is known that the resource
> being mapped does not have i/o side effects and the __iomem
> annotation is not applicable.
Does this apply also to devm_memremap_pages()? Because the HCA BAR
clearly doesn't fall under this definition.
Second, there's a requirement that ZONE_DEVICE ranges are aligned to a
section boundary, right? We have devices that have 8MB or 32MB BARs,
so they won't work with 128MB sections on x86_64.
Third, I understand there was a desire to place ZONE_DEVICE page structs
in the device itself. This can work for pmem, but obviously won't work
for an I/O device BAR like an HCA.
Regards,
Haggai
[1] http://lxr.free-electrons.com/source/kernel/memremap.c?v=4.4#L38
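For concreteness, a rough sketch of what registering an HCA BAR through ZONE_DEVICE might look like, and where the section-alignment concern bites (illustrative only; the four-argument devm_memremap_pages() form is my reading of the 4.5-era API, and pdev/bar_nr are placeholders):

/* Illustrative only: treating a (hypothetical) HCA BAR as ZONE_DEVICE
 * memory.  With SECTION_SIZE_BITS == 27 on x86_64 (128MB sections),
 * an 8MB or 32MB BAR does not satisfy the alignment assumption
 * discussed above. */
struct resource *res = &pdev->resource[bar_nr];
void *addr;

if (!IS_ALIGNED(res->start, 1UL << SECTION_SIZE_BITS) ||
    !IS_ALIGNED(resource_size(res), 1UL << SECTION_SIZE_BITS))
        dev_warn(&pdev->dev, "BAR is not section aligned\n");

addr = devm_memremap_pages(&pdev->dev, res, NULL, NULL);
if (IS_ERR(addr))
        return PTR_ERR(addr);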
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160212203649.GA10540-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-02-14 15:25 ` Sagi Grimberg
[not found] ` <56C09C7E.4060808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2016-02-14 15:25 UTC (permalink / raw)
To: Jason Gunthorpe, Christoph Hellwig
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-mm-u79uwXL29TY76Z2rM5mHXA, leon-Dj0/KMMU01E,
haggaie-VPRAkNaXOzVWk0Htik3J/w, sagig-VPRAkNaXOzVWk0Htik3J/w,
Stephen Bates
>> Agreed on all points. Nevermind that seems to be missing a user of all
>> the newly added infrastructure, which should come with the series.
>
> Yep!
>
> I've heard from the people working on NVMe cards that this approach
> isn't needed, desired, or correct. I'm deeply skeptical that there
> would be an in-kernel user.
Hey Jason,
So regarding NVMe, a possible user for this would be a user-space NVMe
target implementation which wants to use a CMB (controller memory
buffer on the device BAR) as an intermediate memory for RDMA. If the
NVMe device unplugs we can't rely on user-space to handle it, and need a
way to tell the HCA to tear down the CMB mappings.
I do think that this problem is solvable if we create an mmu_notifier
client for this mapping (assuming that device unplug will trigger mmu
invalidate callouts on the relevant pages). But I'm still not exactly
sure how ZONE_DEVICE works with an I/O device BAR, though (I'd appreciate
it if someone could educate me here).
CC'ing sbates who played with this stuff at some point...
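A rough sketch of the mmu_notifier-based teardown described above (illustrative only; cmb_ctx and nvme_rdma_teardown_cmb_mappings() are hypothetical names):

/* Illustrative only: an mmu_notifier client whose release hook drops
 * the HCA's mappings of the CMB when the owning mm goes away; device
 * unplug would need to be wired into the same teardown path. */
struct cmb_ctx {
        struct mmu_notifier mn;
        /* ... bookkeeping for the registered CMB region ... */
};

static void cmb_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct cmb_ctx *ctx = container_of(mn, struct cmb_ctx, mn);

        /* Tell the HCA to stop targeting the CMB before the BAR
         * mapping disappears. */
        nvme_rdma_teardown_cmb_mappings(ctx);
}

static const struct mmu_notifier_ops cmb_mn_ops = {
        .release = cmb_mmu_release,
};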
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-14 14:27 ` Haggai Eran
@ 2016-02-16 18:22 ` Jason Gunthorpe
2016-02-17 4:03 ` davide rossetti
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2016-02-16 18:22 UTC (permalink / raw)
To: Haggai Eran
Cc: Kovalyov Artemy, dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg
On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> [apologies: sending again because linux-mm address was wrong]
>
> On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > Resubmit those parts under the mm subsystem, or another more
> > appropriate place.
>
> We want the feedback from linux-mm, and they are now Cced.
Resubmit to mm means put this stuff someplace outside
drivers/infiniband in the tree and don't try and inappropriately send
memory management stuff through Doug's tree.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-16 18:22 ` Jason Gunthorpe
@ 2016-02-17 4:03 ` davide rossetti
[not found] ` <CAPSaadxbFCOcKV=c3yX7eGw9Wqzn3jvPRZe2LMWYmiQcijT4nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: davide rossetti @ 2016-02-17 4:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
Sagi Grimberg
On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe <
jgunthorpe@obsidianresearch.com> wrote:
> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>
>
Jason,
I beg to differ.
1) I see mm as appropriate for real memory, i.e. something that user-space
apps can pass around.
This is not totally true for BAR memory, for instance as long as CPU
initiated atomic ops are not supported on BAR space of PCIe devices.
OTOT, CPU reading from BAR is awful (BW being abysmal,~10MB/s), while high
BW writing requires use of vector instructions (at least on x86_64).
2) Instead, I see appropriate that two sophisticated devices, like an IB
NIC and a storage/accelerator device, can freely target each other for I/O,
i.e. exchanging peer-to-peer PCIe transactions. And as long as the existing
sophisticated initiators are confined to the RDMA subsystem, that is where
this support belongs to.
On a different note, this reminds me that the current patch set may be
missing a way to disable the use of platform PCIe atomics when the target
is the BAR of a peer device.
--
sincerely,
d.
email: davide DOT rossetti AT gmail DOT com
work: drossetti AT nvidia DOT com
facebook: http://www.facebook.com/dado.rossetti
twitter: @dado_rossetti
skype: d.rossetti
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <CAPSaadxbFCOcKV=c3yX7eGw9Wqzn3jvPRZe2LMWYmiQcijT4nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-02-17 4:13 ` davide rossetti
2016-02-17 4:44 ` Jason Gunthorpe
[not found] ` <CAPSaadx3vNBSxoWuvjrTp2n8_-DVqofttFGZRR+X8zdWwV86nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 2 replies; 28+ messages in thread
From: davide rossetti @ 2016-02-17 4:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Haggai Eran, Kovalyov Artemy,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org, Sagi Grimberg
resending, sorry
On Tue, Feb 16, 2016 at 10:22 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>
> On Sun, Feb 14, 2016 at 04:27:20PM +0200, Haggai Eran wrote:
> > [apologies: sending again because linux-mm address was wrong]
> >
> > On 11/02/2016 21:18, Jason Gunthorpe wrote:
> > > Resubmit those parts under the mm subsystem, or another more
> > > appropriate place.
> >
> > We want the feedback from linux-mm, and they are now Cced.
>
> Resubmit to mm means put this stuff someplace outside
> drivers/infiniband in the tree and don't try and inappropriately send
> memory management stuff through Doug's tree.
>
Jason,
I beg to differ.
1) I see mm as appropriate for real memory, i.e. something that
user-space apps can pass around. This is not totally true for BAR
memory, for instance:
a) as long as CPU initiated atomic ops are not supported on BAR space
of PCIe devices.
b) OTOT, CPU reading from BAR is awful (BW being abysmal,~10MB/s),
while high BW writing requires use of vector instructions (at least on
x86_64).
Bottom line is, BAR mappings are not like plain memory.
2) Instead, I see appropriate that two sophisticated devices, like an
IB NIC and a storage/accelerator device, can freely target each other
for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long
as the existing sophisticated initiators are confined to the RDMA
subsystem, that is where this support belongs to.
On a different note, this reminds me that the current patch set may be
missing a way to disable the use of platform PCIe atomics when the
target is the BAR of a peer device.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-17 4:13 ` davide rossetti
@ 2016-02-17 4:44 ` Jason Gunthorpe
2016-02-17 8:49 ` Christoph Hellwig
[not found] ` <CAPSaadx3vNBSxoWuvjrTp2n8_-DVqofttFGZRR+X8zdWwV86nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2016-02-17 4:44 UTC (permalink / raw)
To: davide rossetti
Cc: Haggai Eran, Kovalyov Artemy, dledford@redhat.com,
linux-rdma@vger.kernel.org, linux-mm@kvack.org, leon@leon.ro,
Sagi Grimberg
On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
> Bottom line is, BAR mappings are not like plain memory.
As I understand it, the actual use of this is in fact when user space
manages to map BAR memory into its address space and attempts to do DMA
from it. So, I'm not sure I agree at all with this assessment.
ie I gather with NVMe the desire is this could happen through the
filesystem with the right open/mmap flags.
So, saying this has nothing to do with core kernel code, or with mm,
is a really big leap.
> 2) Instead, I see appropriate that two sophisticated devices, like an
> IB NIC and a storage/accelerator device, can freely target each
> other
There is nothing special about IB, and no 'sophistication' of the
DMA'ing device is required.
All other DMA devices should be able to target BAR memory. eg TCP TSO,
or storage-to-storage copies from BAR to SCSI immediately come to
mind.
> for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs to.
I would not object to this stuff living in the PCI subsystem, but
living in rdma and having this narrow focus that it should only
work with IB is not good.
> On a different note, this reminds me that the current patch set may be
> missing a way to disable the use of platform PCIe atomics when the
> target is the BAR of a peer device.
There is a general open question with all PCI peer to peer
transactions on how to negotiate all the relevant PCI
parameters. Supported vendor extensions and supported standardized
features seem like just one piece of a larger problem. Again well
outside the scope of IB.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <CAPSaadx3vNBSxoWuvjrTp2n8_-DVqofttFGZRR+X8zdWwV86nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-02-17 8:44 ` Christoph Hellwig
2016-02-17 15:25 ` Haggai Eran
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-02-17 8:44 UTC (permalink / raw)
To: davide rossetti
Cc: Jason Gunthorpe, Haggai Eran, Kovalyov Artemy,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org, Sagi Grimberg
[disclaimer: I've been involved with ZONE_DEVICE support and the pmem
driver and wrote parts of the code and discussed a lot of the tradeoffs
on how we handle I/O to memory in BARs]
On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
> 1) I see mm as appropriate for real memory, i.e. something that
> user-space apps can pass around.
mm is memory management, and this clearly falls under the umbrella,
so it absolutely needs to be under mm/ and reviewed by the linux-mm
crowd.
> This is not totally true for BAR
> memory, for instance:
> a) as long as CPU initiated atomic ops are not supported on BAR space
> of PCIe devices.
> b) OTOT, CPU reading from BAR is awful (BW being abysmal,~10MB/s),
> while high BW writing requires use of vector instructions (at least on
> x86_64).
> Bottom line is, BAR mappings are not like plain memory.
That doesn't change how they are managed. We've always supported mapping
BARs to userspace in various drivers, and the only real news with things
like the pmem driver with DAX or some of the things people want to do
with the NVMe controller memory buffer is that there are much bigger
quantities of it, and:
a) people want to be able to have cachable mappings of various kinds
instead of the old uncachable default.
b) we want to be able to DMA (including RDMA) to the regions in the
BARs.
a) is something that needs smaller amounts in all kinds of areas to be
done properly, but in principle GPU drivers have been doing this forever
using all kinds of hacks.
b) is the real issue. The Linux DMA support code doesn't really operate
on just physical addresses, but on page structures, and we don't
allocate for BARs. We investigated two ways to address this: 1) allow
DMA operations without struct page and 2) create struct page structures
for BARs that we want to be able to use DMA operations on. For various
reasons version 2) was favored and this is how we ended up with
ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
discussions on how we ended up here.
Additional issues like which instructions to use for access build on top
of these basic building blocks.
> 2) Instead, I see appropriate that two sophisticated devices, like an
> IB NIC and a storage/accelerator device, can freely target each other
> for I/O, i.e. exchanging peer-to-peer PCIe transactions. And as long
> as the existing sophisticated initiators are confined to the RDMA
> subsystem, that is where this support belongs to.
It doesn't. There is absolutely nothing RDMA specific here - please
work with the overall community to do the right thing here.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-17 4:44 ` Jason Gunthorpe
@ 2016-02-17 8:49 ` Christoph Hellwig
[not found] ` <20160217084959.GB13616-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2016-02-17 8:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: davide rossetti, Haggai Eran, Kovalyov Artemy,
dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, leon@leon.ro, Sagi Grimberg
On Tue, Feb 16, 2016 at 09:44:17PM -0700, Jason Gunthorpe wrote:
> On Tue, Feb 16, 2016 at 08:13:58PM -0800, davide rossetti wrote:
>
> > Bottom line is, BAR mappings are not like plain memory.
>
> As I understand it, the actual use of this is in fact when user space
> manages to map BAR memory into its address space and attempts to do DMA
> from it. So, I'm not sure I agree at all with this assessment.
>
> ie I gather with NVMe the desire is this could happen through the
> filesystem with the right open/mmap flags.
Lots of confusion here. NVMe is a block device interface - there
is no real point in mapping anything in there to userspace unless
you use an entirely userspace driver through the normal userspace
PCI driver interface. For pmem (which some people confusingly call
NVM) mapping the byte addressable persistent memory to userspace using
DAX makes a lot of sense, and a lot of work around that is going
on currently.
For NVMe 1.2 there is a new feature called the controller memory
buffer, which basically is a giant BAR that can be used instead
of host memory for the submission and completion queues of the
device, as well as for actual data sent to and received from the device.
Some people are talking about using this as the target of RDMA
operations, but I don't think this patch series would be anywhere
near useful for this mode of operation.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC 0/7] Peer-direct memory
2016-02-17 8:44 ` Christoph Hellwig
@ 2016-02-17 15:25 ` Haggai Eran
[not found] ` <56C490DF.1090100-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: Haggai Eran @ 2016-02-17 15:25 UTC (permalink / raw)
To: Christoph Hellwig, davide rossetti
Cc: Jason Gunthorpe, Kovalyov Artemy, dledford@redhat.com,
linux-rdma@vger.kernel.org, linux-mm@kvack.org, Leon Romanovsky,
Sagi Grimberg
On 17/02/2016 10:44, Christoph Hellwig wrote:
> That doesn't change how they are managed. We've always supported mapping
> BARs to userspace in various drivers, and the only real news with things
> like the pmem driver with DAX or some of the things people want to do
> with the NVMe controller memory buffer is that there are much bigger
> quantities of it, and:
>
> a) people want to be able to have cachable mappings of various kinds
> instead of the old uncachable default.
What if we do want an uncachable mapping for our device's BAR? Can we still
expose it under ZONE_DEVICE?
> b) we want to be able to DMA (including RDMA) to the regions in the
> BARs.
>
> a) is something that needs smaller amounts in all kinds of areas to be
> done properly, but in principle GPU drivers have been doing this forever
> using all kinds of hacks.
>
> b) is the real issue. The Linux DMA support code doesn't really operate
> on just physical addresses, but on page structures, and we don't
> allocate for BARs. We investigated two ways to address this: 1) allow
> DMA operations without struct page and 2) create struct page structures
> for BARs that we want to be able to use DMA operations on. For various
> reasons version 2) was favored and this is how we ended up with
> ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
> discussions on how we ended up here.
I was wondering what your thoughts are regarding the other questions we raised
about ZONE_DEVICE.
How can we overcome the section-alignment requirement in the current code? Our
HCA's BARs are usually smaller than 128MB.
Sagi also asked how a peer device that got a ZONE_DEVICE page should know it
should stop using it (the CMB example).
Regards,
Haggai
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC 0/7] Peer-direct memory
[not found] ` <56C09C7E.4060808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2016-02-18 14:44 ` Stephen Bates
2016-02-21 9:06 ` Haggai Eran
0 siblings, 1 reply; 28+ messages in thread
From: Stephen Bates @ 2016-02-18 14:44 UTC (permalink / raw)
To: Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
'Logan Gunthorpe' (logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org)
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org,
haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
Sagi
> CC'ing sbates who played with this stuff at some point...
Thanks for inviting me to this party Sagi ;-). Here are some comments and responses based on our experiences. Apologies in advance for the list format:
1. As it stands in 4.5-rc4, devm_memremap_pages will not work with iomem. Logan (cc'ed here) and I - mostly Logan - developed the ability to do that in an out-of-tree patch for memremap.c. We also developed a simple example driver for a PCIe device that exposes DRAM on the card via a BAR (a rough sketch of that kind of driver is shown after the references below). We used this code to provide some feedback to Dan (e.g. [1]-[3]). At this time we are preparing an RFC to extend devm_memremap_pages for IO memory and we hope to have that ready soon, but there is no guarantee our approach is acceptable to the community. My hope is that it will be a good starting point for moving forward...
2. The two good things about Peer-Direct are that it works and it is here today. That said, I do think an approach based on ZONE_DEVICE is more general and a preferred way to allow IO devices to communicate with each other. The question is: can we find such an approach that is acceptable to the community? As noted in point 1, I hope the coming RFC will initiate a discussion. I have also requested attendance at LSF/MM to discuss this topic (among others).
3. As of now the section alignment requirement is somewhat relaxed. I quote from [4].
"I could loosen the restriction a bit to allow one unaligned mapping
per section. However, if another mapping request came along that
tried to map a free part of the section it would fail because the code
depends on a "1 dev_pagemap per section" relationship. Seems an ok
compromise to me..."
This is implemented in 4.5-rc4 (see memremap.c line 315).
4. The out of tree patch we did allows one to register the device memory as IO memory. However, we were only concerned with DRAM exposed on the BAR and so were not affected by the "i/o side effects" issues. Someone would need to think about how this applies to IOMEM that does have side-effects when accessed.
5. I concur with Sagi's comment below that one approach we can use to inform 3rd party device drivers about vanishing memory regions is via mmu_notifiers. However, this needs to be fleshed out and tied into the relevant driver(s).
6. In full disclosure, my main interest in this ties in to NVM Express devices, which can act as DMA masters and expose regions of IOMEM at the same time (via CMBs). I want to be able to tie these devices together with other IO devices (like RDMA NICs, FPGA- and GPGPU-based offload engines, other NVMe devices and storage adaptors) in a peer-to-peer fashion and may not always have an RDMA device in the mix...
Cheers
Stephen
[1] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002357.html
[2] https://lwn.net/Articles/667148/
[3] https://lists.01.org/pipermail/linux-nvdimm/2015-December/003104.html
[4] https://lists.01.org/pipermail/linux-nvdimm/2015-December/003141.html
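As a rough illustration of the example-driver idea in point 1, a probe routine that hands a card-DRAM BAR to devm_memremap_pages() might look like the following sketch. It is an assumption-laden sketch, not the out-of-tree patch itself: mainline 4.5 devm_memremap_pages() rejects I/O memory, its argument list has changed between kernel versions (the simple (dev, res) form is assumed here), and the BAR index and demo_* names are purely illustrative.

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/memremap.h>
#include <linux/err.h>

/*
 * Hypothetical sketch only.  BAR 2 exposing card DRAM is an assumed layout,
 * and error handling is abbreviated.
 */
static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct resource *res = &pdev->resource[2];
	void *addr;
	int rc;

	rc = pcim_enable_device(pdev);
	if (rc)
		return rc;

	/*
	 * Create struct pages covering the BAR so that code which expects
	 * page structures (DMA mapping, RDMA umem handling) can use it.
	 * The simple (dev, res) form of devm_memremap_pages() is assumed;
	 * later kernels take additional reference/altmap arguments.
	 */
	addr = devm_memremap_pages(&pdev->dev, res);
	if (IS_ERR(addr))
		return PTR_ERR(addr);

	pci_set_drvdata(pdev, addr);
	return 0;
}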
> -----Original Message-----
> From: Sagi Grimberg [mailto:sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org]
> Sent: Sunday, February 14, 2016 8:26 AM
> To: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>; Christoph
> Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> Cc: Artemy Kovalyov <artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-mm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; leon-Dj0/KMMU01E@public.gmane.org;
> haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org; Stephen Bates
> <Stephen.Bates-PwyqCcigF0Q@public.gmane.org>
> Subject: Re: [RFC 0/7] Peer-direct memory
>
>
> >> Agreed on all points. Nevermind that seems to be missing a user of
> >> all the newly added infrastructure, which should come with the series.
> >
> > Yep!
> >
> > I've heard from the people working on NVMe cards that this approach
> > isn't needed, desired, or correct. I'm deeply skeptical that there
> > would be an in-kernel user.
>
> Hey Jason,
>
> So regarding NVMe, a possible user for this would be a user-space NVMe
> target implementation which wants to use a CMB (controller memory buffer
> on the device BAR) as an intermediate memory for RDMA. If the NVMe device
> unplugs, we can't rely on user-space to handle it, and we need a way to tell the
> HCA to tear down the CMB mappings.
>
> I do think that this problem is solvable if we create an mmu_notifier client for
> this mapping (assuming that device unplug will trigger MMU invalidate
> callouts on the relevant pages). But I'm still not exactly sure how
> ZONE_DEVICE works with an I/O device BAR (I'd appreciate it if someone
> can educate me here).
>
> CC'ing sbates who played with this stuff at some point...
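To make the mmu_notifier idea above concrete, a minimal sketch of such a client could look like the code below. The cmb_* names and the cmb_unmap_range() hook are placeholders rather than an existing API, and the callback signature follows the 4.x-era mmu_notifier interface discussed in this thread.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct cmb_mapping {
	struct mmu_notifier mn;
	struct mm_struct *mm;
};

/* Placeholder for whatever invalidation hook the HCA driver exposes. */
static void cmb_unmap_range(struct cmb_mapping *map,
			    unsigned long start, unsigned long end);

static void cmb_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	struct cmb_mapping *map = container_of(mn, struct cmb_mapping, mn);

	/*
	 * Device unplug (or any other unmap of the CMB range) ends up here;
	 * tell the HCA to stop using the affected pages before they go away.
	 */
	cmb_unmap_range(map, start, end);
}

static const struct mmu_notifier_ops cmb_mn_ops = {
	.invalidate_range_start	= cmb_invalidate_range_start,
};

static int cmb_mapping_register(struct cmb_mapping *map, struct mm_struct *mm)
{
	map->mm = mm;
	map->mn.ops = &cmb_mn_ops;
	return mmu_notifier_register(&map->mn, mm);
}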
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <20160217084959.GB13616-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2016-02-18 17:12 ` Jason Gunthorpe
0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2016-02-18 17:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: davide rossetti, Haggai Eran, Kovalyov Artemy,
dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
leon-Dj0/KMMU01E@public.gmane.org, Sagi Grimberg
On Wed, Feb 17, 2016 at 12:49:59AM -0800, Christoph Hellwig wrote:
> PCI driver interface. For pmem (which some people confusingly call
> NVM) mapping the byte addressable persistent memory to userspace using
> DAX makes a lot of sense, and a lot of work around that is going
> on currently.
Right, this is what I was referring to: 'pmem'-like capability done
with NVMe hardware on PCIe.
Jason
* Re: [RFC 0/7] Peer-direct memory
[not found] ` <56C490DF.1090100-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-19 18:54 ` Dan Williams
0 siblings, 0 replies; 28+ messages in thread
From: Dan Williams @ 2016-02-19 18:54 UTC (permalink / raw)
To: Haggai Eran
Cc: Christoph Hellwig, davide rossetti, Jason Gunthorpe,
Kovalyov Artemy, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Leon Romanovsky,
Sagi Grimberg
On Wed, Feb 17, 2016 at 7:25 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 17/02/2016 10:44, Christoph Hellwig wrote:
>> That doesn't change how they are managed. We've always supported mapping
>> BARs to userspace in various drivers, and the only real news with things
>> like the pmem driver with DAX or some of the things people want to do
>> with the NVMe controller memory buffer is that there are much bigger
>> quantities of it, and:
>>
>> a) people want to be able to have cacheable mappings of various kinds
>> instead of the old uncacheable default.
> What if we do want an uncacheable mapping for our device's BAR? Can we still
> expose it under ZONE_DEVICE?
>
>> b) we want to be able to DMA (including RDMA) to the regions in the
>> BARs.
>>
>> a) is something that needs smaller amounts of work in all kinds of areas to be
>> done properly, but in principle GPU drivers have been doing this forever
>> using all kinds of hacks.
>>
>> b) is the real issue. The Linux DMA support code doesn't really operate
>> on just physical addresses, but on page structures, and we don't
>> allocate them for BARs. We investigated two ways to address this: 1) allow
>> DMA operations without struct page and 2) create struct page structures
>> for BARs that we want to be able to use DMA operations on. For various
>> reasons version 2) was favored and this is how we ended up with
>> ZONE_DEVICE. Read the linux-mm and linux-nvdimm lists for the lengthy
>> discussions on how we ended up here.
>
> I was wondering what your thoughts are regarding the other questions we raised
> about ZONE_DEVICE.
>
> How can we overcome the section-alignment requirement in the current code? Our
> HCA's BARs are usually smaller than 128MB.
This may not help, but note that the section-alignment requirement only
bites when trying to have 2 mappings with different lifetimes in a single
section. It's otherwise fine to map a full section for a smaller
single range; you'll just end up with pages that won't be used.
However, this assumes that you are fine with everything in that
section being mapped cacheable; you couldn't mix uncacheable mappings
in that same range.
> Sagi also asked how a peer device that got a ZONE_DEVICE page should know
> when to stop using it (the CMB example).
ZONE_DEVICE pages come with a per-cpu reference counter via
page->pgmap. See get_dev_pagemap(), get_zone_device_page(), and
put_zone_device_page().
However, this gets confusing quickly when a 'pfn' and a 'page' start
referencing mmio space instead of host memory. It seems like we need
new data types because a dma_addr_t does not necessarily reflect the
peer-to-peer address as seen by the device.
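As a rough illustration of the reference counting described above (written against the 4.5-era interfaces; exact names and signatures may differ in other kernels), a consumer pinning a ZONE_DEVICE page might look like the sketch below. peer_get_page()/peer_put_page() are hypothetical helper names, not existing API.

#include <linux/memremap.h>
#include <linux/mm.h>

/* Sketch only: pin the dev_pagemap behind a ZONE_DEVICE pfn, then the page. */
static struct page *peer_get_page(unsigned long pfn)
{
	struct dev_pagemap *pgmap;
	struct page *page;

	/* Takes a reference on the per-cpu counter behind page->pgmap. */
	pgmap = get_dev_pagemap(pfn, NULL);
	if (!pgmap)
		return NULL;

	page = pfn_to_page(pfn);
	get_page(page);
	return page;
}

static void peer_put_page(struct page *page)
{
	struct dev_pagemap *pgmap = page->pgmap;

	put_page(page);
	put_dev_pagemap(pgmap);
}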
* Re: [RFC 0/7] Peer-direct memory
2016-02-18 14:44 ` Stephen Bates
@ 2016-02-21 9:06 ` Haggai Eran
[not found] ` <56C97E13.9090101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 28+ messages in thread
From: Haggai Eran @ 2016-02-21 9:06 UTC (permalink / raw)
To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
'Logan Gunthorpe' (logang@deltatee.com)
Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com
On 18/02/2016 16:44, Stephen Bates wrote:
> Sagi
>
>> CC'ing sbates who played with this stuff at some point...
>
> Thanks for inviting me to this party Sagi ;-). Here are some comments and responses based on our experiences. Apologies in advance for the list format:
>
> 1. As it stands in 4.5-rc4, devm_memremap_pages will not work with iomem. Logan (cc'ed here) and I - mostly Logan - developed the ability to do that in an out-of-tree patch for memremap.c. We also developed a simple example driver for a PCIe device that exposes DRAM on the card via a BAR. We used this code to provide some feedback to Dan (e.g. [1]-[3]). At this time we are preparing an RFC to extend devm_memremap_pages for IO memory and we hope to have that ready soon, but there is no guarantee our approach is acceptable to the community. My hope is that it will be a good starting point for moving forward...
I'd be happy to see your RFC when you are ready. I see in the thread
of [3] that you are using write-combining. Do you think your patchset
will also be suitable for uncacheable memory?
> 2. The two good things about Peer-Direct are that it works and it is here today. That said, I do think an approach based on ZONE_DEVICE is more general and a preferred way to allow IO devices to communicate with each other. The question is: can we find such an approach that is acceptable to the community? As noted in point 1, I hope the coming RFC will initiate a discussion. I have also requested attendance at LSF/MM to discuss this topic (among others).
>
> 3. As of now the section alignment requirement is somewhat relaxed. I quote from [4].
>
> "I could loosen the restriction a bit to allow one unaligned mapping
> per section. However, if another mapping request came along that
> tried to map a free part of the section it would fail because the code
> depends on a "1 dev_pagemap per section" relationship. Seems an ok
> compromise to me..."
>
> This is implemented in 4.5-rc4 (see memremap.c line 315).
I don't think that's enough for our purposes. We have devices with
rather small BARs (32MB) and multiple PFs that all need to expose their
BAR to peer-to-peer access. One can expect these PFs will be assigned
adjacent addresses (e.g., four 32MB BARs placed back to back all land
in a single 128MB section), and they will break the "one dev_pagemap per
section" rule.
> 4. The out of tree patch we did allows one to register the device memory as IO memory. However, we were only concerned with DRAM exposed on the BAR and so were not affected by the "i/o side effects" issues. Someone would need to think about how this applies to IOMEM that does have side-effects when accessed.
With this RFC, we take parts of the HCA BAR that were mmapped to a process
(both uncacheable and write-combining) and map them to a peer device
(another HCA). As long as the kernel doesn't do anything else with
these pages, and leaves them to be controlled by the user-space
application and/or the peer device, I don't see a problem with mapping
IO memory with side effects. However, I'm not an expert here, and I'd
be happy to hear what others think about this.
> 5. I concur with Sagi's comment below that one approach we can use to inform 3rd party device drivers about vanishing memory regions is via mmu_notifiers. However, this needs to be fleshed out and tied into the relevant driver(s).
>
> 6. In full disclosure, my main interest in this ties in to NVM Express devices, which can act as DMA masters and expose regions of IOMEM at the same time (via CMBs). I want to be able to tie these devices together with other IO devices (like RDMA NICs, FPGA- and GPGPU-based offload engines, other NVMe devices and storage adaptors) in a peer-to-peer fashion and may not always have an RDMA device in the mix...
I understand.
Regards,
Haggai
* RE: [RFC 0/7] Peer-direct memory
[not found] ` <56C97E13.9090101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-02-24 23:45 ` Stephen Bates
2016-02-25 11:27 ` Haggai Eran
0 siblings, 1 reply; 28+ messages in thread
From: Stephen Bates @ 2016-02-24 23:45 UTC (permalink / raw)
To: Haggai Eran, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
'Logan Gunthorpe' (logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org)
Cc: Artemy Kovalyov, dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Leon Romanovsky,
sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
Haggai
> I'd be happy to see your RFC when you are ready. I see in the thread of [3]
> that you are using write-combining. Do you think your patchset will also be
> suitable for uncacheable memory?
Great, we hope to have the RFC soon. It will be able to accept different flags for the devm_memremap() call with regard to caching. Though one question I have is: when does the caching flag affect peer-to-peer memory accesses? I can see caching causing issues when performing accesses from the CPU, but P2P accesses should bypass any caches in the system?
> I don't think that's enough for our purposes. We have devices with rather
> small BARs (32MB) and multiple PFs that all need to expose their BAR to peer
> to peer access. One can expect these PFs will be assigned adjacent addresses
> and they will break the "one dev_pagemap per section" rule.
On the cards and systems I have checked, even small BARs tend to be separated by more than one section's worth of memory. As I understand it, the allocation of BAR addresses is very arch- and BIOS-specific. Let's discuss this once the RFC comes out and see what options exist to address your concerns.
>
> > 4. The out of tree patch we did allows one to register the device memory as
> IO memory. However, we were only concerned with DRAM exposed on the
> BAR and so were not affected by the "i/o side effects" issues. Someone
> would need to think about how this applies to IOMEM that does have side-
> effects when accessed.
> With this RFC, we map parts of the HCA BAR that were mmapped to a
> process (both uncacheable and write-combining) and map them to a peer
> device (another HCA). As long as the kernel doesn't do anything else with
> these pages, and leaves them to be controlled by the user-space application
> and/or the peer device, I don't see a problem with mapping IO memory with
> side effects. However, I'm not an expert here, and I'd be happy to hear what
> others think about this.
See above. I think the upcoming RFC should provide support for both cached and uncached mappings. I concur that even if the mappings are flagged as cacheable there should be no issues as long as all accesses are from the peer-direct device.
* Re: [RFC 0/7] Peer-direct memory
2016-02-24 23:45 ` Stephen Bates
@ 2016-02-25 11:27 ` Haggai Eran
0 siblings, 0 replies; 28+ messages in thread
From: Haggai Eran @ 2016-02-25 11:27 UTC (permalink / raw)
To: Stephen Bates, Sagi Grimberg, Jason Gunthorpe, Christoph Hellwig,
'Logan Gunthorpe' (logang@deltatee.com)
Cc: Artemy Kovalyov, dledford@redhat.com, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, Leon Romanovsky, sagig@mellanox.com
On 25/02/2016 01:45, Stephen Bates wrote:
> Great, we hope to have the RFC soon. It will be able to accept different flags for the devm_memremap() call with regard to caching. Though one question I have is: when does the caching flag affect peer-to-peer memory accesses? I can see caching causing issues when performing accesses from the CPU, but P2P accesses should bypass any caches in the system?
I don't think the caching flag will affect peer-to-peer accesses directly, but we
need to keep the BAR mapped to the host the same way it is today. If we change the
driver to map page structs returned from devm_memremap_pages() instead of using
io_remap_pfn_range(), it needs to continue working for host uses and not only
for peers.
Regards,
Haggai
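To illustrate the host-facing path referred to above, here is a minimal sketch of a BAR mmap handler built around io_remap_pfn_range(); the hca_mmap_bar name, the bar_pfn/write_combine parameters, and the uncached/write-combining split are placeholders for driver-specific state, not an existing driver's API. A conversion to devm_memremap_pages()-backed struct pages would replace the remap call while keeping exactly this behaviour for host users.

#include <linux/mm.h>
#include <linux/io.h>

/* Sketch of today's host mmap path for a chunk of the HCA BAR. */
static int hca_mmap_bar(struct vm_area_struct *vma, unsigned long bar_pfn,
			bool write_combine)
{
	vma->vm_page_prot = write_combine ?
		pgprot_writecombine(vma->vm_page_prot) :
		pgprot_noncached(vma->vm_page_prot);

	/*
	 * Map the BAR pfns straight into the process address space.  If the
	 * BAR were instead registered with devm_memremap_pages(), the driver
	 * could insert the resulting struct pages here, but it would have to
	 * keep behaving like this for plain host mappings.
	 */
	return io_remap_pfn_range(vma, vma->vm_start, bar_pfn,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}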
end of thread, other threads:[~2016-02-25 11:27 UTC | newest]
Thread overview: 28+ messages
2016-02-11 16:12 [RFC 0/7] Peer-direct memory Artemy Kovalyov
[not found] ` <1455207177-11949-1-git-send-email-artemyko-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-11 16:12 ` [RFC 1/7] IB/core: Introduce peer client interface Artemy Kovalyov
2016-02-11 16:12 ` [RFC 2/7] IB/core: Get/put peer memory client Artemy Kovalyov
2016-02-11 16:12 ` [RFC 3/7] IB/core: Umem tunneling peer memory APIs Artemy Kovalyov
2016-02-11 16:12 ` [RFC 4/7] IB/core: Infrastructure to manage peer core context Artemy Kovalyov
2016-02-11 16:12 ` [RFC 5/7] IB/core: Invalidation support for peer memory Artemy Kovalyov
2016-02-11 16:12 ` [RFC 6/7] IB/core: Peer memory client for IO memory Artemy Kovalyov
2016-02-11 16:12 ` [RFC 7/7] IB/mlx5: Invalidation support for MR over peer memory Artemy Kovalyov
2016-02-11 19:18 ` [RFC 0/7] Peer-direct memory Jason Gunthorpe
[not found] ` <20160211191838.GA23675-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-12 20:13 ` Christoph Hellwig
[not found] ` <20160212201328.GA14122-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-02-12 20:36 ` Jason Gunthorpe
[not found] ` <20160212203649.GA10540-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2016-02-14 15:25 ` Sagi Grimberg
[not found] ` <56C09C7E.4060808-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2016-02-18 14:44 ` Stephen Bates
2016-02-21 9:06 ` Haggai Eran
[not found] ` <56C97E13.9090101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-24 23:45 ` Stephen Bates
2016-02-25 11:27 ` Haggai Eran
2016-02-14 14:09 ` Haggai Eran
2016-02-14 14:05 ` Haggai Eran
2016-02-14 14:27 ` Haggai Eran
2016-02-16 18:22 ` Jason Gunthorpe
2016-02-17 4:03 ` davide rossetti
[not found] ` <CAPSaadxbFCOcKV=c3yX7eGw9Wqzn3jvPRZe2LMWYmiQcijT4nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-02-17 4:13 ` davide rossetti
2016-02-17 4:44 ` Jason Gunthorpe
2016-02-17 8:49 ` Christoph Hellwig
[not found] ` <20160217084959.GB13616-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-02-18 17:12 ` Jason Gunthorpe
[not found] ` <CAPSaadx3vNBSxoWuvjrTp2n8_-DVqofttFGZRR+X8zdWwV86nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-02-17 8:44 ` Christoph Hellwig
2016-02-17 15:25 ` Haggai Eran
[not found] ` <56C490DF.1090100-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-02-19 18:54 ` Dan Williams