* [RFC 00/20] On demand paging
@ 2014-03-02 10:49 Haggai Eran
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
The following set of patches implements on-demand paging (ODP) support
in the RDMA stack and in the mlx5_ib Infiniband driver.
What is on-demand paging?
Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.
This can make programming with RDMA much simpler. Today, developers that
are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we might
be able to provide a single memory access key for each process that
would provide the entire process's address as one large memory region,
and the developers wouldn't need to register memory regions at all.
How does page faults generally work?
With pinned memory regions, the driver would map the virtual addresses
to bus addresses, and pass these addresses to the HCA to associate them
with the new MR. With ODP, the driver is now allowed to mark some of the
pages in the MR as not-present. When the HCA attempts to perform memory
access for a communication operation, it notices the page is not
present, and raises a page fault event to the driver. In addition, the
HCA performs whatever operation is required by the transport protocol to
suspend communication until the page fault is resolved.
Upon receiving the page fault interrupt, the driver first needs to know
on which virtual address the page fault occurred, and on what memory
key. When handling send/receive operations, this information is inside
the work queue. The driver reads the needed work queue elements, and
parses them to gather the address and memory key. For other RDMA
operations, the event generated by the HCA only contains the virtual
address and rkey, as there are no work queue elements involved.
Having the rkey, the driver can find the relevant memory region in its
data structures, and calculate the actual pages needed to complete the
operation. It then uses get_user_pages to retrieve the needed pages back
to the memory, obtains dma mapping, and passes the addresses to the HCA.
Finally, the driver notifies the HCA it can continue operation on the
queue pair that encountered the page fault. The pages that
get_user_pages returned are unpinned immediately by releasing their
reference.
How are invalidations handled?
The patches add infrastructure to subscribe the RDMA stack as an mmu
notifier client [1]. Each process that uses ODP register a notifier client.
When receiving page invalidation notifications, they are passed to the
mlx5_ib driver, which updates the HCA with new, not-present mappings.
Only after flushing the HCA's page table caches the notifier returns,
allowing the kernel to release the pages.
What operations are supported?
Currently only send, receive and RDMA write operations are supported on the
RC protocol, and also send operations on the UD protocol. We hope to
implement support for other transports and operations in the future.
The structure of the patchset
First, the patches apply against the for-next branch in the
roland/infiniband.git tree, with the signature patches [2] applied, and also
the patch that refactors umem to use a linear SG table [3].
Patches 1-5:
The first set of patches adds page fault support to the IB core layer,
allowing MRs to be registered without their pages to be pinned. The first
patch adds capability bits, configuration options, and a method for
querying whether the paging capabilities from user-space. The next two
patches (2-3) make some necessary changes to the ib_umem type. Patches
4 and 5 add paging support and invalidation support respectively.
Patches 6-9:
The next set of patches contain some minor fixes to the mlx5 driver that
were needed. Patches 6-7 fix a two bugs that may affect the paging code,
and patches 8-9 add code to store missing information in mlx5 structures
that is needed for the paging code to work correctly.
Patches 10-16:
This set of patches add small size new functionality to the mlx5 driver and
builds toward paging support. Patches 10-11 make changes to UMR mechanism
(an internal mechanism used by mlx5 to update device page mappings).
Patch 12 adds infrastructure support for page fault handling to the
mlx5_core module. Patch 13 queries the device for paging capabilities, and
patch 15 adds a function to do partial device page table updates. Finally,
patch 16 adds a helper function to read information from user-space work
queues in the driver's context.
Patches 17-20:
The fourth part of this patch set finally adds paging support to the mlx5
driver. Patch 17 adds in mlx5_ib the infrastructure to handle page faults
coming from mlx5_core. Patch 18 adds the code to handle UD send page faults
and RC send and receive page faults. Patch 19 adds support for page faults
caused by RDMA write operations, and patch 20 adds invalidation support to
the mlx5 driver, allowing pages to be unmapped dynamically.
[1] Integrating KVM with the Linux Memory Management (presentation),
Andrea Archangeli
http://www.linux-kvm.org/wiki/images/3/33/KvmForum2008%24kdf2008_15.pdf
[2] [PATCH v5 00/10] Introduce Signature feature
http://marc.info/?l=linux-rdma&m=139315796528067&w=2
[3] [PATCH] IB: Refactor umem to use linear SG table
http://marc.info/?l=linux-rdma&m=139090922810663&w=2
Haggai Eran (13):
IB/core: Replace ib_umem's offset field with a full address
IB/core: Add umem function to read data from user-space
IB/mlx5: Fix error handling in reg_umr
IB/mlx5: Add MR to radix tree in reg_mr_callback
mlx5: Store MR attributes in mlx5_mr_core during creation and after
UMR
IB/mlx5: Set QP offsets and parameters for user QPs and not just for
kernel QPs
IB/mlx5: Enhance UMR support to allow partial page table update
net/mlx5_core: Add support for page faults events and low level
handling
IB/mlx5: Implement the ODP capability query verb
IB/mlx5: Changes in memory region creation to support on-demand
paging
IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
IB/mlx5: Add function to read WQE from user-space
IB/mlx5: Handle page faults
Sagi Grimberg (2):
IB/core: Add flags for on demand paging support
IB/mlx5: Page faults handling infrastructure
Shachar Raindel (5):
IB/core: Add support for on demand paging regions
IB/core: Implement support for MMU notifiers regarding on demand
paging regions
IB/mlx5: Refactor UMR to have its own context struct
IB/mlx5: Add support for RDMA write responder page faults
IB/mlx5: Implement on demand paging by adding support for MMU
notifiers
drivers/infiniband/Kconfig | 11 +
drivers/infiniband/core/Makefile | 1 +
drivers/infiniband/core/umem.c | 63 +-
drivers/infiniband/core/umem_odp.c | 611 +++++++++++++++++++
drivers/infiniband/core/umem_rbtree.c | 94 +++
drivers/infiniband/core/uverbs.h | 1 +
drivers/infiniband/core/uverbs_cmd.c | 84 +++
drivers/infiniband/core/uverbs_main.c | 7 +-
drivers/infiniband/hw/amso1100/c2_provider.c | 2 +-
drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +-
drivers/infiniband/hw/ipath/ipath_mr.c | 2 +-
drivers/infiniband/hw/mlx5/Makefile | 1 +
drivers/infiniband/hw/mlx5/main.c | 42 +-
drivers/infiniband/hw/mlx5/mem.c | 67 ++-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 124 +++-
drivers/infiniband/hw/mlx5/mr.c | 378 +++++++++---
drivers/infiniband/hw/mlx5/odp.c | 777 +++++++++++++++++++++++++
drivers/infiniband/hw/mlx5/qp.c | 199 +++++--
drivers/infiniband/hw/nes/nes_verbs.c | 4 +-
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/qib/qib_mr.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 11 +-
drivers/net/ethernet/mellanox/mlx5/core/fw.c | 35 +-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
drivers/net/ethernet/mellanox/mlx5/core/mr.c | 4 +
drivers/net/ethernet/mellanox/mlx5/core/qp.c | 134 ++++-
include/linux/mlx5/device.h | 73 ++-
include/linux/mlx5/driver.h | 21 +-
include/linux/mlx5/qp.h | 63 ++
include/rdma/ib_umem.h | 29 +-
include/rdma/ib_umem_odp.h | 156 +++++
include/rdma/ib_verbs.h | 47 +-
include/uapi/rdma/ib_user_verbs.h | 18 +-
33 files changed, 2917 insertions(+), 156 deletions(-)
create mode 100644 drivers/infiniband/core/umem_odp.c
create mode 100644 drivers/infiniband/core/umem_rbtree.c
create mode 100644 drivers/infiniband/hw/mlx5/odp.c
create mode 100644 include/rdma/ib_umem_odp.h
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* [RFC 01/20] IB/core: Add flags for on demand paging support
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 02/20] IB/core: Replace ib_umem's offset field with a full address Haggai Eran
` (20 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
* Add a configuration option for enable on-demand paging support in the
infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a later patch,
this configuration option will select the MMU_NOTIFIER configuration option
to enable mmu notifiers.
* Add a flag for on demand paging (ODP) support in the IB device capabilities.
* Add a flag to request ODP MR in the access flags to reg_mr.
* Fail registrations done with the ODP flag when the low-level driver doesn't
support this.
* Change the conditions in which an MR will be writable to explicitly
specify the access flags. This is to avoid making an MR writable just
because it is an ODP MR.
* Add a query_odp_caps verb to query from user-space for ODP capabilities.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 10 ++++++
drivers/infiniband/core/umem.c | 8 +++--
drivers/infiniband/core/uverbs.h | 1 +
drivers/infiniband/core/uverbs_cmd.c | 63 +++++++++++++++++++++++++++++++++++
drivers/infiniband/core/uverbs_main.c | 5 ++-
include/rdma/ib_verbs.h | 28 ++++++++++++++--
include/uapi/rdma/ib_user_verbs.h | 18 +++++++++-
7 files changed, 126 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 7708939..089a2c2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,16 @@ config INFINIBAND_USER_MEM
depends on INFINIBAND_USER_ACCESS != n
default y
+config INFINIBAND_ON_DEMAND_PAGING
+ bool "InfiniBand on-demand paging support"
+ depends on INFINIBAND_USER_MEM
+ default y
+ ---help---
+ On demand paging support for the InfiniBand subsystem.
+ Together with driver support this allows registration of
+ memory regions without pinning their pages, fetching the
+ pages on demand instead.
+
config INFINIBAND_ADDR_TRANS
bool
depends on INFINIBAND
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..1fba9d3 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -106,13 +106,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->offset = addr & ~PAGE_MASK;
umem->page_size = PAGE_SIZE;
/*
- * We ask for writable memory if any access flags other than
- * "remote read" are set. "Local write" and "remote write"
+ * We ask for writable memory if any of the following
+ * access flags are set. "Local write" and "remote write"
* obviously require write access. "Remote atomic" can do
* things like fetch and add, which will modify memory, and
* "MW bind" can change permissions by binding a window.
*/
- umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
+ umem->writable = !!(access &
+ (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
+ IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb = 1;
diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index a283274..d1cefc8 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -257,5 +257,6 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
IB_UVERBS_DECLARE_EX_CMD(create_flow);
IB_UVERBS_DECLARE_EX_CMD(destroy_flow);
+IB_UVERBS_DECLARE_EX_CMD(query_odp_caps);
#endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index ea6203e..2795d86 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -947,6 +947,22 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
goto err_free;
}
+
+ if (cmd.access_flags & IB_ACCESS_ON_DEMAND) {
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ struct ib_device_attr attr;
+ ret = ib_query_device(pd->device, &attr);
+ if (ret || !(attr.device_cap_flags &
+ IB_DEVICE_ON_DEMAND_PAGING)) {
+ ret = -EINVAL;
+ goto err_put;
+ }
+#else
+ ret = -EINVAL;
+ goto err_put;
+#endif
+ }
+
mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
cmd.access_flags, &udata);
if (IS_ERR(mr)) {
@@ -1160,6 +1176,53 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
return in_len;
}
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int ib_uverbs_ex_query_odp_caps(struct ib_uverbs_file *file,
+ struct ib_udata *ucore,
+ struct ib_udata *uhw)
+{
+ struct ib_uverbs_query_odp_caps cmd;
+ struct ib_uverbs_query_odp_caps_resp resp;
+ struct ib_device_attr attr;
+ int err;
+
+ if (ucore->inlen < sizeof(cmd))
+ return -EINVAL;
+
+ if (ucore->outlen < sizeof(resp))
+ return -ENOSPC;
+
+ err = ib_copy_from_udata(&cmd, ucore, sizeof(cmd));
+ if (err)
+ return err;
+
+ ucore->inbuf += sizeof(cmd);
+ ucore->inlen -= sizeof(cmd);
+
+ if (cmd.comp_mask)
+ return -EINVAL;
+
+ err = ib_query_device(file->device->ib_dev, &attr);
+
+ if (err)
+ return err;
+
+ memset(&resp, 0, sizeof(resp));
+ resp.comp_mask = 0;
+ resp.general_caps = attr.odp_caps.general_caps;
+ resp.per_transport_caps.rc_odp_caps =
+ attr.odp_caps.per_transport_caps.rc_odp_caps;
+ resp.per_transport_caps.uc_odp_caps =
+ attr.odp_caps.per_transport_caps.uc_odp_caps;
+ resp.per_transport_caps.ud_odp_caps =
+ attr.odp_caps.per_transport_caps.ud_odp_caps;
+
+ err = ib_copy_to_udata(ucore,
+ &resp, sizeof(resp));
+ return err;
+}
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
const char __user *buf, int in_len,
int out_len)
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 08219fb..a767298 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -121,7 +121,10 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file,
struct ib_udata *ucore,
struct ib_udata *uhw) = {
[IB_USER_VERBS_EX_CMD_CREATE_FLOW] = ib_uverbs_ex_create_flow,
- [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow
+ [IB_USER_VERBS_EX_CMD_DESTROY_FLOW] = ib_uverbs_ex_destroy_flow,
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ [IB_USER_VERBS_EX_CMD_QUERY_ODP_CAPS] = ib_uverbs_ex_query_odp_caps,
+#endif
};
static void ib_uverbs_add_one(struct ib_device *device);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index edfe0d5..129261c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -123,7 +123,8 @@ enum ib_device_cap_flags {
IB_DEVICE_MEM_WINDOW_TYPE_2A = (1<<23),
IB_DEVICE_MEM_WINDOW_TYPE_2B = (1<<24),
IB_DEVICE_MANAGED_FLOW_STEERING = (1<<29),
- IB_DEVICE_SIGNATURE_HANDOVER = (1<<30)
+ IB_DEVICE_SIGNATURE_HANDOVER = (1<<30),
+ IB_DEVICE_ON_DEMAND_PAGING = (1<<31),
};
enum ib_signature_prot_cap {
@@ -143,6 +144,27 @@ enum ib_atomic_cap {
IB_ATOMIC_GLOB
};
+enum ib_odp_general_cap_bits {
+ IB_ODP_SUPPORT = 1 << 0,
+};
+
+enum ib_odp_transport_cap_bits {
+ IB_ODP_SUPPORT_SEND = 1 << 0,
+ IB_ODP_SUPPORT_RECV = 1 << 1,
+ IB_ODP_SUPPORT_WRITE = 1 << 2,
+ IB_ODP_SUPPORT_READ = 1 << 3,
+ IB_ODP_SUPPORT_ATOMIC = 1 << 4,
+};
+
+struct ib_odp_caps {
+ uint64_t general_caps;
+ struct {
+ uint32_t rc_odp_caps;
+ uint32_t uc_odp_caps;
+ uint32_t ud_odp_caps;
+ } per_transport_caps;
+};
+
struct ib_device_attr {
u64 fw_ver;
__be64 sys_image_guid;
@@ -186,6 +208,7 @@ struct ib_device_attr {
u8 local_ca_ack_delay;
int sig_prot_cap;
int sig_guard_cap;
+ struct ib_odp_caps odp_caps;
};
enum ib_mtu {
@@ -1076,7 +1099,8 @@ enum ib_access_flags {
IB_ACCESS_REMOTE_READ = (1<<2),
IB_ACCESS_REMOTE_ATOMIC = (1<<3),
IB_ACCESS_MW_BIND = (1<<4),
- IB_ZERO_BASED = (1<<5)
+ IB_ZERO_BASED = (1<<5),
+ IB_ACCESS_ON_DEMAND = (1<<6),
};
struct ib_phys_buf {
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index cbfdd4c..b8a478d 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -91,7 +91,8 @@ enum {
enum {
IB_USER_VERBS_EX_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_THRESHOLD,
- IB_USER_VERBS_EX_CMD_DESTROY_FLOW
+ IB_USER_VERBS_EX_CMD_DESTROY_FLOW,
+ IB_USER_VERBS_EX_CMD_QUERY_ODP_CAPS,
};
/*
@@ -280,6 +281,21 @@ struct ib_uverbs_dereg_mr {
__u32 mr_handle;
};
+struct ib_uverbs_query_odp_caps {
+ __u64 comp_mask;
+};
+
+struct ib_uverbs_query_odp_caps_resp {
+ __u64 comp_mask;
+ __u64 general_caps;
+ struct {
+ __u32 rc_odp_caps;
+ __u32 uc_odp_caps;
+ __u32 ud_odp_caps;
+ } per_transport_caps;
+ __u32 reserved;
+};
+
struct ib_uverbs_alloc_mw {
__u64 response;
__u32 pd_handle;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 02/20] IB/core: Replace ib_umem's offset field with a full address
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-03-02 10:49 ` [RFC 01/20] IB/core: Add flags for on demand paging support Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 03/20] IB/core: Add umem function to read data from user-space Haggai Eran
` (19 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
In order to allow umems that do not pin memory we need the umem to keep track
of its region's address.
This makes the offset field redundant, and so this patch removes it.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/umem.c | 6 +++---
drivers/infiniband/hw/amso1100/c2_provider.c | 2 +-
drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +-
drivers/infiniband/hw/ipath/ipath_mr.c | 2 +-
drivers/infiniband/hw/nes/nes_verbs.c | 4 ++--
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +-
drivers/infiniband/hw/qib/qib_mr.c | 2 +-
include/rdma/ib_umem.h | 25 ++++++++++++++++++++++++-
8 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 1fba9d3..ab14b33 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -103,7 +103,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
umem->context = context;
umem->length = size;
- umem->offset = addr & ~PAGE_MASK;
+ umem->address = addr;
umem->page_size = PAGE_SIZE;
/*
* We ask for writable memory if any of the following
@@ -133,7 +133,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
if (!vma_list)
umem->hugetlb = 0;
- npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
+ npages = ib_umem_num_pages(umem);
down_write(¤t->mm->mmap_sem);
@@ -242,7 +242,7 @@ void ib_umem_release(struct ib_umem *umem)
return;
}
- diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
+ diff = ib_umem_num_pages(umem);
/*
* We may be called with the mm's mmap_sem already held. This
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 8af33cf..056c405 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -476,7 +476,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
c2mr->umem->page_size,
i,
length,
- c2mr->umem->offset,
+ ib_umem_offset(c2mr->umem),
&kva,
c2_convert_access(acc),
c2mr);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 7168f59..d64c9cc 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -399,7 +399,7 @@ reg_user_mr_fallback:
pginfo.num_kpages = num_kpages;
pginfo.num_hwpages = num_hwpages;
pginfo.u.usr.region = e_mr->umem;
- pginfo.next_hwpage = e_mr->umem->offset / hwpage_size;
+ pginfo.next_hwpage = ib_umem_offset(e_mr->umem) / hwpage_size;
pginfo.u.usr.next_sg = pginfo.u.usr.region->sg_head.sgl;
ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c
index 5e61e9b..c7278f6 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -214,7 +214,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mr->mr.user_base = start;
mr->mr.iova = virt_addr;
mr->mr.length = length;
- mr->mr.offset = umem->offset;
+ mr->mr.offset = ib_umem_offset(umem);
mr->mr.access_flags = mr_access_flags;
mr->mr.max_segs = n;
mr->umem = umem;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 32d3682..e049750 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -2342,7 +2342,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
(unsigned long int)start, (unsigned long int)virt, (u32)length,
region->offset, region->page_size);
- skip_pages = ((u32)region->offset) >> 12;
+ skip_pages = ((u32)ib_umem_offset(region)) >> 12;
if (ib_copy_from_udata(&req, udata, sizeof(req))) {
ib_umem_release(region);
@@ -2407,7 +2407,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
region_length -= skip_pages << 12;
for (page_index = skip_pages; page_index < chunk_pages; page_index++) {
skip_pages = 0;
- if ((page_count != 0) && (page_count<<12)-(region->offset&(4096-1)) >= region->length)
+ if ((page_count != 0) && (page_count << 12) - (ib_umem_offset(region) & (4096 - 1)) >= region->length)
goto enough_pages;
if ((page_count&0x01FF) == 0) {
if (page_count >= 1024 * 512) {
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 0de3473..fe5264a 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -802,7 +802,7 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
goto umem_err;
mr->hwmr.pbe_size = mr->umem->page_size;
- mr->hwmr.fbo = mr->umem->offset;
+ mr->hwmr.fbo = ib_umem_offset(mr->umem);
mr->hwmr.va = usr_addr;
mr->hwmr.len = len;
mr->hwmr.remote_wr = (acc & IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
diff --git a/drivers/infiniband/hw/qib/qib_mr.c b/drivers/infiniband/hw/qib/qib_mr.c
index 9bbb553..a77fb4f 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -258,7 +258,7 @@ struct ib_mr *qib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mr->mr.user_base = start;
mr->mr.iova = virt_addr;
mr->mr.length = length;
- mr->mr.offset = umem->offset;
+ mr->mr.offset = ib_umem_offset(umem);
mr->mr.access_flags = mr_access_flags;
mr->umem = umem;
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1ea0b65..0e120f4 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -42,7 +42,7 @@ struct ib_ucontext;
struct ib_umem {
struct ib_ucontext *context;
size_t length;
- int offset;
+ unsigned long address;
int page_size;
int writable;
int hugetlb;
@@ -54,6 +54,29 @@ struct ib_umem {
int npages;
};
+/* Returns the offset of the umem start relative to the first page. */
+static inline int ib_umem_offset(struct ib_umem *umem)
+{
+ return umem->address & ~PAGE_MASK;
+}
+
+/* Returns the first page of an ODP umem. */
+static inline unsigned long ib_umem_start(struct ib_umem *umem)
+{
+ return umem->address & PAGE_MASK;
+}
+
+/* Returns the address of the page after the last one of an ODP umem. */
+static inline unsigned long ib_umem_end(struct ib_umem *umem)
+{
+ return PAGE_ALIGN(umem->address + umem->length);
+}
+
+static inline size_t ib_umem_num_pages(struct ib_umem *umem)
+{
+ return (ib_umem_end(umem) - ib_umem_start(umem)) >> PAGE_SHIFT;
+}
+
#ifdef CONFIG_INFINIBAND_USER_MEM
struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 03/20] IB/core: Add umem function to read data from user-space
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-03-02 10:49 ` [RFC 01/20] IB/core: Add flags for on demand paging support Haggai Eran
2014-03-02 10:49 ` [RFC 02/20] IB/core: Replace ib_umem's offset field with a full address Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 04/20] IB/core: Add support for on demand paging regions Haggai Eran
` (18 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
In some drivers there's a need to read data from a user space area that
was pinned using ib_umem, when running from a different process context.
The ib_umem_copy_from function allows reading data from the physical pages
pinned in the ib_umem struct.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/umem.c | 25 +++++++++++++++++++++++++
include/rdma/ib_umem.h | 2 ++
2 files changed, 27 insertions(+)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index ab14b33..138442a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -287,3 +287,28 @@ int ib_umem_page_count(struct ib_umem *umem)
return n;
}
EXPORT_SYMBOL(ib_umem_page_count);
+
+/*
+ * Copy from the given ib_umem's pages to the given buffer.
+ *
+ * umem - the umem to copy from
+ * start - offset to start copying from
+ * dst - destination buffer
+ * length - buffer length
+ *
+ * Returns the number of copied bytes, or an error code.
+ */
+int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst,
+ size_t length)
+{
+ size_t end = start + length;
+
+ if (start > umem->length || end > umem->length || end < start) {
+ pr_err("ib_umem_copy_from not in range.");
+ return -EINVAL;
+ }
+
+ return sg_pcopy_to_buffer(umem->sg_head.sgl, umem->nmap, dst, length,
+ start + ib_umem_offset(umem));
+}
+EXPORT_SYMBOL(ib_umem_copy_from);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 0e120f4..6af91b3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -83,6 +83,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
size_t size, int access, int dmasync);
void ib_umem_release(struct ib_umem *umem);
int ib_umem_page_count(struct ib_umem *umem);
+int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst,
+ size_t length);
#else /* CONFIG_INFINIBAND_USER_MEM */
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 04/20] IB/core: Add support for on demand paging regions
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (2 preceding siblings ...)
2014-03-02 10:49 ` [RFC 03/20] IB/core: Add umem function to read data from user-space Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 05/20] IB/core: Implement support for MMU notifiers regarding " Haggai Eran
` (17 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
* Extend the umem struct to keep the ODP related data.
* Allocate and initialize the ODP related information in the umem
(page_list, dma_list) and freeing as needed in the end of the run.
* Store a reference to the process PID struct in the ucontext. Used to
safely obtain the task_struct and the mm during fault handling, without
preventing the task destruction if needed.
* Add 2 helper functions: ib_umem_odp_map_dma_pages and
ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses of
specific pages of the umem (and, currently, pin them).
* Support for page faults only - IB core will keep the reference on the pages
used and call put_page when freeing an ODP umem area. Invalidations support
will be added in a later patch.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/core/Makefile | 1 +
drivers/infiniband/core/umem.c | 24 +++
drivers/infiniband/core/umem_odp.c | 310 ++++++++++++++++++++++++++++++++++
drivers/infiniband/core/uverbs_cmd.c | 5 +
drivers/infiniband/core/uverbs_main.c | 2 +
include/rdma/ib_umem.h | 2 +
include/rdma/ib_umem_odp.h | 100 +++++++++++
include/rdma/ib_verbs.h | 3 +
8 files changed, 447 insertions(+)
create mode 100644 drivers/infiniband/core/umem_odp.c
create mode 100644 include/rdma/ib_umem_odp.h
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 3ab3865..5af6f4a 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 138442a..e9798e0 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -39,6 +39,7 @@
#include <linux/hugetlb.h>
#include <linux/dma-attrs.h>
#include <linux/slab.h>
+#include <rdma/ib_umem_odp.h>
#include "uverbs.h"
@@ -69,6 +70,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
/**
* ib_umem_get - Pin and DMA map userspace memory.
+ *
+ * If access flags indicate ODP memory, avoid pinning. Instead, stores
+ * the mm for future page fault handling.
+ *
* @context: userspace context to pin memory for
* @addr: userspace virtual address to start at
* @size: length of region to pin
@@ -116,6 +121,17 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
(IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
+ if (access & IB_ACCESS_ON_DEMAND) {
+ ret = ib_umem_odp_get(context, umem);
+ if (ret) {
+ kfree(umem);
+ return ERR_PTR(ret);
+ }
+ return umem;
+ }
+
+ umem->odp_data = NULL;
+
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb = 1;
@@ -234,6 +250,11 @@ void ib_umem_release(struct ib_umem *umem)
struct mm_struct *mm;
unsigned long diff;
+ if (umem->odp_data) {
+ ib_umem_odp_release(umem);
+ return;
+ }
+
__ib_umem_release(umem->context->device, umem, 1);
mm = get_task_mm(current);
@@ -278,6 +299,9 @@ int ib_umem_page_count(struct ib_umem *umem)
int n;
struct scatterlist *sg;
+ if (umem->odp_data)
+ return ib_umem_num_pages(umem);
+
shift = ilog2(umem->page_size);
n = 0;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
new file mode 100644
index 0000000..67b7816
--- /dev/null
+++ b/drivers/infiniband/core/umem_odp.c
@@ -0,0 +1,310 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/pid.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/vmalloc.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_umem.h>
+#include <rdma/ib_umem_odp.h>
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+ int ret_val;
+ struct pid *our_pid;
+
+ /* Prevent creating ODP MRs in child processes */
+ rcu_read_lock();
+ our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+ rcu_read_unlock();
+ put_pid(our_pid);
+ if (context->tgid != our_pid)
+ return -EINVAL;
+
+ umem->hugetlb = 0;
+ umem->odp_data = kzalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+ if (!umem->odp_data)
+ return -ENOMEM;
+
+ mutex_init(&umem->odp_data->umem_mutex);
+
+ umem->odp_data->page_list = vzalloc(ib_umem_num_pages(umem) *
+ sizeof(*umem->odp_data->page_list));
+ if (!umem->odp_data->page_list) {
+ ret_val = -ENOMEM;
+ goto out_odp_data;
+ }
+
+ umem->odp_data->dma_list = vzalloc(ib_umem_num_pages(umem) *
+ sizeof(*umem->odp_data->dma_list));
+ if (!umem->odp_data->dma_list) {
+ ret_val = -ENOMEM;
+ goto out_page_list;
+ }
+
+ return 0;
+
+out_page_list:
+ vfree(umem->odp_data->page_list);
+out_odp_data:
+ kfree(umem->odp_data);
+ return ret_val;
+}
+
+void ib_umem_odp_release(struct ib_umem *umem)
+{
+ /*
+ * Ensure that no more pages are mapped in the umem.
+ *
+ * It is the driver's responsibility to ensure, before calling us,
+ * that the hardware will not attempt to access the MR any more.
+ */
+ ib_umem_odp_unmap_dma_pages(umem, ib_umem_start(umem),
+ ib_umem_end(umem));
+
+ vfree(umem->odp_data->dma_list);
+ vfree(umem->odp_data->page_list);
+ kfree(umem);
+}
+
+/*
+ * Map for DMA and insert a single page into the on-demand paging page tables.
+ *
+ * @umem: the umem to insert the page to.
+ * @page_index: index in the umem to add the page to.
+ * @page: the page struct to map and add.
+ * @access_mask: access permissions needed for this page.
+ * @current_seq: sequence number for synchronization with invalidations.
+ * the sequence number is taken from
+ * umem->odp_data->notifiers_seq.
+ *
+ * The function returns -EFAULT if the DMA mapping operation fails.
+ *
+ * The page is released via put_page even if the operation failed. For
+ * on-demand pinning, the page is released whenever it isn't stored in the
+ * umem.
+ */
+static int ib_umem_odp_map_dma_single_page(
+ struct ib_umem *umem,
+ int page_index,
+ struct page *page,
+ u64 access_mask,
+ unsigned long current_seq)
+{
+ struct ib_device *dev = umem->context->device;
+ dma_addr_t dma_addr;
+ int stored_page = 0;
+ int ret = 0;
+ mutex_lock(&umem->odp_data->umem_mutex);
+ if (!(umem->odp_data->dma_list[page_index])) {
+ dma_addr = ib_dma_map_page(dev,
+ page,
+ 0, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (ib_dma_mapping_error(dev, dma_addr)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ umem->odp_data->dma_list[page_index] = dma_addr | access_mask;
+ umem->odp_data->page_list[page_index] = page;
+ stored_page = 1;
+ } else if (umem->odp_data->page_list[page_index] == page) {
+ /* Same page. Note that when we add support for different
+ * permissions for different pages in the same umem, we will
+ * need to re-map the page for DMA here with the new
+ * permissions. */
+ WARN_ON(access_mask != (umem->odp_data->dma_list[page_index] &
+ ~ODP_DMA_ADDR_MASK));
+ } else {
+ pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n",
+ umem->odp_data->page_list[page_index], page);
+ }
+
+out:
+ mutex_unlock(&umem->odp_data->umem_mutex);
+
+ if (!stored_page)
+ put_page(page);
+
+ return ret;
+}
+
+/**
+ * ib_umem_odp_map_dma_pages - Pin and DMA map userspace memory in an ODP MR.
+ *
+ * Pins the range of pages passed in the argument, and maps them to
+ * DMA addresses. The DMA addresses of the mapped pages is updated in
+ * umem->odp_data->dma_list.
+ *
+ * Returns the number of pages mapped in success, negative error code
+ * for failure.
+ *
+ * @umem: the umem to map and pin
+ * @user_virt: the address from which we need to map.
+ * @bcnt: the minimal number of bytes to pin and map. The mapping might be
+ * bigger due to alignment, and may also be smaller in case of an error
+ * pinning or mapping a page. The actual pages mapped is returned in
+ * the return value.
+ * @access_mask: bit mask of the requested access permissions for the given
+ * range.
+ * @current_seq: the MMU notifiers sequance value for synchronization with
+ * invalidations. the sequance number is read from
+ * umem->odp_data->notifiers_seq before calling this function
+ */
+int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 user_virt, u64 bcnt,
+ u64 access_mask, unsigned long current_seq)
+{
+ struct task_struct *owning_process = NULL;
+ struct mm_struct *owning_mm = NULL;
+ struct page **local_page_list = NULL;
+ u64 off;
+ int j, k, ret = 0, start_idx, npages = 0;
+
+ if (access_mask == 0)
+ return -EINVAL;
+
+ if (user_virt < umem->address ||
+ user_virt + bcnt > umem->address + umem->length)
+ return -EFAULT;
+
+ local_page_list = (struct page **)__get_free_page(GFP_KERNEL);
+ if (!local_page_list)
+ return -ENOMEM;
+
+ off = user_virt & (~PAGE_MASK);
+ user_virt = user_virt & PAGE_MASK;
+ bcnt += off; /* Charge for the first page offset as well. */
+
+ start_idx = (user_virt - ib_umem_start(umem)) >> PAGE_SHIFT;
+ k = start_idx;
+
+ owning_process = get_pid_task(umem->context->tgid, PIDTYPE_PID);
+ if (owning_process == NULL) {
+ ret = -EINVAL;
+ goto out_no_task;
+ }
+
+ owning_mm = get_task_mm(owning_process);
+ if (owning_mm == NULL) {
+ ret = -EINVAL;
+ goto out_put_task;
+ }
+
+ while (bcnt > 0) {
+ down_read(&owning_mm->mmap_sem);
+ /*
+ * Note: this might result in redundent page getting. We can
+ * avoid this by checking dma_list to be 0 before calling
+ * get_user_pages. However, this make the code much more
+ * complex (and doesn't gain us much performance in most use
+ * cases).
+ */
+ npages = get_user_pages(owning_process, owning_mm,
+ user_virt,
+ min_t(size_t, max_t(size_t, 1,
+ bcnt >> PAGE_SHIFT),
+ PAGE_SIZE / sizeof(struct page *)),
+ access_mask & ODP_WRITE_ALLOWED_BIT, 0,
+ local_page_list, NULL);
+ up_read(&owning_mm->mmap_sem);
+
+ if (npages < 0)
+ break;
+
+ bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
+ user_virt += npages << PAGE_SHIFT;
+ for (j = 0; j < npages; ++j) {
+ ret = ib_umem_odp_map_dma_single_page(umem, k,
+ local_page_list[j], access_mask, current_seq);
+ if (ret < 0)
+ break;
+ k++;
+ }
+
+ if (ret < 0) {
+ /* Release left over pages when handling errors. */
+ for (++j; j < npages; ++j)
+ put_page(local_page_list[j]);
+ }
+ }
+
+ if (ret >= 0) {
+ if (npages < 0 && k == start_idx)
+ ret = npages;
+ else
+ ret = k - start_idx;
+ }
+
+ mmput(owning_mm);
+out_put_task:
+ put_task_struct(owning_process);
+out_no_task:
+ free_page((unsigned long) local_page_list);
+ return ret;
+}
+EXPORT_SYMBOL(ib_umem_odp_map_dma_pages);
+
+void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
+ u64 bound)
+{
+ int idx;
+ u64 addr;
+ struct ib_device *dev = umem->context->device;
+ virt = max_t(u64, virt, ib_umem_start(umem));
+ bound = min_t(u64, bound, ib_umem_end(umem));
+ for (addr = virt; addr < bound; addr += (u64)umem->page_size) {
+ idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
+ mutex_lock(&umem->odp_data->umem_mutex);
+ if (umem->odp_data->page_list[idx]) {
+ struct page *page = umem->odp_data->page_list[idx];
+ struct page *head_page = compound_trans_head(page);
+ dma_addr_t dma_addr = umem->odp_data->dma_list[idx] &
+ ODP_DMA_ADDR_MASK;
+
+ WARN_ON(!dma_addr);
+
+ ib_dma_unmap_page(dev, dma_addr, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
+ if (umem->writable)
+ set_page_dirty_lock(head_page);
+ put_page(page);
+ }
+ mutex_unlock(&umem->odp_data->umem_mutex);
+ }
+}
+EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
+
+
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 2795d86..6c6be37 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -36,6 +36,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/slab.h>
+#include <linux/sched.h>
#include <asm/uaccess.h>
@@ -325,6 +326,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
INIT_LIST_HEAD(&ucontext->ah_list);
INIT_LIST_HEAD(&ucontext->xrcd_list);
INIT_LIST_HEAD(&ucontext->rule_list);
+ rcu_read_lock();
+ ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+ rcu_read_unlock();
ucontext->closing = 0;
resp.num_comp_vectors = file->device->num_comp_vectors;
@@ -371,6 +375,7 @@ err_fd:
put_unused_fd(resp.async_fd);
err_free:
+ put_pid(ucontext->tgid);
ibdev->dealloc_ucontext(ucontext);
err:
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index a767298..a7384f8 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -298,6 +298,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
kfree(uobj);
}
+ put_pid(context->tgid);
+
return context->device->dealloc_ucontext(context);
}
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 6af91b3..55fc858 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -38,6 +38,7 @@
#include <linux/workqueue.h>
struct ib_ucontext;
+struct ib_umem_odp;
struct ib_umem {
struct ib_ucontext *context;
@@ -49,6 +50,7 @@ struct ib_umem {
struct work_struct work;
struct mm_struct *mm;
unsigned long diff;
+ struct ib_umem_odp *odp_data;
struct sg_table sg_head;
int nmap;
int npages;
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
new file mode 100644
index 0000000..375ce28
--- /dev/null
+++ b/include/rdma/ib_umem_odp.h
@@ -0,0 +1,100 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef IB_UMEM_ODP_H
+#define IB_UMEM_ODP_H
+
+#include <rdma/ib_umem.h>
+
+struct ib_umem_odp {
+ /*
+ * An array of the pages included in the on-demand paging umem.
+ * Indices of pages that are currently not mapped into the device will
+ * contain NULL.
+ */
+ struct page **page_list;
+ /*
+ * An array of the same size as page_list, with DMA addresses mapped
+ * for pages the pages in page_list. The lower two bits designate
+ * access permissions. See ODP_READ_ALLOWED_BIT and
+ * ODP_WRITE_ALLOWED_BIT.
+ */
+ dma_addr_t *dma_list;
+ /*
+ * The umem_mutex protects the page_list and dma_list fields of an ODP
+ * umem, allowing only a single thread to map/unmap pages.
+ */
+ struct mutex umem_mutex;
+ void *private; /* for the HW driver to use. */
+
+ atomic_t notifiers_seq;
+ atomic_t notifiers_count;
+};
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem);
+
+void ib_umem_odp_release(struct ib_umem *umem);
+
+/*
+ * The lower 2 bits of the DMA address signal the R/W permissions for
+ * the entry. To upgrade the permissions, provide the appropriate
+ * bitmask to the map_dma_pages function.
+ *
+ * Be aware that upgrading a mapped address might result in change of
+ * the DMA address for the page.
+ */
+#define ODP_READ_ALLOWED_BIT (1<<0ULL)
+#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
+
+#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
+
+int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 start_offset, u64 bcnt,
+ u64 access_mask, unsigned long current_seq);
+
+void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 start_offset,
+ u64 bound);
+
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
+static inline int ib_umem_odp_get(struct ib_ucontext *context,
+ struct ib_umem *umem)
+{
+ return -EINVAL;
+}
+
+static inline void ib_umem_odp_release(struct ib_umem *umem) {}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
+#endif /* IB_UMEM_ODP_H */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 129261c..c557bb5 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1153,6 +1153,9 @@ struct ib_ucontext {
struct list_head xrcd_list;
struct list_head rule_list;
int closing;
+
+ /* For ODP support: */
+ struct pid *tgid;
};
struct ib_uobject {
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 05/20] IB/core: Implement support for MMU notifiers regarding on demand paging regions
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (3 preceding siblings ...)
2014-03-02 10:49 ` [RFC 04/20] IB/core: Add support for on demand paging regions Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 06/20] IB/mlx5: Fix error handling in reg_umr Haggai Eran
` (16 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran, Yuval Dagan
From: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
* Add an interval tree implementation for ODP umems. Create an interval tree
for each ucontext (including a count of the number of ODP MRs in this
context, mutex, etc.), and register ODP umems in the interval tree.
* Add MMU notifiers handling functions, using the interval tree to notify only
the relevant umems and underlying MRs.
* Register to receive MMU notifier events from the MM subsystem upon ODP MR
registration (and unregister accordingly).
* Add a completion object to synchronize the destruction of ODP umems.
* Add mechanism to abort page faults when there's a concurrent invalidation,
and modify mlx5_ib to match the new interface to
ib_umem_odp_map_dma_single_page.
The way we synchronize between concurrent invalidations and page faults is by
keeping a counter of currently running invalidations, and a sequence number
that is incremented whenever an invalidation is caught. The page fault code
checks the counter and also verifies that the sequence number hasn't
progressed before it updates the umem's page tables. This is similar to what
the kvm module does.
There's currently a rare race in the code when registering a umem in the
middle of an ongoing notifier. The proper fix is to either serialize the
insertion to our umem tree with mm_lock_all or use a ucontext wide running
notifiers count for retries decision. Either is ugly and can lead to some sort
of starvation. The current workaround is ugly as well - now the user can end
up with mapped addresses that are not in the user's address space (although it
is highly unlikely).
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Yuval Dagan <yuvalda-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/core/Makefile | 2 +-
drivers/infiniband/core/umem.c | 2 +-
drivers/infiniband/core/umem_odp.c | 317 +++++++++++++++++++++++++++++++++-
drivers/infiniband/core/umem_rbtree.c | 94 ++++++++++
drivers/infiniband/core/uverbs_cmd.c | 16 ++
include/rdma/ib_umem_odp.h | 56 ++++++
include/rdma/ib_verbs.h | 16 ++
8 files changed, 494 insertions(+), 10 deletions(-)
create mode 100644 drivers/infiniband/core/umem_rbtree.c
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 089a2c2..b899531 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -41,6 +41,7 @@ config INFINIBAND_USER_MEM
config INFINIBAND_ON_DEMAND_PAGING
bool "InfiniBand on-demand paging support"
depends on INFINIBAND_USER_MEM
+ select MMU_NOTIFIER
default y
---help---
On demand paging support for the InfiniBand subsystem.
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 5af6f4a..021c563 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \
ib_core-y := packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
-ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
ib_mad-y := mad.o smi.o agent.o mad_rmpp.o
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e9798e0..014977f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -72,7 +72,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
* ib_umem_get - Pin and DMA map userspace memory.
*
* If access flags indicate ODP memory, avoid pinning. Instead, stores
- * the mm for future page fault handling.
+ * the mm for future page fault handling in conjuction with MMU notifiers.
*
* @context: userspace context to pin memory for
* @addr: userspace virtual address to start at
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 67b7816..f16559c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,26 +41,204 @@
#include <rdma/ib_umem.h>
#include <rdma/ib_umem_odp.h>
+void ib_umem_notifier_start_account(struct ib_umem *item)
+{
+ int notifiers_count;
+ mutex_lock(&item->odp_data->umem_mutex);
+ /*
+ * Avoid performing another locked operation, as we are
+ * already protected by the wrapping mutex.
+ */
+ notifiers_count = atomic_read(&item->odp_data->notifiers_count) + 1;
+ if (notifiers_count == 1)
+ reinit_completion(&item->odp_data->notifier_completion);
+ atomic_set(&item->odp_data->notifiers_count,
+ notifiers_count);
+ mutex_unlock(&item->odp_data->umem_mutex);
+}
+EXPORT_SYMBOL(ib_umem_notifier_start_account);
+
+void ib_umem_notifier_end_account(struct ib_umem *item)
+{
+ int notifiers_count, notifiers_seq;
+ mutex_lock(&item->odp_data->umem_mutex);
+ /*
+ * This sequence increase will notify the QP page fault that
+ * the page that is going to be mapped in the spte could have
+ * been freed.
+ */
+ notifiers_seq = atomic_read(&item->odp_data->notifiers_seq) + 1;
+ atomic_set(&item->odp_data->notifiers_seq,
+ notifiers_seq);
+ /*
+ * The above sequence increase must be visible before the
+ * below count decrease, which is ensured by the smp_wmb below
+ * in conjunction with the smp_rmb in mmu_notifier_retry().
+ */
+ smp_wmb();
+
+ notifiers_count = atomic_read(&item->odp_data->notifiers_count);
+ /*
+ * This is a workaround for the unlikely case where we register a umem
+ * in the middle of an ongoing notifier.
+ */
+ if (notifiers_count > 0)
+ notifiers_count -= 1;
+ else
+ pr_warn("Got notifier end call without a previous start call");
+ atomic_set(&item->odp_data->notifiers_count,
+ notifiers_count);
+ if (notifiers_count == 0)
+ complete_all(&item->odp_data->notifier_completion);
+ mutex_unlock(&item->odp_data->umem_mutex);
+}
+
+
+static int ib_umem_notifier_release_trampoline(struct ib_umem *item, u64 start,
+ u64 end, void *cookie) {
+ /*
+ * Increase the number of notifiers running, to
+ * prevent any further fault handling on this MR.
+ */
+ ib_umem_notifier_start_account(item);
+ item->odp_data->dying = 1;
+ /* Make sure that the fact the umem is dying is out before we release
+ * all pending page faults. */
+ smp_wmb();
+ complete_all(&item->odp_data->notifier_completion);
+ item->context->invalidate_range(item, ib_umem_start(item),
+ ib_umem_end(item));
+ return 0;
+}
+
+static void ib_umem_notifier_release(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+
+ if (!context->invalidate_range)
+ return;
+
+ down_read(&context->umem_mutex);
+
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
+ ULLONG_MAX,
+ ib_umem_notifier_release_trampoline,
+ NULL);
+
+ up_read(&context->umem_mutex);
+}
+
+static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
+ u64 end, void *cookie)
+{
+ ib_umem_notifier_start_account(item);
+ item->context->invalidate_range(item, start, start + PAGE_SIZE);
+ ib_umem_notifier_end_account(item);
+ return 0;
+}
+
+static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long address)
+{
+ struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+
+ if (!context->invalidate_range)
+ return;
+
+ down_read(&context->umem_mutex);
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, address,
+ address + PAGE_SIZE,
+ invalidate_page_trampoline, NULL);
+ up_read(&context->umem_mutex);
+}
+
+static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
+ u64 end, void *cookie)
+{
+ ib_umem_notifier_start_account(item);
+ item->context->invalidate_range(item, start, end);
+ return 0;
+}
+
+static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+
+ if (!context->invalidate_range)
+ return;
+
+ down_read(&context->umem_mutex);
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+ end,
+ invalidate_range_start_trampoline, NULL);
+ up_read(&context->umem_mutex);
+}
+
+static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
+ u64 end, void *cookie)
+{
+ ib_umem_notifier_end_account(item);
+ return 0;
+}
+
+static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+
+ if (!context->invalidate_range)
+ return;
+
+ down_read(&context->umem_mutex);
+ rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+ end,
+ invalidate_range_end_trampoline, NULL);
+ up_read(&context->umem_mutex);
+}
+
+static struct mmu_notifier_ops ib_umem_notifiers = {
+ .release = ib_umem_notifier_release,
+ .invalidate_page = ib_umem_notifier_invalidate_page,
+ .invalidate_range_start = ib_umem_notifier_invalidate_range_start,
+ .invalidate_range_end = ib_umem_notifier_invalidate_range_end,
+};
+
int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
{
int ret_val;
struct pid *our_pid;
+ struct mm_struct *mm = get_task_mm(current);
+ BUG_ON(!mm);
/* Prevent creating ODP MRs in child processes */
rcu_read_lock();
our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
rcu_read_unlock();
put_pid(our_pid);
- if (context->tgid != our_pid)
- return -EINVAL;
+ if (context->tgid != our_pid) {
+ ret_val = -EINVAL;
+ goto out_mm;
+ }
umem->hugetlb = 0;
umem->odp_data = kzalloc(sizeof(*umem->odp_data), GFP_KERNEL);
- if (!umem->odp_data)
- return -ENOMEM;
+ if (!umem->odp_data) {
+ ret_val = -ENOMEM;
+ goto out_mm;
+ }
+ umem->odp_data->umem = umem;
mutex_init(&umem->odp_data->umem_mutex);
+ init_completion(&umem->odp_data->notifier_completion);
+
umem->odp_data->page_list = vzalloc(ib_umem_num_pages(umem) *
sizeof(*umem->odp_data->page_list));
if (!umem->odp_data->page_list) {
@@ -75,17 +253,66 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
goto out_page_list;
}
+ /*
+ * When using MMU notifiers, we will get a
+ * notification before the "current" task (and MM) is
+ * destroyed. We use the umem_mutex lock to synchronize.
+ */
+ down_write(&context->umem_mutex);
+ context->odp_mrs_count++;
+ if (likely(ib_umem_start(umem) != ib_umem_end(umem)))
+ rbt_ib_umem_insert(&umem->odp_data->interval_tree,
+ &context->umem_tree);
+ downgrade_write(&context->umem_mutex);
+
+ if (context->odp_mrs_count == 1) {
+ /*
+ * Note that at this point, no MMU notifier is running
+ * for this context!
+ */
+ INIT_HLIST_NODE(&context->mn.hlist);
+ context->mn.ops = &ib_umem_notifiers;
+ /*
+ * Lock-dep detects a false positive for mmap_sem vs.
+ * umem_mutex, due to not grasping downgrade_write correctly.
+ */
+ lockdep_off();
+ ret_val = mmu_notifier_register(&context->mn, mm);
+ lockdep_on();
+ if (ret_val) {
+ pr_err("Failed to register mmu_notifier %d\n", ret_val);
+ ret_val = -EBUSY;
+ goto out_mutex;
+ }
+ }
+
+ up_read(&context->umem_mutex);
+
+ /*
+ * Note that doing an mmput can cause a notifier for the relevant mm.
+ * If the notifier is called while we hold the umem_mutex, this will
+ * cause a deadlock. Therefore, we release the reference only after we
+ * released the mutex.
+ */
+ mmput(mm);
return 0;
+out_mutex:
+ up_read(&context->umem_mutex);
+ vfree(umem->odp_data->dma_list);
out_page_list:
vfree(umem->odp_data->page_list);
out_odp_data:
kfree(umem->odp_data);
+out_mm:
+ mmput(mm);
return ret_val;
}
void ib_umem_odp_release(struct ib_umem *umem)
{
+ struct ib_ucontext *context = umem->context;
+
/*
* Ensure that no more pages are mapped in the umem.
*
@@ -95,6 +322,49 @@ void ib_umem_odp_release(struct ib_umem *umem)
ib_umem_odp_unmap_dma_pages(umem, ib_umem_start(umem),
ib_umem_end(umem));
+ down_write(&context->umem_mutex);
+ if (likely(ib_umem_start(umem) != ib_umem_end(umem)))
+ rbt_ib_umem_remove(&umem->odp_data->interval_tree,
+ &context->umem_tree);
+ context->odp_mrs_count--;
+
+ /*
+ * Downgrade the lock to a read lock. This ensures that the notifiers
+ * (who lock the mutex for reading) will be able to finish, and we
+ * will be able to enventually obtain the mmu notifiers SRCU. Note
+ * that since we are doing it atomically, no other user could register
+ * and unregister while we do the check.
+ */
+ downgrade_write(&context->umem_mutex);
+ if (!context->odp_mrs_count) {
+ struct task_struct *owning_process = NULL;
+ struct mm_struct *owning_mm = NULL;
+ owning_process = get_pid_task(context->tgid,
+ PIDTYPE_PID);
+ if (owning_process == NULL)
+ /*
+ * The process is already dead, notifier were removed
+ * already.
+ */
+ goto out;
+
+ owning_mm = get_task_mm(owning_process);
+ if (owning_mm == NULL)
+ /*
+ * The process' mm is already dead, notifier were
+ * removed already.
+ */
+ goto out_put_task;
+ mmu_notifier_unregister(&context->mn, owning_mm);
+
+ mmput(owning_mm);
+
+out_put_task:
+ put_task_struct(owning_process);
+ }
+out:
+ up_read(&context->umem_mutex);
+
vfree(umem->odp_data->dma_list);
vfree(umem->odp_data->page_list);
kfree(umem);
@@ -111,7 +381,8 @@ void ib_umem_odp_release(struct ib_umem *umem)
* the sequence number is taken from
* umem->odp_data->notifiers_seq.
*
- * The function returns -EFAULT if the DMA mapping operation fails.
+ * The function returns -EFAULT if the DMA mapping operation fails. It returns
+ * -EAGAIN if a concurrent invalidation prevents us from updating the page. It
*
* The page is released via put_page even if the operation failed. For
* on-demand pinning, the page is released whenever it isn't stored in the
@@ -129,6 +400,15 @@ static int ib_umem_odp_map_dma_single_page(
int stored_page = 0;
int ret = 0;
mutex_lock(&umem->odp_data->umem_mutex);
+ /*
+ * Note: we avoid writing if seq is different from the initial seq, to
+ * handle case of a racing notifier. This check also allows us to bail
+ * early if we have a notifier running in parallel with us.
+ */
+ if (ib_umem_mmu_notifier_retry(umem, current_seq)) {
+ ret = -EAGAIN;
+ goto out;
+ }
if (!(umem->odp_data->dma_list[page_index])) {
dma_addr = ib_dma_map_page(dev,
page,
@@ -156,7 +436,8 @@ static int ib_umem_odp_map_dma_single_page(
out:
mutex_unlock(&umem->odp_data->umem_mutex);
- if (!stored_page)
+ /* On Demand Paging - avoid pinning the page */
+ if (umem->context->invalidate_range || !stored_page)
put_page(page);
return ret;
@@ -171,6 +452,8 @@ out:
*
* Returns the number of pages mapped in success, negative error code
* for failure.
+ * An -EAGAIN error code is returned when a concurrent mmu notifier prevents
+ * the function from completing its task.
*
* @umem: the umem to map and pin
* @user_virt: the address from which we need to map.
@@ -285,6 +568,11 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
struct ib_device *dev = umem->context->device;
virt = max_t(u64, virt, ib_umem_start(umem));
bound = min_t(u64, bound, ib_umem_end(umem));
+ /* Note that during the run of this function, the
+ * notifiers_count of the MR is > 0, preventing any racing
+ * faults from completion. We might be racing with other
+ * invalidations, so we must make sure we free each page only
+ * once. */
for (addr = virt; addr < bound; addr += (u64)umem->page_size) {
idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
mutex_lock(&umem->odp_data->umem_mutex);
@@ -299,8 +587,21 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
ib_dma_unmap_page(dev, dma_addr, PAGE_SIZE,
DMA_BIDIRECTIONAL);
if (umem->writable)
- set_page_dirty_lock(head_page);
- put_page(page);
+ /*
+ * set_page_dirty prefers being called with
+ * the page lock. However, MMU notifiers are
+ * called sometimes with and sometimes without
+ * the lock. We rely on the umem_mutex instead
+ * to prevent other mmu notifiers from
+ * continuing and allowing the page mapping to
+ * be removed.
+ */
+ set_page_dirty(head_page);
+ /* on demand pinning support */
+ if (!umem->context->invalidate_range)
+ put_page(page);
+ umem->odp_data->page_list[idx] = NULL;
+ umem->odp_data->dma_list[idx] = 0;
}
mutex_unlock(&umem->odp_data->umem_mutex);
}
diff --git a/drivers/infiniband/core/umem_rbtree.c b/drivers/infiniband/core/umem_rbtree.c
new file mode 100644
index 0000000..727d788
--- /dev/null
+++ b/drivers/infiniband/core/umem_rbtree.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/interval_tree_generic.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <rdma/ib_umem_odp.h>
+
+/*
+ * The ib_umem list keeps track of memory regions for which the HW
+ * device request to receive notification when the related memory
+ * mapping is changed.
+ *
+ * ib_umem_lock protects the list.
+ */
+
+static inline u64 node_start(struct umem_odp_node *n)
+{
+ struct ib_umem_odp *umem_odp =
+ container_of(n, struct ib_umem_odp, interval_tree);
+
+ return ib_umem_start(umem_odp->umem);
+}
+
+/* Note that the representation of the intervals in the interval tree
+ * considers the ending point as contained in the interval, while the
+ * function ib_umem_end returns the first address which is not contained
+ * in the umem.
+ */
+static inline u64 node_last(struct umem_odp_node *n)
+{
+ struct ib_umem_odp *umem_odp =
+ container_of(n, struct ib_umem_odp, interval_tree);
+
+ return ib_umem_end(umem_odp->umem) - 1;
+}
+
+INTERVAL_TREE_DEFINE(struct umem_odp_node, rb, u64, __subtree_last,
+ node_start, node_last, , rbt_ib_umem)
+
+/* @last is not a part of the interval. See comment for function
+ * node_last.
+ */
+int rbt_ib_umem_for_each_in_range(struct rb_root *root,
+ u64 start, u64 last,
+ umem_call_back cb,
+ void *cookie)
+{
+ int ret_val = 0;
+ struct umem_odp_node *node;
+ struct ib_umem_odp *umem;
+
+ if (unlikely(start == last))
+ return ret_val;
+
+ for (node = rbt_ib_umem_iter_first(root, start, last - 1); node;
+ node = rbt_ib_umem_iter_next(node, start, last - 1)) {
+ umem = container_of(node, struct ib_umem_odp, interval_tree);
+ ret_val = cb(umem->umem, start, last, cookie) || ret_val;
+ }
+
+ return ret_val;
+}
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6c6be37..816b85c 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -289,6 +289,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
struct ib_uverbs_get_context_resp resp;
struct ib_udata udata;
struct ib_device *ibdev = file->device->ib_dev;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ struct ib_device_attr dev_attr;
+#endif
struct ib_ucontext *ucontext;
struct file *filp;
int ret;
@@ -331,6 +334,19 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
rcu_read_unlock();
ucontext->closing = 0;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ ucontext->umem_tree = RB_ROOT;
+ init_rwsem(&ucontext->umem_mutex);
+ ucontext->odp_mrs_count = 0;
+
+ ret = ib_query_device(ibdev, &dev_attr);
+ if (ret)
+ goto err_free;
+ if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
+ ucontext->invalidate_range = NULL;
+
+#endif
+
resp.num_comp_vectors = file->device->num_comp_vectors;
ret = get_unused_fd_flags(O_CLOEXEC);
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 375ce28..bcdaa2e 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -34,6 +34,12 @@
#define IB_UMEM_ODP_H
#include <rdma/ib_umem.h>
+#include <linux/interval_tree.h>
+
+struct umem_odp_node {
+ u64 __subtree_last;
+ struct rb_node rb;
+};
struct ib_umem_odp {
/*
@@ -58,6 +64,14 @@ struct ib_umem_odp {
atomic_t notifiers_seq;
atomic_t notifiers_count;
+
+ struct ib_umem *umem;
+
+ /* Tree tracking */
+ struct umem_odp_node interval_tree;
+
+ struct completion notifier_completion;
+ int dying;
};
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
@@ -85,6 +99,48 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 start_offset, u64 bcnt,
void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 start_offset,
u64 bound);
+void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
+void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
+typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
+ void *cookie);
+/*
+ * Call the callback on each ib_umem in the range. Returns the logical or of
+ * the return values of the functions called.
+ */
+int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
+ umem_call_back cb, void *cookie);
+
+struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
+ u64 start, u64 last);
+struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
+ u64 start, u64 last);
+
+static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
+ unsigned long mmu_seq)
+{
+ /*
+ * This code is strongly based on the KVM code from
+ * mmu_notifier_retry. Should be called with
+ * item->odp_data->umem_mutex locked.
+ */
+ if (unlikely(atomic_read(&item->odp_data->notifiers_count)))
+ return 1;
+ /*
+ * Ensure the read of mmu_notifier_count happens before the read
+ * of mmu_notifier_seq. This interacts with the smp_wmb() in
+ * mmu_notifier_invalidate_range_end to make sure that the caller
+ * either sees the old (non-zero) value of mmu_notifier_count or
+ * the new (incremented) value of mmu_notifier_seq.
+ */
+ smp_rmb();
+ if (atomic_read(&item->odp_data->notifiers_seq) != mmu_seq)
+ return 1;
+ return 0;
+}
+
+void ib_umem_notifier_start_account(struct ib_umem *item);
+void ib_umem_notifier_end_account(struct ib_umem *item);
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline int ib_umem_odp_get(struct ib_ucontext *context,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c557bb5..7dff94d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -51,6 +51,7 @@
#include <uapi/linux/if_ether.h>
#include <linux/atomic.h>
+#include <linux/mmu_notifier.h>
#include <asm/uaccess.h>
extern struct workqueue_struct *ib_wq;
@@ -1141,6 +1142,8 @@ struct ib_fmr_attr {
u8 page_shift;
};
+struct ib_umem;
+
struct ib_ucontext {
struct ib_device *device;
struct list_head pd_list;
@@ -1156,6 +1159,19 @@ struct ib_ucontext {
/* For ODP support: */
struct pid *tgid;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ struct rb_root umem_tree;
+ /*
+ * Protects .umem_rbroot and tree, as well as odp_mrs_count and
+ * mmu notifiers registration.
+ */
+ struct rw_semaphore umem_mutex;
+ void (*invalidate_range)(struct ib_umem *umem,
+ unsigned long start, unsigned long end);
+
+ struct mmu_notifier mn;
+ int odp_mrs_count;
+#endif
};
struct ib_uobject {
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 06/20] IB/mlx5: Fix error handling in reg_umr
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (4 preceding siblings ...)
2014-03-02 10:49 ` [RFC 05/20] IB/core: Implement support for MMU notifiers regarding " Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 07/20] IB/mlx5: Add MR to radix tree in reg_mr_callback Haggai Eran
` (15 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
If ib_post_send fails when posting the UMR work request in reg_umr, the code
doesn't release the temporary pas buffer allocated, and doesn't dma_unmap it.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 43fb96d..24a68aa 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -730,7 +730,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
struct mlx5_ib_mr *mr;
struct ib_sge sg;
int size = sizeof(u64) * npages;
- int err;
+ int err = 0;
int i;
for (i = 0; i < 1; i++) {
@@ -751,7 +751,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
mr->pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
if (!mr->pas) {
err = -ENOMEM;
- goto error;
+ goto free_mr;
}
mlx5_ib_populate_pas(dev, umem, page_shift,
@@ -760,9 +760,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
mr->dma = dma_map_single(ddev, mr_align(mr->pas, MLX5_UMR_ALIGN), size,
DMA_TO_DEVICE);
if (dma_mapping_error(ddev, mr->dma)) {
- kfree(mr->pas);
err = -ENOMEM;
- goto error;
+ goto free_pas;
}
memset(&wr, 0, sizeof(wr));
@@ -778,26 +777,28 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
mlx5_ib_warn(dev, "post send failed, err %d\n", err);
- up(&umrc->sem);
- goto error;
+ goto unmap_dma;
}
wait_for_completion(&mr->done);
- up(&umrc->sem);
+ if (mr->status != IB_WC_SUCCESS) {
+ mlx5_ib_warn(dev, "reg umr failed\n");
+ err = -EFAULT;
+ }
+unmap_dma:
+ up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
+
+free_pas:
kfree(mr->pas);
- if (mr->status != IB_WC_SUCCESS) {
- mlx5_ib_warn(dev, "reg umr failed\n");
- err = -EFAULT;
- goto error;
+free_mr:
+ if (err) {
+ free_cached_mr(dev, mr);
+ return ERR_PTR(err);
}
return mr;
-
-error:
- free_cached_mr(dev, mr);
- return ERR_PTR(err);
}
static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 07/20] IB/mlx5: Add MR to radix tree in reg_mr_callback
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (5 preceding siblings ...)
2014-03-02 10:49 ` [RFC 06/20] IB/mlx5: Fix error handling in reg_umr Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 08/20] mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR Haggai Eran
` (14 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
For memory regions that are allocated using reg_umr, the suffix of
mlx5_core_create_mkey isn't being called. Instead the creation is completed in
a callback function (reg_mr_callback). This means that these MRs aren't being
added to the MR radix tree. The patch add them in the callback.
A later patch that adds page fault handling uses the MR radix tree to find the
MR pointer given an key that is read from the page fault event from hardware,
or from the WQE that caused the page fault. Therefore we need all MRs,
including those created with the reg_mr_callback to be in that tree.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 24a68aa..f447257 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -73,6 +73,8 @@ static void reg_mr_callback(int status, void *context)
struct mlx5_cache_ent *ent = &cache->ent[c];
u8 key;
unsigned long flags;
+ struct mlx5_mr_table *table = &dev->mdev.priv.mr_table;
+ int err;
spin_lock_irqsave(&ent->lock, flags);
ent->pending--;
@@ -107,6 +109,13 @@ static void reg_mr_callback(int status, void *context)
ent->cur++;
ent->size++;
spin_unlock_irqrestore(&ent->lock, flags);
+
+ write_lock_irq(&table->lock);
+ err = radix_tree_insert(&table->tree, mlx5_base_mkey(mr->mmr.key),
+ &mr->mmr);
+ if (err)
+ pr_err("Error inserting to mr tree. 0x%x\n", -err);
+ write_unlock_irq(&table->lock);
}
static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 08/20] mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (6 preceding siblings ...)
2014-03-02 10:49 ` [RFC 07/20] IB/mlx5: Add MR to radix tree in reg_mr_callback Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 09/20] IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs Haggai Eran
` (13 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
The patch stores iova, pd and size during mr creation and after UMRs that
modify them. It removes the unused access flags field.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 4 ++++
drivers/net/ethernet/mellanox/mlx5/core/mr.c | 4 ++++
include/linux/mlx5/driver.h | 1 -
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index f447257..66b7290 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -794,6 +794,10 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
err = -EFAULT;
}
+ mr->mmr.iova = virt_addr;
+ mr->mmr.size = len;
+ mr->mmr.pd = to_mpd(pd)->pdn;
+
unmap_dma:
up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index 4cc9276..ac52a0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -82,7 +82,11 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
return mlx5_cmd_status_to_err(&lout.hdr);
}
+ mr->iova = be64_to_cpu(in->seg.start_addr);
+ mr->size = be64_to_cpu(in->seg.len);
mr->key = mlx5_idx_to_mkey(be32_to_cpu(lout.mkey) & 0xffffff) | key;
+ mr->pd = be32_to_cpu(in->seg.flags_pd) & 0xffffff;
+
mlx5_core_dbg(dev, "out 0x%x, key 0x%x, mkey 0x%x\n",
be32_to_cpu(lout.mkey), key, mr->key);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 93cef63..2bce4aa 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -427,7 +427,6 @@ struct mlx5_core_mr {
u64 size;
u32 key;
u32 pd;
- u32 access;
};
struct mlx5_core_srq {
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 09/20] IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (7 preceding siblings ...)
2014-03-02 10:49 ` [RFC 08/20] mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 10/20] IB/mlx5: Enhance UMR support to allow partial page table update Haggai Eran
` (12 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
For user QPs, the creation process does not currently initialize the fields:
* qp->rq.offset
* qp->sq.offset
* qp->sq.wqe_shift
These fields are used for handling page faults to calculate where to read the
faulting WQE from.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/qp.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index b5cf2c4..335bcbe 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -574,6 +574,10 @@ static int create_user_qp(struct mlx5_ib_dev *dev, struct ib_pd *pd,
uar_index = uuarn_to_uar_index(&context->uuari, uuarn);
mlx5_ib_dbg(dev, "uuarn 0x%x, uar_index 0x%x\n", uuarn, uar_index);
+ qp->rq.offset = 0;
+ qp->sq.wqe_shift = ilog2(MLX5_SEND_WQE_BB);
+ qp->sq.offset = qp->rq.wqe_cnt << qp->rq.wqe_shift;
+
err = set_user_buf_size(dev, qp, &ucmd);
if (err)
goto err_uuar;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 10/20] IB/mlx5: Enhance UMR support to allow partial page table update
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (8 preceding siblings ...)
2014-03-02 10:49 ` [RFC 09/20] IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 11/20] IB/mlx5: Refactor UMR to have its own context struct Haggai Eran
` (11 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
The current UMR interface doesn't allow partial updates to a memory region's
page tables. This patch changes the interface to allow that.
It also changes the way the UMR operation validates the memory region's state.
When set, IB_SEND_UMR_CHECK_FREE will cause the UMR operation to fail if the
MKEY is in the free state. When it is unchecked the operation will check that
it isn't in the free state.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 15 ++++++
drivers/infiniband/hw/mlx5/mr.c | 22 ++++----
drivers/infiniband/hw/mlx5/qp.c | 97 +++++++++++++++++++++++-------------
include/linux/mlx5/device.h | 9 ++++
4 files changed, 100 insertions(+), 43 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 5054158..afe39e7 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -111,6 +111,8 @@ struct mlx5_ib_pd {
*/
#define MLX5_IB_SEND_UMR_UNREG IB_SEND_RESERVED_START
+#define MLX5_IB_SEND_UMR_CHECK_FREE (IB_SEND_RESERVED_START << 1)
+#define MLX5_IB_SEND_UMR_UPDATE_MTT (IB_SEND_RESERVED_START << 2)
#define MLX5_IB_QPT_REG_UMR IB_QPT_RESERVED1
#define MLX5_IB_WR_UMR IB_WR_RESERVED1
@@ -206,6 +208,19 @@ enum mlx5_ib_qp_flags {
MLX5_IB_QP_SIGNATURE_HANDLING = 1 << 1,
};
+struct mlx5_umr_wr {
+ union {
+ u64 virt_addr;
+ u64 offset;
+ } target;
+ struct ib_pd *pd;
+ unsigned int page_shift;
+ unsigned int npages;
+ u32 length;
+ int access_flags;
+ u32 mkey;
+};
+
struct mlx5_shared_mr_info {
int mr_id;
struct ib_umem *umem;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 66b7290..dd6a4bb 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
#include <linux/export.h>
#include <linux/delay.h>
#include <rdma/ib_umem.h>
+#include <rdma/ib_verbs.h>
#include "mlx5_ib.h"
enum {
@@ -675,6 +676,7 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct ib_send_wr *wr,
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct ib_mr *mr = dev->umrc.mr;
+ struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
sg->addr = dma;
sg->length = ALIGN(sizeof(u64) * n, 64);
@@ -689,21 +691,23 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct ib_send_wr *wr,
wr->num_sge = 0;
wr->opcode = MLX5_IB_WR_UMR;
- wr->wr.fast_reg.page_list_len = n;
- wr->wr.fast_reg.page_shift = page_shift;
- wr->wr.fast_reg.rkey = key;
- wr->wr.fast_reg.iova_start = virt_addr;
- wr->wr.fast_reg.length = len;
- wr->wr.fast_reg.access_flags = access_flags;
- wr->wr.fast_reg.page_list = (struct ib_fast_reg_page_list *)pd;
+
+ umrwr->npages = n;
+ umrwr->page_shift = page_shift;
+ umrwr->mkey = key;
+ umrwr->target.virt_addr = virt_addr;
+ umrwr->length = len;
+ umrwr->access_flags = access_flags;
+ umrwr->pd = pd;
}
static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev,
struct ib_send_wr *wr, u32 key)
{
- wr->send_flags = MLX5_IB_SEND_UMR_UNREG;
+ struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
+ wr->send_flags = MLX5_IB_SEND_UMR_UNREG | MLX5_IB_SEND_UMR_CHECK_FREE;
wr->opcode = MLX5_IB_WR_UMR;
- wr->wr.fast_reg.rkey = key;
+ umrwr->mkey = key;
}
void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 335bcbe..e48b699 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -70,15 +70,6 @@ static const u32 mlx5_ib_opcode[] = {
[MLX5_IB_WR_UMR] = MLX5_OPCODE_UMR,
};
-struct umr_wr {
- u64 virt_addr;
- struct ib_pd *pd;
- unsigned int page_shift;
- unsigned int npages;
- u32 length;
- int access_flags;
- u32 mkey;
-};
static int is_qp0(enum ib_qp_type qp_type)
{
@@ -1818,37 +1809,71 @@ static void set_frwr_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
umr->mkey_mask = frwr_mkey_mask();
}
+
+static __be64 get_umr_reg_mr_mask(void)
+{
+ u64 result;
+
+ result = MLX5_MKEY_MASK_LEN |
+ MLX5_MKEY_MASK_PAGE_SIZE |
+ MLX5_MKEY_MASK_START_ADDR |
+ MLX5_MKEY_MASK_PD |
+ MLX5_MKEY_MASK_LR |
+ MLX5_MKEY_MASK_LW |
+ MLX5_MKEY_MASK_KEY |
+ MLX5_MKEY_MASK_RR |
+ MLX5_MKEY_MASK_RW |
+ MLX5_MKEY_MASK_A |
+ MLX5_MKEY_MASK_FREE;
+
+ return cpu_to_be64(result);
+}
+
+static __be64 get_umr_unreg_mr_mask(void)
+{
+ u64 result;
+
+ result = MLX5_MKEY_MASK_FREE;
+
+ return cpu_to_be64(result);
+}
+
+static __be64 get_umr_update_mtt_mask(void)
+{
+ u64 result;
+
+ result = MLX5_MKEY_MASK_FREE;
+
+ return cpu_to_be64(result);
+}
+
static void set_reg_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
struct ib_send_wr *wr)
{
- struct umr_wr *umrwr = (struct umr_wr *)&wr->wr.fast_reg;
- u64 mask;
+ struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
memset(umr, 0, sizeof(*umr));
+ if (wr->send_flags & MLX5_IB_SEND_UMR_CHECK_FREE)
+ umr->flags = MLX5_UMR_CHECK_FREE; /* fail if free */
+ else
+ umr->flags = MLX5_UMR_CHECK_NOT_FREE; /* fail if not free */
+
if (!(wr->send_flags & MLX5_IB_SEND_UMR_UNREG)) {
- umr->flags = 1 << 5; /* fail if not free */
umr->klm_octowords = get_klm_octo(umrwr->npages);
- mask = MLX5_MKEY_MASK_LEN |
- MLX5_MKEY_MASK_PAGE_SIZE |
- MLX5_MKEY_MASK_START_ADDR |
- MLX5_MKEY_MASK_PD |
- MLX5_MKEY_MASK_LR |
- MLX5_MKEY_MASK_LW |
- MLX5_MKEY_MASK_KEY |
- MLX5_MKEY_MASK_RR |
- MLX5_MKEY_MASK_RW |
- MLX5_MKEY_MASK_A |
- MLX5_MKEY_MASK_FREE;
- umr->mkey_mask = cpu_to_be64(mask);
+ if (wr->send_flags & MLX5_IB_SEND_UMR_UPDATE_MTT) {
+ umr->mkey_mask = get_umr_update_mtt_mask();
+ umr->bsf_octowords = get_klm_octo(umrwr->target.offset);
+ umr->flags |= MLX5_UMR_TRANSLATION_OFFSET_EN;
+ } else {
+ umr->mkey_mask = get_umr_reg_mr_mask();
+ }
} else {
- umr->flags = 2 << 5; /* fail if free */
- mask = MLX5_MKEY_MASK_FREE;
- umr->mkey_mask = cpu_to_be64(mask);
+ umr->mkey_mask = get_umr_unreg_mr_mask();
}
if (!wr->num_sge)
- umr->flags |= (1 << 7); /* inline */
+ umr->flags |= MLX5_UMR_INLINE;
}
static u8 get_umr_flags(int acc)
@@ -1882,19 +1907,23 @@ static void set_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *wr,
static void set_reg_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *wr)
{
+ struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
+
memset(seg, 0, sizeof(*seg));
if (wr->send_flags & MLX5_IB_SEND_UMR_UNREG) {
seg->status = 1 << 6;
return;
}
- seg->flags = convert_access(wr->wr.fast_reg.access_flags);
- seg->flags_pd = cpu_to_be32(to_mpd((struct ib_pd *)wr->wr.fast_reg.page_list)->pdn);
- seg->start_addr = cpu_to_be64(wr->wr.fast_reg.iova_start);
- seg->len = cpu_to_be64(wr->wr.fast_reg.length);
- seg->log2_page_size = wr->wr.fast_reg.page_shift;
+ seg->flags = convert_access(umrwr->access_flags);
+ if (!(wr->send_flags & MLX5_IB_SEND_UMR_UPDATE_MTT)) {
+ seg->flags_pd = cpu_to_be32(to_mpd(umrwr->pd)->pdn);
+ seg->start_addr = cpu_to_be64(umrwr->target.virt_addr);
+ }
+ seg->len = cpu_to_be64(umrwr->length);
+ seg->log2_page_size = umrwr->page_shift;
seg->qpn_mkey7_0 = cpu_to_be32(0xffffff00 |
- mlx5_mkey_variant(wr->wr.fast_reg.rkey));
+ mlx5_mkey_variant(umrwr->mkey));
}
static void set_frwr_pages(struct mlx5_wqe_data_seg *dseg,
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 407bdb6..9d7d239 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -131,6 +131,15 @@ enum {
MLX5_MKEY_MASK_FREE = 1ull << 29,
};
+enum {
+ MLX5_UMR_TRANSLATION_OFFSET_EN = (1 << 4),
+
+ MLX5_UMR_CHECK_NOT_FREE = (1 << 5),
+ MLX5_UMR_CHECK_FREE = (2 << 5),
+
+ MLX5_UMR_INLINE = (1 << 7),
+};
+
enum mlx5_event {
MLX5_EVENT_TYPE_COMP = 0x0,
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 11/20] IB/mlx5: Refactor UMR to have its own context struct
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (9 preceding siblings ...)
2014-03-02 10:49 ` [RFC 10/20] IB/mlx5: Enhance UMR support to allow partial page table update Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 12/20] net/mlx5_core: Add support for page faults events and low level handling Haggai Eran
` (10 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Instead of having the UMR context part of each memory region,
allocate a struct on the stack. This allows queuing multiple UMRs that access
the same memory region.
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 13 ++++++++++--
drivers/infiniband/hw/mlx5/mr.c | 40 ++++++++++++++++++------------------
2 files changed, 31 insertions(+), 22 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index afe39e7..29f58c1 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -279,8 +279,6 @@ struct mlx5_ib_mr {
__be64 *pas;
dma_addr_t dma;
int npages;
- struct completion done;
- enum ib_wc_status status;
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx *sig;
@@ -292,6 +290,17 @@ struct mlx5_ib_fast_reg_page_list {
dma_addr_t map;
};
+struct mlx5_ib_umr_context {
+ enum ib_wc_status status;
+ struct completion done;
+};
+
+static inline void mlx5_ib_init_umr_context(struct mlx5_ib_umr_context *context)
+{
+ context->status = -1;
+ init_completion(&context->done);
+}
+
struct umr_common {
struct ib_pd *pd;
struct ib_cq *cq;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index dd6a4bb..fa0bcd6 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -712,7 +712,7 @@ static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev,
void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
{
- struct mlx5_ib_mr *mr;
+ struct mlx5_ib_umr_context *context;
struct ib_wc wc;
int err;
@@ -725,9 +725,9 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
if (err == 0)
break;
- mr = (struct mlx5_ib_mr *)(unsigned long)wc.wr_id;
- mr->status = wc.status;
- complete(&mr->done);
+ context = (struct mlx5_ib_umr_context *)wc.wr_id;
+ context->status = wc.status;
+ complete(&context->done);
}
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
}
@@ -739,6 +739,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct device *ddev = dev->ib_dev.dma_device;
struct umr_common *umrc = &dev->umrc;
+ struct mlx5_ib_umr_context umr_context;
struct ib_send_wr wr, *bad;
struct mlx5_ib_mr *mr;
struct ib_sge sg;
@@ -778,24 +779,21 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
}
memset(&wr, 0, sizeof(wr));
- wr.wr_id = (u64)(unsigned long)mr;
+ wr.wr_id = (u64)(unsigned long)&umr_context;
prep_umr_reg_wqe(pd, &wr, &sg, mr->dma, npages, mr->mmr.key, page_shift, virt_addr, len, access_flags);
- /* We serialize polls so one process does not kidnap another's
- * completion. This is not a problem since wr is completed in
- * around 1 usec
- */
+ mlx5_ib_init_umr_context(&umr_context);
down(&umrc->sem);
- init_completion(&mr->done);
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
mlx5_ib_warn(dev, "post send failed, err %d\n", err);
goto unmap_dma;
- }
- wait_for_completion(&mr->done);
- if (mr->status != IB_WC_SUCCESS) {
- mlx5_ib_warn(dev, "reg umr failed\n");
- err = -EFAULT;
+ } else {
+ wait_for_completion(&umr_context.done);
+ if (umr_context.status != IB_WC_SUCCESS) {
+ mlx5_ib_warn(dev, "reg umr failed\n");
+ err = -EFAULT;
+ }
}
mr->mmr.iova = virt_addr;
@@ -944,24 +942,26 @@ error:
static int unreg_umr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
{
struct umr_common *umrc = &dev->umrc;
+ struct mlx5_ib_umr_context umr_context;
struct ib_send_wr wr, *bad;
int err;
memset(&wr, 0, sizeof(wr));
- wr.wr_id = (u64)(unsigned long)mr;
+ wr.wr_id = (u64)(unsigned long)&umr_context;
prep_umr_unreg_wqe(dev, &wr, mr->mmr.key);
+ mlx5_ib_init_umr_context(&umr_context);
down(&umrc->sem);
- init_completion(&mr->done);
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
up(&umrc->sem);
mlx5_ib_dbg(dev, "err %d\n", err);
goto error;
+ } else {
+ wait_for_completion(&umr_context.done);
+ up(&umrc->sem);
}
- wait_for_completion(&mr->done);
- up(&umrc->sem);
- if (mr->status != IB_WC_SUCCESS) {
+ if (umr_context.status != IB_WC_SUCCESS) {
mlx5_ib_warn(dev, "unreg umr failed\n");
err = -EFAULT;
goto error;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 12/20] net/mlx5_core: Add support for page faults events and low level handling
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (10 preceding siblings ...)
2014-03-02 10:49 ` [RFC 11/20] IB/mlx5: Refactor UMR to have its own context struct Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 13/20] IB/mlx5: Implement the ODP capability query verb Haggai Eran
` (9 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
* Add a handler function pointer in the mlx5_core_qp struct for page fault
events. Handle page fault events by calling the handler function, if not
NULL.
* Add on-demand paging capability query command.
* Export command for resuming QPs after page faults.
* Add various constants related to paging support.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 6 +-
drivers/infiniband/hw/mlx5/qp.c | 4 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 11 +-
drivers/net/ethernet/mellanox/mlx5/core/fw.c | 35 ++++++-
drivers/net/ethernet/mellanox/mlx5/core/main.c | 8 +-
drivers/net/ethernet/mellanox/mlx5/core/qp.c | 134 ++++++++++++++++++++++++-
include/linux/mlx5/device.h | 60 ++++++++++-
include/linux/mlx5/driver.h | 18 ++++
include/linux/mlx5/qp.h | 53 ++++++++++
9 files changed, 308 insertions(+), 21 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index fa0bcd6..d1e8426 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -147,7 +147,7 @@ static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
mr->order = ent->order;
mr->umred = 1;
mr->dev = dev;
- in->seg.status = 1 << 6;
+ in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((npages + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
in->seg.flags = MLX5_ACCESS_MODE_MTT | MLX5_PERM_UMR_EN;
@@ -1029,7 +1029,7 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
goto err_free;
}
- in->seg.status = 1 << 6; /* free */
+ in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32(ndescs);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
in->seg.flags_pd = cpu_to_be32(to_mpd(pd)->pdn);
@@ -1144,7 +1144,7 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
goto err_free;
}
- in->seg.status = 1 << 6; /* free */
+ in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((max_page_list_len + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
in->seg.flags = MLX5_PERM_UMR_EN | MLX5_ACCESS_MODE_MTT;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index e48b699..385501b 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1890,7 +1890,7 @@ static void set_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *wr,
{
memset(seg, 0, sizeof(*seg));
if (li) {
- seg->status = 1 << 6;
+ seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
@@ -1911,7 +1911,7 @@ static void set_reg_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *w
memset(seg, 0, sizeof(*seg));
if (wr->send_flags & MLX5_IB_SEND_UMR_UNREG) {
- seg->status = 1 << 6;
+ seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 64a61b2..23bccbe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -157,6 +157,8 @@ static const char *eqe_type_str(u8 type)
return "MLX5_EVENT_TYPE_CMD";
case MLX5_EVENT_TYPE_PAGE_REQUEST:
return "MLX5_EVENT_TYPE_PAGE_REQUEST";
+ case MLX5_EVENT_TYPE_PAGE_FAULT:
+ return "MLX5_EVENT_TYPE_PAGE_FAULT";
default:
return "Unrecognized event";
}
@@ -275,6 +277,9 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
}
break;
+ case MLX5_EVENT_TYPE_PAGE_FAULT:
+ mlx5_eq_pagefault(dev, eqe);
+ break;
default:
mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n", eqe->type, eq->eqn);
@@ -441,8 +446,12 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev)
int mlx5_start_eqs(struct mlx5_core_dev *dev)
{
struct mlx5_eq_table *table = &dev->priv.eq_table;
+ u32 async_event_mask = MLX5_ASYNC_EVENT_MASK;
int err;
+ if (dev->caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+ async_event_mask |= (1ull << MLX5_EVENT_TYPE_PAGE_FAULT);
+
err = mlx5_create_map_eq(dev, &table->cmd_eq, MLX5_EQ_VEC_CMD,
MLX5_NUM_CMD_EQE, 1ull << MLX5_EVENT_TYPE_CMD,
"mlx5_cmd_eq", &dev->priv.uuari.uars[0]);
@@ -454,7 +463,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
mlx5_cmd_use_events(dev);
err = mlx5_create_map_eq(dev, &table->async_eq, MLX5_EQ_VEC_ASYNC,
- MLX5_NUM_ASYNC_EQE, MLX5_ASYNC_EVENT_MASK,
+ MLX5_NUM_ASYNC_EQE, async_event_mask,
"mlx5_async_eq", &dev->priv.uuari.uars[0]);
if (err) {
mlx5_core_warn(dev, "failed to create async EQ %d\n", err);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index f012658..eed8b9d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -80,7 +80,7 @@ int mlx5_cmd_query_hca_cap(struct mlx5_core_dev *dev,
memset(&in, 0, sizeof(in));
in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_QUERY_HCA_CAP);
- in.hdr.opmod = cpu_to_be16(0x1);
+ in.hdr.opmod = cpu_to_be16(MLX5_CMD_OPMOD_QUERY_HCA_CAP_CUR_CAPS);
err = mlx5_cmd_exec(dev, &in, sizeof(in), out, sizeof(*out));
if (err)
goto out_out;
@@ -146,6 +146,39 @@ out_out:
return err;
}
+int mlx5_query_odp_caps(struct mlx5_core_dev *dev, struct mlx5_odp_caps *caps)
+{
+ int err;
+ struct mlx5_cmd_query_hca_cap_mbox_in in;
+ struct mlx5_cmd_query_hca_cap_mbox_out out;
+
+ if (!(dev->caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
+ return -ENOTSUPP;
+
+ memset(&in, 0, sizeof(in));
+ in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_QUERY_HCA_CAP);
+ in.hdr.opmod = cpu_to_be16(MLX5_CMD_OPMOD_QUERY_HCA_CAP_ODP_CUR_CAPS);
+ err = mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
+ if (err)
+ goto out;
+
+ if (out.hdr.status) {
+ err = mlx5_cmd_status_to_err(&out.hdr);
+ goto out;
+ }
+
+ *caps = out.odp_caps;
+
+ mlx5_core_dbg(dev, "on-demand paging capabilities:\nrc: %08x\nuc: %08x\nud: %08x\n",
+ be32_to_cpu(out.odp_caps.per_transport_caps.rc_odp_caps),
+ be32_to_cpu(out.odp_caps.per_transport_caps.uc_odp_caps),
+ be32_to_cpu(out.odp_caps.per_transport_caps.ud_odp_caps));
+
+out:
+ return err;
+}
+EXPORT_SYMBOL(mlx5_query_odp_caps);
+
int mlx5_cmd_init_hca(struct mlx5_core_dev *dev)
{
struct mlx5_cmd_init_hca_mbox_in in;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 96a0617..5a5ff5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -184,11 +184,6 @@ static void copy_rw_fields(struct mlx5_hca_cap *to, struct mlx5_hca_cap *from)
to->flags = cpu_to_be64(v64);
}
-enum {
- HCA_CAP_OPMOD_GET_MAX = 0,
- HCA_CAP_OPMOD_GET_CUR = 1,
-};
-
static int handle_hca_cap(struct mlx5_core_dev *dev)
{
struct mlx5_cmd_query_hca_cap_mbox_out *query_out = NULL;
@@ -210,7 +205,8 @@ static int handle_hca_cap(struct mlx5_core_dev *dev)
}
query_ctx.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_QUERY_HCA_CAP);
- query_ctx.hdr.opmod = cpu_to_be16(HCA_CAP_OPMOD_GET_CUR);
+ query_ctx.hdr.opmod =
+ cpu_to_be16(MLX5_CMD_OPMOD_QUERY_HCA_CAP_CUR_CAPS);
err = mlx5_cmd_exec(dev, &query_ctx, sizeof(query_ctx),
query_out, sizeof(*query_out));
if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 5105762..aa6ed06 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -39,7 +39,7 @@
#include "mlx5_core.h"
-void mlx5_qp_event(struct mlx5_core_dev *dev, u32 qpn, int event_type)
+static struct mlx5_core_qp *mlx5_core_get_qp(struct mlx5_core_dev *dev, u32 qpn)
{
struct mlx5_qp_table *table = &dev->priv.qp_table;
struct mlx5_core_qp *qp;
@@ -52,6 +52,19 @@ void mlx5_qp_event(struct mlx5_core_dev *dev, u32 qpn, int event_type)
spin_unlock(&table->lock);
+ return qp;
+}
+
+static void mlx5_core_put_qp(struct mlx5_core_qp *qp)
+{
+ if (atomic_dec_and_test(&qp->refcount))
+ complete(&qp->free);
+}
+
+void mlx5_qp_event(struct mlx5_core_dev *dev, u32 qpn, int event_type)
+{
+ struct mlx5_core_qp *qp = mlx5_core_get_qp(dev, qpn);
+
if (!qp) {
mlx5_core_warn(dev, "Async event for bogus QP 0x%x\n", qpn);
return;
@@ -59,8 +72,92 @@ void mlx5_qp_event(struct mlx5_core_dev *dev, u32 qpn, int event_type)
qp->event(qp, event_type);
- if (atomic_dec_and_test(&qp->refcount))
- complete(&qp->free);
+ mlx5_core_put_qp(qp);
+}
+
+void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
+{
+ struct mlx5_eqe_page_fault *pf_eqe = &eqe->data.page_fault;
+ int qpn = be32_to_cpu(pf_eqe->flags_qpn) & MLX5_QPN_MASK;
+ struct mlx5_core_qp *qp = mlx5_core_get_qp(dev, qpn);
+ struct mlx5_pagefault pfault;
+
+ if (!qp) {
+ mlx5_core_warn(dev, "ODP event for non-existent QP %06x\n",
+ qpn);
+ return;
+ }
+
+ pfault.event_subtype = eqe->sub_type;
+ pfault.flags = (be32_to_cpu(pf_eqe->flags_qpn) >> MLX5_QPN_BITS) &
+ (MLX5_PFAULT_REQUESTOR | MLX5_PFAULT_WRITE | MLX5_PFAULT_RDMA);
+ pfault.bytes_committed = be32_to_cpu(
+ pf_eqe->bytes_committed);
+
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: subtype: 0x%02x, flags: 0x%02x,\n",
+ eqe->sub_type, pfault.flags);
+
+ switch (eqe->sub_type) {
+ case MLX5_PFAULT_SUBTYPE_RDMA:
+ /* RDMA based event */
+ pfault.rdma.r_key =
+ be32_to_cpu(pf_eqe->rdma.r_key);
+ pfault.rdma.packet_size =
+ be16_to_cpu(pf_eqe->rdma.packet_length);
+ pfault.rdma.rdma_op_len =
+ be32_to_cpu(pf_eqe->rdma.rdma_op_len);
+ pfault.rdma.rdma_va =
+ be64_to_cpu(pf_eqe->rdma.rdma_va);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: qpn: 0x%06x, r_key: 0x%08x,\n",
+ qpn, pfault.rdma.r_key);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: rdma_op_len: 0x%08x,\n",
+ pfault.rdma.rdma_op_len);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: rdma_va: 0x%016llx,\n",
+ pfault.rdma.rdma_va);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: bytes_committed: 0x%06x\n",
+ pfault.bytes_committed);
+ break;
+
+ case MLX5_PFAULT_SUBTYPE_WQE:
+ /* WQE based event */
+ pfault.wqe.wqe_index =
+ be16_to_cpu(pf_eqe->wqe.wqe_index);
+ pfault.wqe.packet_size =
+ be16_to_cpu(pf_eqe->wqe.packet_length);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: qpn: 0x%06x, wqe_index: 0x%04x,\n",
+ qpn, pfault.wqe.wqe_index);
+ mlx5_core_dbg(dev,
+ "PAGE_FAULT: bytes_committed: 0x%06x\n",
+ pfault.bytes_committed);
+ break;
+
+ default:
+ mlx5_core_warn(dev,
+ "Unsupported page fault event sub-type: 0x%02hhx, QP %06x\n",
+ eqe->sub_type, qpn);
+ /* Unsupported page faults should still be resolved by the
+ * page fault handler
+ */
+ }
+
+ if (qp->pfault_handler) {
+ qp->pfault_handler(qp, &pfault);
+ } else {
+ mlx5_core_err(dev,
+ "ODP event for QP %08x, without a fault handler in QP\n",
+ qpn);
+ /* Page fault will remain unresolved. QP will hang until it is
+ * destroyed
+ */
+ }
+
+ mlx5_core_put_qp(qp);
}
int mlx5_core_create_qp(struct mlx5_core_dev *dev,
@@ -138,8 +235,7 @@ int mlx5_core_destroy_qp(struct mlx5_core_dev *dev,
radix_tree_delete(&table->tree, qp->qpn);
spin_unlock_irqrestore(&table->lock, flags);
- if (atomic_dec_and_test(&qp->refcount))
- complete(&qp->free);
+ mlx5_core_put_qp(qp);
wait_for_completion(&qp->free);
memset(&in, 0, sizeof(in));
@@ -300,3 +396,31 @@ int mlx5_core_xrcd_dealloc(struct mlx5_core_dev *dev, u32 xrcdn)
return err;
}
EXPORT_SYMBOL_GPL(mlx5_core_xrcd_dealloc);
+
+int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
+ u8 flags, int error)
+{
+ struct mlx5_page_fault_resume_mbox_in in;
+ struct mlx5_page_fault_resume_mbox_out out;
+ int err;
+
+ memset(&in, 0, sizeof(in));
+ memset(&out, 0, sizeof(out));
+ in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_PAGE_FAULT_RESUME);
+ in.hdr.opmod = 0;
+ flags &= (MLX5_PAGE_FAULT_RESUME_REQUESTOR |
+ MLX5_PAGE_FAULT_RESUME_WRITE |
+ MLX5_PAGE_FAULT_RESUME_RDMA);
+ flags |= (error ? MLX5_PAGE_FAULT_RESUME_ERROR : 0);
+ in.flags_qpn = cpu_to_be32((qpn & MLX5_QPN_MASK) |
+ (flags << MLX5_QPN_BITS));
+ err = mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
+ if (err)
+ return err;
+
+ if (out.hdr.status)
+ err = mlx5_cmd_status_to_err(&out.hdr);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(mlx5_core_page_fault_resume);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 9d7d239..0a6fe1b 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -71,6 +71,15 @@ enum {
};
enum {
+ MLX5_MKEY_INBOX_PG_ACCESS = 1 << 31
+};
+
+enum {
+ MLX5_PFAULT_SUBTYPE_WQE = 0,
+ MLX5_PFAULT_SUBTYPE_RDMA = 1,
+};
+
+enum {
MLX5_PERM_LOCAL_READ = 1 << 2,
MLX5_PERM_LOCAL_WRITE = 1 << 3,
MLX5_PERM_REMOTE_READ = 1 << 4,
@@ -166,6 +175,8 @@ enum mlx5_event {
MLX5_EVENT_TYPE_CMD = 0x0a,
MLX5_EVENT_TYPE_PAGE_REQUEST = 0xb,
+
+ MLX5_EVENT_TYPE_PAGE_FAULT = 0xc,
};
enum {
@@ -350,6 +361,21 @@ struct mlx5_hca_cap {
u8 rsvd28[76];
};
+enum mlx5_odp_transport_cap_bits {
+ MLX5_ODP_SUPPORT_SEND = 1 << 31,
+ MLX5_ODP_SUPPORT_RECV = 1 << 30,
+ MLX5_ODP_SUPPORT_WRITE = 1 << 29,
+};
+
+struct mlx5_odp_caps {
+ char reserved[0x10];
+ struct {
+ __be32 rc_odp_caps;
+ __be32 uc_odp_caps;
+ __be32 ud_odp_caps;
+ } per_transport_caps;
+ char reserved2[0xe4];
+};
struct mlx5_cmd_query_hca_cap_mbox_in {
struct mlx5_inbox_hdr hdr;
@@ -360,10 +386,12 @@ struct mlx5_cmd_query_hca_cap_mbox_in {
struct mlx5_cmd_query_hca_cap_mbox_out {
struct mlx5_outbox_hdr hdr;
u8 rsvd0[8];
- struct mlx5_hca_cap hca_cap;
+ union {
+ struct mlx5_hca_cap hca_cap;
+ struct mlx5_odp_caps odp_caps;
+ };
};
-
struct mlx5_cmd_set_hca_cap_mbox_in {
struct mlx5_inbox_hdr hdr;
u8 rsvd[8];
@@ -500,6 +528,27 @@ struct mlx5_eqe_page_req {
__be32 rsvd1[5];
};
+struct mlx5_eqe_page_fault {
+ __be32 bytes_committed;
+ union {
+ struct {
+ u16 reserved1;
+ __be16 wqe_index;
+ u16 reserved2;
+ __be16 packet_length;
+ u8 reserved3[12];
+ } __packed wqe;
+ struct {
+ __be32 r_key;
+ u16 reserved1;
+ __be16 packet_length;
+ __be32 rdma_op_len;
+ __be64 rdma_va;
+ } __packed rdma;
+ } __packed;
+ __be32 flags_qpn;
+} __packed;
+
union ev_data {
__be32 raw[7];
struct mlx5_eqe_cmd cmd;
@@ -512,6 +561,7 @@ union ev_data {
struct mlx5_eqe_congestion cong;
struct mlx5_eqe_stall_vl stall_vl;
struct mlx5_eqe_page_req req_pages;
+ struct mlx5_eqe_page_fault page_fault;
} __packed;
struct mlx5_eqe {
@@ -838,6 +888,10 @@ struct mlx5_query_eq_mbox_out {
struct mlx5_eq_context ctx;
};
+enum {
+ MLX5_MKEY_STATUS_FREE = 1 << 6,
+};
+
struct mlx5_mkey_seg {
/* This is a two bit field occupying bits 31-30.
* bit 31 is always 0,
@@ -874,7 +928,7 @@ struct mlx5_query_special_ctxs_mbox_out {
struct mlx5_create_mkey_mbox_in {
struct mlx5_inbox_hdr hdr;
__be32 input_mkey_index;
- u8 rsvd0[4];
+ __be32 flags;
struct mlx5_mkey_seg seg;
u8 rsvd1[16];
__be32 xlat_oct_act_size;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2bce4aa..4f162e8 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -113,6 +113,7 @@ enum {
MLX5_CMD_OP_QUERY_MKEY = 0x201,
MLX5_CMD_OP_DESTROY_MKEY = 0x202,
MLX5_CMD_OP_QUERY_SPECIAL_CONTEXTS = 0x203,
+ MLX5_CMD_OP_PAGE_FAULT_RESUME = 0x204,
MLX5_CMD_OP_CREATE_EQ = 0x301,
MLX5_CMD_OP_DESTROY_EQ = 0x302,
@@ -174,6 +175,13 @@ enum {
};
enum {
+ MLX5_CMD_OPMOD_QUERY_HCA_CAP_MAX_CAPS = 0,
+ MLX5_CMD_OPMOD_QUERY_HCA_CAP_CUR_CAPS = 1,
+ MLX5_CMD_OPMOD_QUERY_HCA_CAP_ODP_MAX_CAPS = 4,
+ MLX5_CMD_OPMOD_QUERY_HCA_CAP_ODP_CUR_CAPS = 5,
+};
+
+enum {
MLX5_REG_PCAP = 0x5001,
MLX5_REG_PMTU = 0x5003,
MLX5_REG_PTYS = 0x5004,
@@ -187,6 +195,13 @@ enum {
MLX5_REG_HOST_ENDIANNESS = 0x7004,
};
+enum mlx5_page_fault_resume_flags {
+ MLX5_PAGE_FAULT_RESUME_REQUESTOR = 1 << 0,
+ MLX5_PAGE_FAULT_RESUME_WRITE = 1 << 1,
+ MLX5_PAGE_FAULT_RESUME_RDMA = 1 << 2,
+ MLX5_PAGE_FAULT_RESUME_ERROR = 1 << 7,
+};
+
enum dbg_rsc_type {
MLX5_DBG_RSC_QP,
MLX5_DBG_RSC_EQ,
@@ -750,6 +765,7 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev);
void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas);
void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn);
void mlx5_qp_event(struct mlx5_core_dev *dev, u32 qpn, int event_type);
+void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
void mlx5_srq_event(struct mlx5_core_dev *dev, u32 srqn, int event_type);
struct mlx5_core_srq *mlx5_core_get_srq(struct mlx5_core_dev *dev, u32 srqn);
void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, unsigned long vector);
@@ -786,6 +802,8 @@ void mlx5_cmdif_debugfs_cleanup(struct mlx5_core_dev *dev);
int mlx5_core_create_psv(struct mlx5_core_dev *dev, u32 pdn,
int npsvs, u32 *sig_index);
int mlx5_core_destroy_psv(struct mlx5_core_dev *dev, int psv_num);
+int mlx5_query_odp_caps(struct mlx5_core_dev *dev,
+ struct mlx5_odp_caps *odp_caps);
static inline u32 mlx5_mkey_to_idx(u32 mkey)
{
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 48212c6..8db515c 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -41,6 +41,9 @@
#define MLX5_DIF_SIZE 8
#define MLX5_STRIDE_BLOCK_OP 0x400
+#define MLX5_QPN_BITS 24
+#define MLX5_QPN_MASK ((1 << MLX5_QPN_BITS) - 1)
+
enum mlx5_qp_optpar {
MLX5_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0,
MLX5_QP_OPTPAR_RRE = 1 << 1,
@@ -340,8 +343,45 @@ struct mlx5_stride_block_ctrl_seg {
__be16 num_entries;
};
+enum mlx5_pagefault_flags {
+ MLX5_PFAULT_REQUESTOR = 1 << 0,
+ MLX5_PFAULT_WRITE = 1 << 1,
+ MLX5_PFAULT_RDMA = 1 << 2,
+};
+
+/* Contains the details of a pagefault. */
+struct mlx5_pagefault {
+ u32 bytes_committed;
+ u8 event_subtype;
+ enum mlx5_pagefault_flags flags;
+ union {
+ /* Initiator or send message responder pagefault details. */
+ struct {
+ /* Received packet size, only valid for responders. */
+ u32 packet_size;
+ /*
+ * WQE index. Refers to either the send queue or
+ * receive queue, according to event_subtype.
+ */
+ u16 wqe_index;
+ } wqe;
+ /* RDMA responder pagefault details */
+ struct {
+ u32 r_key;
+ /*
+ * Received packet size, minimal size page fault
+ * resolution required for forward progress.
+ */
+ u32 packet_size;
+ u32 rdma_op_len;
+ u64 rdma_va;
+ } rdma;
+ };
+};
+
struct mlx5_core_qp {
void (*event) (struct mlx5_core_qp *, int);
+ void (*pfault_handler)(struct mlx5_core_qp *, struct mlx5_pagefault *);
int qpn;
atomic_t refcount;
struct completion free;
@@ -511,6 +551,17 @@ static inline struct mlx5_core_mr *__mlx5_mr_lookup(struct mlx5_core_dev *dev, u
return radix_tree_lookup(&dev->priv.mr_table.tree, key);
}
+struct mlx5_page_fault_resume_mbox_in {
+ struct mlx5_inbox_hdr hdr;
+ __be32 flags_qpn;
+ u8 reserved[4];
+};
+
+struct mlx5_page_fault_resume_mbox_out {
+ struct mlx5_outbox_hdr hdr;
+ u8 rsvd[8];
+};
+
int mlx5_core_create_qp(struct mlx5_core_dev *dev,
struct mlx5_core_qp *qp,
struct mlx5_create_qp_mbox_in *in,
@@ -530,6 +581,8 @@ void mlx5_init_qp_table(struct mlx5_core_dev *dev);
void mlx5_cleanup_qp_table(struct mlx5_core_dev *dev);
int mlx5_debug_qp_add(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp);
void mlx5_debug_qp_remove(struct mlx5_core_dev *dev, struct mlx5_core_qp *qp);
+int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
+ u8 context, int error);
static inline const char *mlx5_qp_type_str(int type)
{
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 13/20] IB/mlx5: Implement the ODP capability query verb
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (11 preceding siblings ...)
2014-03-02 10:49 ` [RFC 12/20] net/mlx5_core: Add support for page faults events and low level handling Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 14/20] IB/mlx5: Changes in memory region creation to support on-demand paging Haggai Eran
` (8 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
The patch adds infrastructure to support the ODP capability query verb in the
mlx5 driver. The verb will read the capabilities from the device, and enable
only those capabilities that both the driver and the device supports.
At this point ODP is not supported, so no capability is copied from the
device, but the patch exposes the global ODP device capability bit to
designate that the new verb can be called.
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/Makefile | 1 +
drivers/infiniband/hw/mlx5/main.c | 13 +++++++
drivers/infiniband/hw/mlx5/mlx5_ib.h | 8 +++++
drivers/infiniband/hw/mlx5/odp.c | 69 ++++++++++++++++++++++++++++++++++++
4 files changed, 91 insertions(+)
create mode 100644 drivers/infiniband/hw/mlx5/odp.c
diff --git a/drivers/infiniband/hw/mlx5/Makefile b/drivers/infiniband/hw/mlx5/Makefile
index 4ea0135..27a7015 100644
--- a/drivers/infiniband/hw/mlx5/Makefile
+++ b/drivers/infiniband/hw/mlx5/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_MLX5_INFINIBAND) += mlx5_ib.o
mlx5_ib-y := main.o cq.o doorbell.o qp.o mem.o srq.o mr.o ah.o mad.o
+mlx5_ib-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += odp.o
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 7b9c078..f8015ac 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -318,6 +318,12 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ if (dev->mdev.caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+ props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+ props->odp_caps = dev->odp_caps;
+#endif
+
out:
kfree(in_mad);
kfree(out_mad);
@@ -1397,6 +1403,10 @@ static int init_one(struct pci_dev *pdev,
(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) |
(1ull << IB_USER_VERBS_CMD_OPEN_QP);
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ dev->ib_dev.uverbs_ex_cmd_mask =
+ (1ull << IB_USER_VERBS_EX_CMD_QUERY_ODP_CAPS);
+#endif
dev->ib_dev.query_device = mlx5_ib_query_device;
dev->ib_dev.query_port = mlx5_ib_query_port;
@@ -1441,6 +1451,9 @@ static int init_one(struct pci_dev *pdev,
dev->ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
dev->ib_dev.free_fast_reg_page_list = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ mlx5_ib_internal_query_odp_caps(dev);
+#endif
if (mdev->caps.flags & MLX5_DEV_CAP_FLAG_XRC) {
dev->ib_dev.alloc_xrcd = mlx5_ib_alloc_xrcd;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 29f58c1..767e791 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -392,6 +392,9 @@ struct mlx5_ib_dev {
struct mlx5_mr_cache cache;
struct timer_list delay_timer;
int fill_delay;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ struct ib_odp_caps odp_caps;
+#endif
};
static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -569,6 +572,11 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context);
int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int mlx5_ib_query_odp_caps(struct ib_device *ibdev, struct ib_odp_caps *caps);
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
static inline void init_query_mad(struct ib_smp *mad)
{
mad->base_version = 1;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
new file mode 100644
index 0000000..71ef604
--- /dev/null
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "mlx5_ib.h"
+
+#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do { \
+ if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name) \
+ ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name; \
+} while (0)
+
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
+{
+ int err;
+ struct mlx5_odp_caps hw_caps;
+ struct ib_odp_caps *caps = &dev->odp_caps;
+
+ memset(caps, 0, sizeof(*caps));
+
+ if (!(dev->mdev.caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG))
+ return 0;
+
+ err = mlx5_query_odp_caps(&dev->mdev, &hw_caps);
+ if (err)
+ goto out;
+
+ /* At this point we would copy the capability bits that the driver
+ * supports from the hw_caps struct to the caps struct. However, no
+ * such capabilities are supported so far. */
+out:
+ return err;
+}
+
+int mlx5_ib_query_odp_caps(struct ib_device *ibdev, struct ib_odp_caps *caps)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibdev);
+
+ memcpy(caps, &dev->odp_caps, sizeof(*caps));
+
+ return 0;
+}
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 14/20] IB/mlx5: Changes in memory region creation to support on-demand paging
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (12 preceding siblings ...)
2014-03-02 10:49 ` [RFC 13/20] IB/mlx5: Implement the ODP capability query verb Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 15/20] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation Haggai Eran
` (7 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
This patch wraps together several changes needed for on-demand paging support
in the mlx5_ib_populate_pas function, and when registering memory regions.
* Instead of accepting a UMR bit telling the function to enable all access
flags, the function now accepts the access flags themselves.
* For on-demand paging memory regions, fill the memory tables from the
correct list, and enable/disable the access flags per-page according to
whether the page is present.
* A new bit is set to enable writing of access flags when using the firmware
create_mkey command.
* Disable contig pages when on-demand paging is enabled.
In addition the patch changes the UMR code to use PTR_ALIGN instead of our own
macro.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mem.c | 54 ++++++++++++++++++++++++++++++++++--
drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++++++-
drivers/infiniband/hw/mlx5/mr.c | 32 +++++++++++----------
include/linux/mlx5/device.h | 3 ++
4 files changed, 83 insertions(+), 18 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 8499aec..d760bfb 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <rdma/ib_umem.h>
+#include <rdma/ib_umem_odp.h>
#include "mlx5_ib.h"
/* @umem: umem object to scan
@@ -56,6 +57,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
struct scatterlist *sg;
int entry;
+ /* With ODP we must always match OS page size. */
+ if (umem->odp_data) {
+ *count = ib_umem_page_count(umem);
+ *shift = PAGE_SHIFT;
+ *ncont = *count;
+ if (order)
+ *order = ilog2(roundup_pow_of_two(*count));
+
+ return;
+ }
+
addr = addr >> PAGE_SHIFT;
tmp = (unsigned long)addr;
m = find_first_bit(&tmp, sizeof(tmp));
@@ -107,8 +119,31 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
*count = i;
}
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
+{
+ u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
+
+ if (umem_dma & ODP_READ_ALLOWED_BIT)
+ mtt_entry |= MLX5_IB_MTT_READ;
+ if (umem_dma & ODP_WRITE_ALLOWED_BIT)
+ mtt_entry |= MLX5_IB_MTT_WRITE;
+
+ return mtt_entry;
+}
+#endif
+
+/*
+ * Populate the given array with bus addresses from the umem.
+ *
+ * dev - mlx5_ib device
+ * umem - umem to use to fill the pages
+ * page_shift - determines the page size used in the resulting array
+ * pas - bus addresses array to fill
+ * access_flags - access flags to set on all present pages
+ */
void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int umr)
+ int page_shift, __be64 *pas, int access_flags)
{
int shift = page_shift - PAGE_SHIFT;
int mask = (1 << shift) - 1;
@@ -118,6 +153,20 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int len;
struct scatterlist *sg;
int entry;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ const bool odp = umem->odp_data != NULL;
+
+ if (odp) {
+ int num_pages = ib_umem_num_pages(umem);
+ WARN_ON(shift != 0);
+ WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
+ for (i = 0; i < num_pages; ++i) {
+ dma_addr_t pa = umem->odp_data->dma_list[i];
+ pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+ }
+ return;
+ }
+#endif
i = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
@@ -126,8 +175,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
for (k = 0; k < len; k++) {
if (!(i & mask)) {
cur = base + (k << PAGE_SHIFT);
- if (umr)
- cur |= 3;
+ cur |= access_flags;
pas[i >> shift] = cpu_to_be64(cur);
mlx5_ib_dbg(dev, "pas[%d] 0x%llx\n",
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 767e791..dd93790 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -268,6 +268,13 @@ struct mlx5_ib_xrcd {
u32 xrcdn;
};
+enum mlx5_ib_mtt_access_flags {
+ MLX5_IB_MTT_READ = (1 << 0),
+ MLX5_IB_MTT_WRITE = (1 << 1),
+};
+
+#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
+
struct mlx5_ib_mr {
struct ib_mr ibmr;
struct mlx5_core_mr mmr;
@@ -562,7 +569,7 @@ void mlx5_ib_cleanup_fmr(struct mlx5_ib_dev *dev);
void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
int *ncont, int *order);
void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int umr);
+ int page_shift, __be64 *pas, int access_flags);
void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
int mlx5_ib_get_cqe_size(struct mlx5_ib_dev *dev, struct ib_cq *ibcq);
int mlx5_mr_cache_init(struct mlx5_ib_dev *dev);
@@ -594,4 +601,7 @@ static inline u8 convert_access(int acc)
MLX5_PERM_LOCAL_READ;
}
+#define MLX5_MAX_UMR_SHIFT 17
+#define MLX5_MAX_UMR_PAGES (1 << MLX5_MAX_UMR_SHIFT)
+
#endif /* MLX5_IB_H */
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index d1e8426..5ea099e 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -48,13 +48,6 @@ enum {
MLX5_UMR_ALIGN = 2048
};
-static __be64 *mr_align(__be64 *ptr, int align)
-{
- unsigned long mask = align - 1;
-
- return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
-}
-
static int order2idx(struct mlx5_ib_dev *dev, int order)
{
struct mlx5_mr_cache *cache = &dev->cache;
@@ -666,7 +659,7 @@ static int get_octo_len(u64 addr, u64 len, int page_size)
static int use_umr(int order)
{
- return order <= 17;
+ return order <= MLX5_MAX_UMR_SHIFT;
}
static void prep_umr_reg_wqe(struct ib_pd *pd, struct ib_send_wr *wr,
@@ -743,7 +736,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
struct ib_send_wr wr, *bad;
struct mlx5_ib_mr *mr;
struct ib_sge sg;
- int size = sizeof(u64) * npages;
+ int size;
+ __be64 *pas;
int err = 0;
int i;
@@ -762,17 +756,22 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
if (!mr)
return ERR_PTR(-EAGAIN);
+ /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
+ * To avoid copying garbage after the pas array, we allocate
+ * a little more. */
+ size = ALIGN(sizeof(u64) * npages, MLX5_UMR_MTT_ALIGNMENT);
mr->pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
if (!mr->pas) {
err = -ENOMEM;
goto free_mr;
}
- mlx5_ib_populate_pas(dev, umem, page_shift,
- mr_align(mr->pas, MLX5_UMR_ALIGN), 1);
+ pas = PTR_ALIGN(mr->pas, MLX5_UMR_ALIGN);
+ mlx5_ib_populate_pas(dev, umem, page_shift, pas, MLX5_IB_MTT_PRESENT);
+ /* Clear padding after the actual pages. */
+ memset(pas + npages, 0, size - npages * sizeof(u64));
- mr->dma = dma_map_single(ddev, mr_align(mr->pas, MLX5_UMR_ALIGN), size,
- DMA_TO_DEVICE);
+ mr->dma = dma_map_single(ddev, pas, size, DMA_TO_DEVICE);
if (dma_mapping_error(ddev, mr->dma)) {
err = -ENOMEM;
goto free_pas;
@@ -826,6 +825,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
struct mlx5_ib_mr *mr;
int inlen;
int err;
+ bool pg_cap = !!(dev->mdev.caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG);
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
@@ -837,8 +837,12 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
err = -ENOMEM;
goto err_1;
}
- mlx5_ib_populate_pas(dev, umem, page_shift, in->pas, 0);
+ mlx5_ib_populate_pas(dev, umem, page_shift, in->pas,
+ pg_cap ? MLX5_IB_MTT_PRESENT : 0);
+ /* The MLX5_MKEY_INBOX_PG_ACCESS bit allows setting the access flags
+ * in the page list submitted with the command. */
+ in->flags = pg_cap ? cpu_to_be32(MLX5_MKEY_INBOX_PG_ACCESS) : 0;
in->seg.flags = convert_access(access_flags) |
MLX5_ACCESS_MODE_MTT;
in->seg.flags_pd = cpu_to_be32(to_mpd(pd)->pdn);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 0a6fe1b..e0abd4e 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -149,6 +149,9 @@ enum {
MLX5_UMR_INLINE = (1 << 7),
};
+#define MLX5_UMR_MTT_ALIGNMENT 0x40
+#define MLX5_UMR_MTT_MASK (MLX5_UMR_MTT_ALIGNMENT - 1)
+
enum mlx5_event {
MLX5_EVENT_TYPE_COMP = 0x0,
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 15/20] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (13 preceding siblings ...)
2014-03-02 10:49 ` [RFC 14/20] IB/mlx5: Changes in memory region creation to support on-demand paging Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 16/20] IB/mlx5: Add function to read WQE from user-space Haggai Eran
` (6 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran, Shachar Raindel
The new function allows updating the page tables of a memory region after it
was created. This can be used to handle page faults and page invalidations.
Since mlx5_ib_update_mtt will need to work from within page invalidation, so
it must not block on memory allocation. It employs an atomic memory allocation
mechanism that is used as a fallback when kmalloc(GFP_ATOMIC) fails.
In order to reuse code from mlx5_ib_populate_pas, the patch splits this
function and add the needed parameters.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-GN/RETWesp1BDgjK7y7TUQ@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mem.c | 19 +++--
drivers/infiniband/hw/mlx5/mlx5_ib.h | 5 ++
drivers/infiniband/hw/mlx5/mr.c | 130 ++++++++++++++++++++++++++++++++++-
include/linux/mlx5/device.h | 1 +
4 files changed, 148 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index d760bfb..a1a748e 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -139,11 +139,14 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
* dev - mlx5_ib device
* umem - umem to use to fill the pages
* page_shift - determines the page size used in the resulting array
+ * offset - offset into the umem to start from
+ * num_pages - total number of pages to fill
* pas - bus addresses array to fill
* access_flags - access flags to set on all present pages
*/
-void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int access_flags)
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, size_t offset, size_t num_pages,
+ __be64 *pas, int access_flags)
{
int shift = page_shift - PAGE_SHIFT;
int mask = (1 << shift) - 1;
@@ -157,15 +160,16 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
const bool odp = umem->odp_data != NULL;
if (odp) {
- int num_pages = ib_umem_num_pages(umem);
WARN_ON(shift != 0);
WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
for (i = 0; i < num_pages; ++i) {
- dma_addr_t pa = umem->odp_data->dma_list[i];
+ dma_addr_t pa = umem->odp_data->dma_list[offset + i];
pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
}
return;
}
+
+ BUG_ON(!odp && offset);
#endif
i = 0;
@@ -188,6 +192,13 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
}
}
+void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, __be64 *pas, int access_flags)
+{
+ return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
+ ib_umem_num_pages(umem), pas,
+ access_flags);
+}
int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
{
u64 page_size;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index dd93790..f48a511 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -537,6 +537,8 @@ struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc);
struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata);
+int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
+ int npages, int zap);
int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
@@ -568,6 +570,9 @@ int mlx5_ib_init_fmr(struct mlx5_ib_dev *dev);
void mlx5_ib_cleanup_fmr(struct mlx5_ib_dev *dev);
void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
int *ncont, int *order);
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, size_t offset, size_t num_pages,
+ __be64 *pas, int access_flags);
void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, __be64 *pas, int access_flags);
void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 5ea099e..1071cb5 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -44,9 +44,12 @@ enum {
MAX_PENDING_REG_MR = 8,
};
-enum {
- MLX5_UMR_ALIGN = 2048
-};
+#define MLX5_UMR_ALIGN 2048
+
+static __be64 mlx5_ib_update_mtt_emergency_buffer[
+ MLX5_UMR_MTT_MIN_CHUNK_SIZE/sizeof(__be64)]
+ __aligned(MLX5_UMR_ALIGN);
+static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex);
static int order2idx(struct mlx5_ib_dev *dev, int order)
{
@@ -815,6 +818,127 @@ free_mr:
return mr;
}
+int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
+ int zap)
+{
+ struct mlx5_ib_dev *dev = mr->dev;
+ struct device *ddev = dev->ib_dev.dma_device;
+ struct umr_common *umrc = &dev->umrc;
+ struct mlx5_ib_umr_context umr_context;
+ struct ib_umem *umem = mr->umem;
+ int size;
+ __be64 *pas;
+ dma_addr_t dma;
+ struct ib_send_wr wr, *bad;
+ struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr.wr.fast_reg;
+ struct ib_sge sg;
+ int err = 0;
+ const int page_index_alignment = MLX5_UMR_MTT_ALIGNMENT / sizeof(u64);
+ const int page_index_mask = page_index_alignment - 1;
+ size_t pages_mapped = 0;
+ size_t pages_to_map = 0;
+ size_t pages_iter = 0;
+ int use_emergency_buf = 0;
+
+ /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes,
+ * so we need to align the offset and length accordingly */
+ if (start_page_index & page_index_mask) {
+ npages += start_page_index & page_index_mask;
+ start_page_index &= ~page_index_mask;
+ }
+
+ pages_to_map = ALIGN(npages, page_index_alignment);
+
+ if (start_page_index + pages_to_map > MLX5_MAX_UMR_PAGES)
+ return -EINVAL;
+
+ size = sizeof(u64) * pages_to_map;
+ size = min_t(int, PAGE_SIZE, size);
+ /* We allocate with GFP_ATOMIC to avoid recursion into page-reclaim
+ * code, when we are called from an invalidation. The pas buffer must
+ * be 2k-aligned for Connect-IB. */
+ pas = (__be64 *)get_zeroed_page(GFP_ATOMIC);
+ if (!pas) {
+ mlx5_ib_warn(dev, "unable to allocate memory during MTT update, falling back to slower chunked mechanism.\n");
+ pas = mlx5_ib_update_mtt_emergency_buffer;
+ size = MLX5_UMR_MTT_MIN_CHUNK_SIZE;
+ use_emergency_buf = 1;
+ mutex_lock(&mlx5_ib_update_mtt_emergency_buffer_mutex);
+ memset(pas, 0, size);
+ }
+ pages_iter = size / sizeof(u64);
+ dma = dma_map_single(ddev, pas, size,
+ DMA_TO_DEVICE);
+ if (dma_mapping_error(ddev, mr->dma)) {
+ mlx5_ib_err(dev, "unable to map DMA during MTT update.\n");
+ err = -ENOMEM;
+ goto free_pas;
+ }
+
+ for (pages_mapped = 0;
+ pages_mapped < pages_to_map && !err;
+ pages_mapped += pages_iter, start_page_index += pages_iter) {
+ dma_sync_single_for_cpu(ddev, dma, size, DMA_TO_DEVICE);
+
+ npages = min_t(size_t,
+ pages_iter,
+ ib_umem_num_pages(umem) - start_page_index);
+
+ if (!zap) {
+ __mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
+ start_page_index, npages, pas,
+ MLX5_IB_MTT_PRESENT);
+ /* Clear padding after the pages brought from the
+ * umem. */
+ memset(pas + npages, 0, size - npages * sizeof(u64));
+ }
+
+ dma_sync_single_for_device(ddev, dma, size, DMA_TO_DEVICE);
+
+ memset(&wr, 0, sizeof(wr));
+ wr.wr_id = (u64)(unsigned long)&umr_context;
+
+ sg.addr = dma;
+ sg.length = ALIGN(npages * sizeof(u64),
+ MLX5_UMR_MTT_ALIGNMENT);
+ sg.lkey = dev->umrc.mr->lkey;
+
+ wr.send_flags = MLX5_IB_SEND_UMR_CHECK_FREE |
+ MLX5_IB_SEND_UMR_UPDATE_MTT;
+ wr.sg_list = &sg;
+ wr.num_sge = 1;
+ wr.opcode = MLX5_IB_WR_UMR;
+ umrwr->npages = sg.length / sizeof(u64);
+ umrwr->page_shift = PAGE_SHIFT;
+ umrwr->mkey = mr->mmr.key;
+ umrwr->target.offset = start_page_index;
+
+ mlx5_ib_init_umr_context(&umr_context);
+ down(&umrc->sem);
+ err = ib_post_send(umrc->qp, &wr, &bad);
+ if (err) {
+ mlx5_ib_err(dev, "UMR post send failed, err %d\n", err);
+ } else {
+ wait_for_completion(&umr_context.done);
+ if (umr_context.status != IB_WC_SUCCESS) {
+ mlx5_ib_err(dev, "UMR completion failed, code %d\n",
+ umr_context.status);
+ err = -EFAULT;
+ }
+ }
+ up(&umrc->sem);
+ }
+ dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE);
+
+free_pas:
+ if (!use_emergency_buf)
+ free_page((unsigned long)pas);
+ else
+ mutex_unlock(&mlx5_ib_update_mtt_emergency_buffer_mutex);
+
+ return err;
+}
+
static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
u64 length, struct ib_umem *umem,
int npages, int page_shift,
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index e0abd4e..852ef51 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -151,6 +151,7 @@ enum {
#define MLX5_UMR_MTT_ALIGNMENT 0x40
#define MLX5_UMR_MTT_MASK (MLX5_UMR_MTT_ALIGNMENT - 1)
+#define MLX5_UMR_MTT_MIN_CHUNK_SIZE MLX5_UMR_MTT_ALIGNMENT
enum mlx5_event {
MLX5_EVENT_TYPE_COMP = 0x0,
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 16/20] IB/mlx5: Add function to read WQE from user-space
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (14 preceding siblings ...)
2014-03-02 10:49 ` [RFC 15/20] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 17/20] IB/mlx5: Page faults handling infrastructure Haggai Eran
` (5 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
Add a helper function mlx5_ib_read_user_wqe to read information from
user-space owned work queues. The function will be used in a later patch by
the page-fault handling code in mlx5_ib.
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +
drivers/infiniband/hw/mlx5/qp.c | 72 ++++++++++++++++++++++++++++++++++++
include/linux/mlx5/qp.h | 3 ++
3 files changed, 77 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f48a511..8d20408 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -525,6 +525,8 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
int mlx5_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
struct ib_recv_wr **bad_wr);
void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n);
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+ void *buffer, u32 length);
struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, int entries,
int vector, struct ib_ucontext *context,
struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 385501b..535fea3 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -101,6 +101,78 @@ void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n)
return get_wqe(qp, qp->sq.offset + (n << MLX5_IB_SQ_STRIDE));
}
+/*
+ * Copy a user-space WQE to kernel space.
+ *
+ * Copies at least a single WQE, but may copy more data.
+ *
+ * qp - QP to copy from.
+ * send - copy from the send queue when non-zero, use the receive queue
+ * otherwise.
+ * wqe_index - index to start copying from. For send work queues, the
+ * wqe_index is in units of MLX5_SEND_WQE_BB. For receive work queue, it is
+ * the number of work queue element in the queue.
+ * buffer - destination buffer.
+ * length - maximum number of bytes to copy.
+ *
+ * Return the number of bytes copied, or an error code.
+ */
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+ void *buffer, u32 length)
+{
+ struct ib_device *ibdev = qp->ibqp.device;
+ struct mlx5_ib_dev *dev = to_mdev(ibdev);
+ struct mlx5_ib_wq *wq = send ? &qp->sq : &qp->rq;
+ size_t offset;
+ size_t wq_end;
+ struct ib_umem *umem = qp->umem;
+ u32 first_copy_length;
+ int wqe_length;
+ int copied;
+ int ret;
+
+ if (wq->wqe_cnt == 0) {
+ mlx5_ib_dbg(dev, "mlx5_ib_read_user_wqe for a QP with wqe_cnt == 0. qp_type: 0x%x\n",
+ qp->ibqp.qp_type);
+ return -EINVAL;
+ }
+
+ offset = wq->offset + ((wqe_index % wq->wqe_cnt) << wq->wqe_shift);
+ wq_end = wq->offset + (wq->wqe_cnt << wq->wqe_shift);
+
+ if (send && length < sizeof(struct mlx5_wqe_ctrl_seg))
+ return -EINVAL;
+
+ if (offset > umem->length ||
+ (send && offset + sizeof(struct mlx5_wqe_ctrl_seg) > umem->length))
+ return -EINVAL;
+
+ first_copy_length = min_t(u32, offset + length, wq_end) - offset;
+ copied = ib_umem_copy_from(umem, offset, buffer, first_copy_length);
+ if (copied < first_copy_length)
+ return copied;
+
+ if (send) {
+ struct mlx5_wqe_ctrl_seg *ctrl = buffer;
+ int ds = be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_DS_MASK;
+ wqe_length = ds * MLX5_WQE_DS_UNITS;
+ } else {
+ wqe_length = 1 << wq->wqe_shift;
+ }
+
+ if (wqe_length <= first_copy_length)
+ return first_copy_length;
+
+ ret = ib_umem_copy_from(umem, wq->offset,
+ buffer + first_copy_length, wqe_length -
+ first_copy_length);
+ if (ret < 0)
+ return ret;
+ copied += ret;
+
+ return copied;
+}
+
static void mlx5_ib_qp_event(struct mlx5_core_qp *qp, int type)
{
struct ib_qp *ibqp = &to_mibqp(qp)->ibqp;
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 8db515c..52a2ea8 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -182,6 +182,9 @@ struct mlx5_wqe_ctrl_seg {
__be32 imm;
};
+#define MLX5_WQE_CTRL_DS_MASK 0x3f
+#define MLX5_WQE_DS_UNITS 16
+
struct mlx5_wqe_xrc_seg {
__be32 xrc_srqn;
u8 rsvd[12];
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 17/20] IB/mlx5: Page faults handling infrastructure
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (15 preceding siblings ...)
2014-03-02 10:49 ` [RFC 16/20] IB/mlx5: Add function to read WQE from user-space Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 18/20] IB/mlx5: Handle page faults Haggai Eran
` (4 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
* Refactor MR registration and cleanup, and fix reg_pages accounting.
* Create a work queue to handle page fault events in a kthread context.
* Register a fault handler to get events from the core for each QP.
The registered fault handler is empty in this patch, and only a later patch
implements it.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/main.c | 25 +++++-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 66 +++++++++++++++-
drivers/infiniband/hw/mlx5/mr.c | 45 +++++++----
drivers/infiniband/hw/mlx5/odp.c | 145 +++++++++++++++++++++++++++++++++++
drivers/infiniband/hw/mlx5/qp.c | 22 ++++++
include/linux/mlx5/driver.h | 2 +-
6 files changed, 286 insertions(+), 19 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index f8015ac..535ccac 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -934,7 +934,7 @@ static ssize_t show_reg_pages(struct device *device,
struct mlx5_ib_dev *dev =
container_of(device, struct mlx5_ib_dev, ib_dev.dev);
- return sprintf(buf, "%d\n", dev->mdev.priv.reg_pages);
+ return sprintf(buf, "%d\n", atomic_read(&dev->mdev.priv.reg_pages));
}
static ssize_t show_hca(struct device *device, struct device_attribute *attr,
@@ -1468,7 +1468,6 @@ static int init_one(struct pci_dev *pdev,
goto err_eqs;
mutex_init(&dev->cap_mask_mutex);
- spin_lock_init(&dev->mr_lock);
err = create_dev_resources(&dev->devr);
if (err)
@@ -1489,6 +1488,10 @@ static int init_one(struct pci_dev *pdev,
goto err_umrc;
}
+ err = mlx5_ib_odp_init_one(dev);
+ if (err)
+ goto err_umrc;
+
dev->ib_active = true;
return 0;
@@ -1518,6 +1521,7 @@ static void remove_one(struct pci_dev *pdev)
{
struct mlx5_ib_dev *dev = mlx5_pci2ibdev(pdev);
+ mlx5_ib_odp_remove_one(dev);
destroy_umrc_res(dev);
ib_unregister_device(&dev->ib_dev);
destroy_dev_resources(&dev->devr);
@@ -1542,12 +1546,27 @@ static struct pci_driver mlx5_ib_driver = {
static int __init mlx5_ib_init(void)
{
- return pci_register_driver(&mlx5_ib_driver);
+ int err;
+
+ err = mlx5_ib_odp_init();
+ if (err)
+ return err;
+
+ err = pci_register_driver(&mlx5_ib_driver);
+ if (err)
+ goto clean_odp;
+
+ return err;
+
+clean_odp:
+ mlx5_ib_odp_cleanup();
+ return err;
}
static void __exit mlx5_ib_cleanup(void)
{
pci_unregister_driver(&mlx5_ib_driver);
+ mlx5_ib_odp_cleanup();
}
module_init(mlx5_ib_init);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 8d20408..f4240cd 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -149,6 +149,29 @@ enum {
MLX5_QP_EMPTY
};
+/*
+ * Connect-IB can trigger up to four concurrent pagefaults
+ * per-QP.
+ */
+enum mlx5_ib_pagefault_context {
+ MLX5_IB_PAGEFAULT_RESPONDER_READ,
+ MLX5_IB_PAGEFAULT_REQUESTOR_READ,
+ MLX5_IB_PAGEFAULT_RESPONDER_WRITE,
+ MLX5_IB_PAGEFAULT_REQUESTOR_WRITE,
+ MLX5_IB_PAGEFAULT_CONTEXTS
+};
+
+static inline enum mlx5_ib_pagefault_context
+ mlx5_ib_get_pagefault_context(struct mlx5_pagefault *pagefault)
+{
+ return pagefault->flags & (MLX5_PFAULT_REQUESTOR | MLX5_PFAULT_WRITE);
+}
+
+struct mlx5_ib_pfault {
+ struct work_struct work;
+ struct mlx5_pagefault mpfault;
+};
+
struct mlx5_ib_qp {
struct ib_qp ibqp;
struct mlx5_core_qp mqp;
@@ -194,6 +217,21 @@ struct mlx5_ib_qp {
/* Store signature errors */
bool signature_en;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ /*
+ * A flag that is true for QP's that are in a state that doesn't
+ * allow page faults, and shouldn't schedule any more faults.
+ */
+ int disable_page_faults;
+ /*
+ * The disable_page_faults_lock protects a QP's disable_page_faults
+ * field, allowing for a thread to atomically check whether the QP
+ * allows page faults, and if so schedule a page fault.
+ */
+ spinlock_t disable_page_faults_lock;
+ struct mlx5_ib_pfault pagefaults[MLX5_IB_PAGEFAULT_CONTEXTS];
+#endif
};
struct mlx5_ib_cq_buf {
@@ -394,13 +432,17 @@ struct mlx5_ib_dev {
struct umr_common umrc;
/* sync used page count stats
*/
- spinlock_t mr_lock;
struct mlx5_ib_resources devr;
struct mlx5_mr_cache cache;
struct timer_list delay_timer;
int fill_delay;
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
struct ib_odp_caps odp_caps;
+ /*
+ * Sleepable RCU that prevents destruction of MRs while they are still
+ * being used by a page fault handler.
+ */
+ struct srcu_struct mr_srcu;
#endif
};
@@ -587,8 +629,30 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+extern struct workqueue_struct *mlx5_ib_page_fault_wq;
+
int mlx5_ib_query_odp_caps(struct ib_device *ibdev, struct ib_odp_caps *caps);
int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
+void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault);
+void mlx5_ib_odp_create_qp(struct mlx5_ib_qp *qp);
+int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev);
+void mlx5_ib_odp_remove_one(struct mlx5_ib_dev *ibdev);
+int __init mlx5_ib_odp_init(void);
+void mlx5_ib_odp_cleanup(void);
+void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
+void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
+static inline void mlx5_ib_odp_create_qp(struct mlx5_ib_qp *qp) {}
+static inline int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev) { return 0; }
+static inline void mlx5_ib_odp_remove_one(struct mlx5_ib_dev *ibdev) {}
+static inline int mlx5_ib_odp_init(void) { return 0; }
+static inline void mlx5_ib_odp_cleanup(void) {}
+static inline void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp) {}
+static inline void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp) {}
+
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline void init_query_mad(struct ib_smp *mad)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 1071cb5..4462ca8 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -51,6 +51,8 @@ static __be64 mlx5_ib_update_mtt_emergency_buffer[
__aligned(MLX5_UMR_ALIGN);
static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex);
+static int clean_mr(struct mlx5_ib_mr *mr);
+
static int order2idx(struct mlx5_ib_dev *dev, int order)
{
struct mlx5_mr_cache *cache = &dev->cache;
@@ -1039,6 +1041,10 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mlx5_ib_dbg(dev, "cache empty for order %d", order);
mr = NULL;
}
+ } else if (access_flags & IB_ACCESS_ON_DEMAND) {
+ err = -EINVAL;
+ pr_err("Got MR registration for ODP MR > 512MB, not supported for Connect-IB");
+ goto error;
}
if (!mr)
@@ -1054,9 +1060,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mr->umem = umem;
mr->npages = npages;
- spin_lock(&dev->mr_lock);
- dev->mdev.priv.reg_pages += npages;
- spin_unlock(&dev->mr_lock);
+ atomic_add(npages, &dev->mdev.priv.reg_pages);
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
@@ -1100,12 +1104,9 @@ error:
return err;
}
-int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+static int clean_mr(struct mlx5_ib_mr *mr)
{
- struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
- struct mlx5_ib_mr *mr = to_mmr(ibmr);
- struct ib_umem *umem = mr->umem;
- int npages = mr->npages;
+ struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
int umred = mr->umred;
int err;
@@ -1125,16 +1126,32 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
free_cached_mr(dev, mr);
}
+ if (!umred)
+ kfree(mr);
+
+ return 0;
+}
+
+int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
+{
+ struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
+ struct mlx5_ib_mr *mr = to_mmr(ibmr);
+ int npages = mr->npages;
+ struct ib_umem *umem = mr->umem;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ if (umem)
+ /* Wait for all running page-fault handlers to finish. */
+ synchronize_srcu(&dev->mr_srcu);
+#endif
+
+ clean_mr(mr);
+
if (umem) {
ib_umem_release(umem);
- spin_lock(&dev->mr_lock);
- dev->mdev.priv.reg_pages -= npages;
- spin_unlock(&dev->mr_lock);
+ atomic_sub(npages, &dev->mdev.priv.reg_pages);
}
- if (!umred)
- kfree(mr);
-
return 0;
}
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 71ef604..f297f14 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -32,6 +32,8 @@
#include "mlx5_ib.h"
+struct workqueue_struct *mlx5_ib_page_fault_wq;
+
#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do { \
if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name) \
ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name; \
@@ -67,3 +69,146 @@ int mlx5_ib_query_odp_caps(struct ib_device *ibdev, struct ib_odp_caps *caps)
return 0;
}
+
+static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
+ u32 key)
+{
+ u32 base_key = mlx5_base_mkey(key);
+ struct mlx5_core_mr *mmr = __mlx5_mr_lookup(&dev->mdev, base_key);
+
+ if (!mmr || mmr->key != key)
+ return NULL;
+
+ return container_of(mmr, struct mlx5_ib_mr, mmr);
+}
+
+static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault,
+ int error) {
+ struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+ int ret = mlx5_core_page_fault_resume(&dev->mdev, qp->mqp.qpn,
+ pfault->mpfault.flags,
+ error);
+ if (ret)
+ pr_err("Failed to resolve the page fault on QP 0x%x\n",
+ qp->mqp.qpn);
+}
+
+void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault)
+{
+ u8 event_subtype = pfault->mpfault.event_subtype;
+
+ switch (event_subtype) {
+ default:
+ pr_warn("Invalid page fault event subtype: 0x%x\n",
+ event_subtype);
+ mlx5_ib_page_fault_resume(qp, pfault, 1);
+ break;
+ }
+}
+
+static void mlx5_ib_qp_pfault_action(struct work_struct *work)
+{
+ struct mlx5_ib_pfault *pfault = container_of(work,
+ struct mlx5_ib_pfault,
+ work);
+ enum mlx5_ib_pagefault_context context =
+ mlx5_ib_get_pagefault_context(&pfault->mpfault);
+ struct mlx5_ib_qp *qp = container_of(pfault, struct mlx5_ib_qp,
+ pagefaults[context]);
+ mlx5_ib_mr_pfault_handler(qp, pfault);
+}
+
+void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&qp->disable_page_faults_lock, flags);
+ qp->disable_page_faults = 1;
+ spin_unlock_irqrestore(&qp->disable_page_faults_lock, flags);
+
+ /*
+ * Note that at this point, we are guarenteed that no more
+ * work queue elements will be posted to the work queue with
+ * the QP we are closing.
+ */
+ flush_workqueue(mlx5_ib_page_fault_wq);
+}
+
+void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&qp->disable_page_faults_lock, flags);
+ qp->disable_page_faults = 0;
+ spin_unlock_irqrestore(&qp->disable_page_faults_lock, flags);
+}
+
+static void mlx5_ib_pfault_handler(struct mlx5_core_qp *qp,
+ struct mlx5_pagefault *pfault)
+{
+ /*
+ * Note that we will only get one fault event per QP per context
+ * (responder/initiator, read/write), until we resolve the page fault
+ * with the mlx5_ib_page_fault_resume command. Since this function is
+ * called from within the work element, there is no risk of missing
+ * events.
+ */
+ struct mlx5_ib_qp *mibqp = to_mibqp(qp);
+ enum mlx5_ib_pagefault_context context =
+ mlx5_ib_get_pagefault_context(pfault);
+ struct mlx5_ib_pfault *qp_pfault = &mibqp->pagefaults[context];
+
+ qp_pfault->mpfault = *pfault;
+
+ /* No need to stop interrupts here since we are in an interrupt */
+ spin_lock(&mibqp->disable_page_faults_lock);
+ if (!mibqp->disable_page_faults)
+ queue_work(mlx5_ib_page_fault_wq, &qp_pfault->work);
+ spin_unlock(&mibqp->disable_page_faults_lock);
+}
+
+void mlx5_ib_odp_create_qp(struct mlx5_ib_qp *qp)
+{
+ int i;
+
+ qp->disable_page_faults = 1;
+ spin_lock_init(&qp->disable_page_faults_lock);
+
+ qp->mqp.pfault_handler = mlx5_ib_pfault_handler;
+
+ for (i = 0; i < MLX5_IB_PAGEFAULT_CONTEXTS; ++i)
+ INIT_WORK(&qp->pagefaults[i].work, mlx5_ib_qp_pfault_action);
+}
+
+int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev)
+{
+ int ret;
+
+ ret = init_srcu_struct(&ibdev->mr_srcu);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+void mlx5_ib_odp_remove_one(struct mlx5_ib_dev *ibdev)
+{
+ cleanup_srcu_struct(&ibdev->mr_srcu);
+}
+
+int __init mlx5_ib_odp_init(void)
+{
+ mlx5_ib_page_fault_wq =
+ create_singlethread_workqueue("mlx5_ib_page_faults");
+ if (!mlx5_ib_page_fault_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void mlx5_ib_odp_cleanup(void)
+{
+ destroy_workqueue(mlx5_ib_page_fault_wq);
+}
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 535fea3..a14a43f 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -870,6 +870,8 @@ static int create_qp_common(struct mlx5_ib_dev *dev, struct ib_pd *pd,
int inlen = sizeof(*in);
int err;
+ mlx5_ib_odp_create_qp(qp);
+
mutex_init(&qp->mutex);
spin_lock_init(&qp->sq.lock);
spin_lock_init(&qp->rq.lock);
@@ -1685,6 +1687,15 @@ static int __mlx5_ib_modify_qp(struct ib_qp *ibqp,
if (mlx5_st < 0)
goto out;
+ /* If moving to a reset or error state, we must disable page faults on
+ * this QP and flush all current page faults. Otherwise a stale page
+ * fault may attempt to work on this QP after it is reset and moved
+ * again to RTS, and may cause the driver and the device to get out of
+ * sync. */
+ if (cur_state != IB_QPS_RESET && cur_state != IB_QPS_ERR &&
+ (new_state == IB_QPS_RESET || new_state == IB_QPS_ERR))
+ mlx5_ib_qp_disable_pagefaults(qp);
+
optpar = ib_mask_to_mlx5_opt(attr_mask);
optpar &= opt_mask[mlx5_cur][mlx5_new][mlx5_st];
in->optparam = cpu_to_be32(optpar);
@@ -1694,6 +1705,9 @@ static int __mlx5_ib_modify_qp(struct ib_qp *ibqp,
if (err)
goto out;
+ if (cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
+ mlx5_ib_qp_enable_pagefaults(qp);
+
qp->state = new_state;
if (attr_mask & IB_QP_ACCESS_FLAGS)
@@ -3013,6 +3027,14 @@ int mlx5_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr
int mlx5_state;
int err = 0;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ /*
+ * Wait for any outstanding page faults, in case the user frees memory
+ * based upon this query's result.
+ */
+ flush_workqueue(mlx5_ib_page_fault_wq);
+#endif
+
mutex_lock(&qp->mutex);
outb = kzalloc(sizeof(*outb), GFP_KERNEL);
if (!outb) {
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 4f162e8..1d6266f 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -526,7 +526,7 @@ struct mlx5_priv {
struct workqueue_struct *pg_wq;
struct rb_root page_root;
int fw_pages;
- int reg_pages;
+ atomic_t reg_pages;
struct list_head free_list;
struct mlx5_core_health health;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 18/20] IB/mlx5: Handle page faults
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (16 preceding siblings ...)
2014-03-02 10:49 ` [RFC 17/20] IB/mlx5: Page faults handling infrastructure Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 19/20] IB/mlx5: Add support for RDMA write responder " Haggai Eran
` (3 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
This patch implement a page fault handler (leaving the pages pinned as
of time being). The page fault handler handles initiator and responder
page faults for UD/RC transports, and for send/receive operations.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/odp.c | 396 +++++++++++++++++++++++++++++++++++++++
include/linux/mlx5/qp.h | 7 +
2 files changed, 403 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f297f14..c6da238 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -30,6 +30,9 @@
* SOFTWARE.
*/
+#include <rdma/ib_umem.h>
+#include <rdma/ib_umem_odp.h>
+
#include "mlx5_ib.h"
struct workqueue_struct *mlx5_ib_page_fault_wq;
@@ -94,12 +97,405 @@ static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
qp->mqp.qpn);
}
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ * page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ * abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault, u32 key, u64 io_virt,
+ size_t bcnt, u32 *bytes_mapped)
+{
+ struct mlx5_ib_dev *mib_dev = to_mdev(qp->ibqp.pd->device);
+ int srcu_key;
+ u64 start_idx;
+ int npages = 0, ret = 0;
+ struct mlx5_ib_mr *mr;
+ srcu_key = srcu_read_lock(&mib_dev->mr_srcu);
+ mr = mlx5_ib_odp_find_mr_lkey(mib_dev, key);
+ /*
+ * If we didn't find the MR, it means the MR was closed while we were
+ * handling the ODP event. In this case we return -EFAULT so that the
+ * QP will be closed.
+ */
+ if (!mr || !mr->ibmr.pd) {
+ pr_err("Failed to find relevant mr for lkey=0x%06x, probably the MR was destroyed\n",
+ key);
+ ret = -EFAULT;
+ goto srcu_unlock;
+ }
+ if (!mr->umem->odp_data) {
+ pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault handler.\n",
+ key);
+ if (bytes_mapped)
+ *bytes_mapped +=
+ (bcnt - pfault->mpfault.bytes_committed);
+ goto srcu_unlock;
+ }
+ if (mr->ibmr.pd != qp->ibqp.pd) {
+ pr_err("Page-fault with different PDs for QP and MR.\n");
+ ret = -EFAULT;
+ goto srcu_unlock;
+ }
+
+ /*
+ * Avoid branches - this code will perform correctly
+ * in all iterations (in iteration 2 and above,
+ * gather_commit == 0).
+ */
+ io_virt += pfault->mpfault.bytes_committed;
+ bcnt -= pfault->mpfault.bytes_committed;
+
+ start_idx = (io_virt - (mr->mmr.iova & PAGE_MASK)) >> PAGE_SHIFT;
+
+ npages = ib_umem_odp_map_dma_pages(mr->umem, io_virt, bcnt,
+ mr->umem->writable ?
+ (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT) :
+ ODP_READ_ALLOWED_BIT,
+ atomic_read(&mr->umem->odp_data->notifiers_seq));
+ if (npages < 0) {
+ ret = npages;
+ goto srcu_unlock;
+ }
+
+ if (npages > 0) {
+ mutex_lock(&mr->umem->odp_data->umem_mutex);
+ /*
+ * No need to check whether the MTTs really belong to
+ * this MR, since ib_umem_odp_map_dma_pages already
+ * checks this.
+ */
+ ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+ mutex_unlock(&mr->umem->odp_data->umem_mutex);
+
+ if (bytes_mapped) {
+ u32 new_mappings = npages * PAGE_SIZE -
+ (io_virt - round_down(io_virt, PAGE_SIZE));
+ *bytes_mapped += min_t(u32, new_mappings, bcnt);
+ }
+ }
+ if (ret) {
+ pr_err("Failed to update mkey page tables\n");
+ ret = -EAGAIN;
+ goto srcu_unlock;
+ }
+
+srcu_unlock:
+ srcu_read_unlock(&mib_dev->mr_srcu, srcu_key);
+ pfault->mpfault.bytes_committed = 0;
+ return ret ? ret : npages;
+}
+
+/**
+ * Parse a series of data segments for page fault handling.
+ *
+ * @qp the QP on which the fault occurred.
+ * @pfault contains page fault information.
+ * @wqe points at the first data segment in the WQE.
+ * @wqe_end points after the end of the WQE.
+ * @bytes_mapped receives the number of bytes that the function was able to
+ * map. This allows the caller to decide intelligently whether
+ * enough memory was mapped to resolve the page fault
+ * successfully (e.g. enough for the next MTU, or the entire
+ * WQE).
+ * @total_wqe_bytes receives the total data size of this WQE in bytes (minus
+ * the committed bytes).
+ *
+ * Returns the number of pages loaded if positive, zero for an empty WQE, or a
+ * negative error code.
+ */
+static int pagefault_data_segments(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault, void *wqe, void *wqe_end,
+ u32 *bytes_mapped, u32 *total_wqe_bytes, int receive_queue)
+{
+ int ret = 0, npages = 0;
+ u64 io_virt;
+ u32 key;
+ u32 byte_count;
+ size_t bcnt;
+ int inline_segment;
+
+ /* Skip SRQ next-WQE segment. */
+ if (receive_queue && qp->ibqp.srq)
+ wqe += sizeof(struct mlx5_wqe_srq_next_seg);
+
+ if (bytes_mapped)
+ *bytes_mapped = 0;
+ if (total_wqe_bytes)
+ *total_wqe_bytes = 0;
+
+ while (wqe < wqe_end) {
+ struct mlx5_wqe_data_seg *dseg = wqe;
+ io_virt = be64_to_cpu(dseg->addr);
+ key = be32_to_cpu(dseg->lkey);
+ byte_count = be32_to_cpu(dseg->byte_count);
+ inline_segment = !!(byte_count & MLX5_INLINE_SEG);
+ bcnt = byte_count & ~MLX5_INLINE_SEG;
+
+ if (inline_segment) {
+ bcnt = bcnt & MLX5_WQE_INLINE_SEG_BYTE_COUNT_MASK;
+ wqe += ALIGN(sizeof(struct mlx5_wqe_inline_seg) + bcnt,
+ 16);
+ } else {
+ wqe += sizeof(*dseg);
+ }
+
+ /* receive WQE end of sg list. */
+ if (receive_queue && bcnt == 0 && key == MLX5_INVALID_LKEY &&
+ io_virt == 0)
+ break;
+
+ if (!inline_segment && total_wqe_bytes) {
+ *total_wqe_bytes += bcnt - min_t(size_t, bcnt,
+ pfault->mpfault.bytes_committed);
+ }
+
+ /* A zero length data segment designates a length of 2GB. */
+ if (bcnt == 0)
+ bcnt = 1U << 31;
+
+ if (inline_segment || bcnt <= pfault->mpfault.bytes_committed) {
+ pfault->mpfault.bytes_committed -=
+ min_t(size_t, bcnt,
+ pfault->mpfault.bytes_committed);
+ continue;
+ }
+
+ ret = pagefault_single_data_segment(qp, pfault,
+ key, io_virt, bcnt, bytes_mapped);
+ if (ret < 0)
+ break;
+ npages += ret;
+ }
+
+ return ret < 0 ? ret : npages;
+}
+
+/*
+ * Parse initiator WQE. Advances the wqe pointer to point at the
+ * scatter-gather list, and set wqe_end to the end of the WQE.
+ */
+static int mlx5_ib_mr_initiator_pfault_handler(
+ struct mlx5_ib_qp *qp, struct mlx5_ib_pfault *pfault,
+ void **wqe, void **wqe_end, int wqe_length)
+{
+ struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+ struct mlx5_wqe_ctrl_seg *ctrl = *wqe;
+ u16 wqe_index = pfault->mpfault.wqe.wqe_index;
+ unsigned ds, opcode;
+#if defined(DEBUG)
+ u32 ctrl_wqe_index, ctrl_qpn;
+#endif
+
+ ds = be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_DS_MASK;
+ if (ds * MLX5_WQE_DS_UNITS > wqe_length) {
+ mlx5_ib_err(dev, "Unable to read the complete WQE. ds = 0x%x, ret = 0x%x\n",
+ ds, wqe_length);
+ return -EFAULT;
+ }
+
+ if (ds == 0) {
+ mlx5_ib_err(dev, "Got WQE with zero DS. wqe_index=%x, qpn=%x\n",
+ wqe_index, qp->mqp.qpn);
+ return -EFAULT;
+ }
+
+#if defined(DEBUG)
+ ctrl_wqe_index = (be32_to_cpu(ctrl->opmod_idx_opcode) &
+ MLX5_WQE_CTRL_WQE_INDEX_MASK) >>
+ MLX5_WQE_CTRL_WQE_INDEX_SHIFT;
+ if (wqe_index != ctrl_wqe_index) {
+ mlx5_ib_err(dev, "Got WQE with invalid wqe_index. wqe_index=0x%x, qpn=0x%x ctrl->wqe_index=0x%x\n",
+ wqe_index, qp->mqp.qpn,
+ ctrl_wqe_index);
+ return -EFAULT;
+ }
+
+ ctrl_qpn = (be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_QPN_MASK) >>
+ MLX5_WQE_CTRL_QPN_SHIFT;
+ if (qp->mqp.qpn != ctrl_qpn) {
+ mlx5_ib_err(dev, "Got WQE with incorrect QP number. wqe_index=0x%x, qpn=0x%x ctrl->qpn=0x%x\n",
+ wqe_index, qp->mqp.qpn,
+ ctrl_qpn);
+ return -EFAULT;
+ }
+#endif /* DEBUG */
+
+ *wqe_end = *wqe + ds * MLX5_WQE_DS_UNITS;
+ *wqe += sizeof(*ctrl);
+
+ opcode = be32_to_cpu(ctrl->opmod_idx_opcode) &
+ MLX5_WQE_CTRL_OPCODE_MASK;
+ switch (qp->ibqp.qp_type) {
+ case IB_QPT_RC:
+ switch (opcode) {
+ case MLX5_OPCODE_SEND:
+ case MLX5_OPCODE_SEND_IMM:
+ case MLX5_OPCODE_SEND_INVAL:
+ if (!(dev->odp_caps.per_transport_caps.rc_odp_caps &
+ IB_ODP_SUPPORT_SEND))
+ goto invalid_transport_or_opcode;
+ break;
+ case MLX5_OPCODE_RDMA_WRITE:
+ case MLX5_OPCODE_RDMA_WRITE_IMM:
+ if (!(dev->odp_caps.per_transport_caps.rc_odp_caps &
+ IB_ODP_SUPPORT_WRITE))
+ goto invalid_transport_or_opcode;
+ *wqe += sizeof(struct mlx5_wqe_raddr_seg);
+ break;
+ default:
+ goto invalid_transport_or_opcode;
+ }
+ break;
+ case IB_QPT_UD:
+ switch (opcode) {
+ case MLX5_OPCODE_SEND:
+ case MLX5_OPCODE_SEND_IMM:
+ if (!(dev->odp_caps.per_transport_caps.ud_odp_caps &
+ IB_ODP_SUPPORT_SEND))
+ goto invalid_transport_or_opcode;
+ *wqe += sizeof(struct mlx5_wqe_datagram_seg);
+ break;
+ default:
+ goto invalid_transport_or_opcode;
+ }
+ break;
+ default:
+invalid_transport_or_opcode:
+ mlx5_ib_err(dev, "ODP fault on QP of an unsupported opcode or transport. transport: 0x%x opcode: 0x%x.\n",
+ qp->ibqp.qp_type, opcode);
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+/*
+ * Parse responder WQE. Advances the wqe pointer to point at the
+ * scatter-gather list, and set wqe_end to the end of the WQE.
+ */
+static int mlx5_ib_mr_responder_pfault_handler(
+ struct mlx5_ib_qp *qp, struct mlx5_ib_pfault *pfault,
+ void **wqe, void **wqe_end, int wqe_length)
+{
+ struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+ struct mlx5_ib_wq *wq = &qp->rq;
+ int wqe_size = 1 << wq->wqe_shift;
+
+ if (qp->ibqp.srq) {
+ mlx5_ib_err(dev, "ODP fault on SRQ is not supported\n");
+ return -EFAULT;
+ }
+
+ if (qp->wq_sig) {
+ mlx5_ib_err(dev, "ODP fault with WQE signatures is not supported\n");
+ return -EFAULT;
+ }
+
+ if (wqe_size > wqe_length) {
+ mlx5_ib_err(dev, "Couldn't read all of the receive WQE's content\n");
+ return -EFAULT;
+ }
+
+ switch (qp->ibqp.qp_type) {
+ case IB_QPT_RC:
+ if (!(dev->odp_caps.per_transport_caps.rc_odp_caps &
+ IB_ODP_SUPPORT_RECV))
+ goto invalid_transport_or_opcode;
+ break;
+ default:
+invalid_transport_or_opcode:
+ mlx5_ib_err(dev, "ODP fault on QP of an unsupported transport. transport: 0x%x\n",
+ qp->ibqp.qp_type);
+ return -EFAULT;
+ }
+
+ *wqe_end = *wqe + wqe_size;
+
+ return 0;
+}
+
+static void mlx5_ib_mr_wqe_pfault_handler(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault)
+{
+ struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+ int ret;
+ void *wqe, *wqe_end;
+ u32 bytes_mapped, total_wqe_bytes;
+ char *buffer = NULL;
+ int resume_with_error = 0;
+ u16 wqe_index = pfault->mpfault.wqe.wqe_index;
+ int requestor = pfault->mpfault.flags & MLX5_PFAULT_REQUESTOR;
+
+ buffer = (char *)__get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ mlx5_ib_err(dev, "Error allocating memory for IO page fault handling.\n");
+ resume_with_error = 1;
+ goto resolve_page_fault;
+ }
+
+ ret = mlx5_ib_read_user_wqe(qp, requestor, wqe_index, buffer,
+ PAGE_SIZE);
+ if (ret < 0) {
+ mlx5_ib_err(dev, "Failed reading a WQE following page fault, error=%x, wqe_index=%x, qpn=%x\n",
+ -ret, wqe_index, qp->mqp.qpn);
+ resume_with_error = 1;
+ goto resolve_page_fault;
+ }
+
+ wqe = buffer;
+ if (requestor)
+ ret = mlx5_ib_mr_initiator_pfault_handler(qp, pfault, &wqe,
+ &wqe_end, ret);
+ else
+ ret = mlx5_ib_mr_responder_pfault_handler(qp, pfault, &wqe,
+ &wqe_end, ret);
+ if (ret < 0) {
+ resume_with_error = 1;
+ goto resolve_page_fault;
+ }
+
+ if (wqe >= wqe_end) {
+ mlx5_ib_err(dev, "ODP fault on invalid WQE.\n");
+ resume_with_error = 1;
+ goto resolve_page_fault;
+ }
+
+ ret = pagefault_data_segments(qp, pfault, wqe, wqe_end, &bytes_mapped,
+ &total_wqe_bytes, !requestor);
+ if (ret == -EAGAIN) {
+ goto resolve_page_fault;
+ } else if (ret < 0 || total_wqe_bytes > bytes_mapped) {
+ mlx5_ib_err(dev, "Error getting user pages for page fault. Error: 0x%x\n",
+ -ret);
+ resume_with_error = 1;
+ goto resolve_page_fault;
+ }
+
+resolve_page_fault:
+ mlx5_ib_page_fault_resume(qp, pfault, resume_with_error);
+ mlx5_ib_dbg(dev, "PAGE FAULT completed. QP 0x%x resume_with_error=%d, flags: 0x%x\n",
+ qp->mqp.qpn, resume_with_error, pfault->mpfault.flags);
+
+ free_page((unsigned long)buffer);
+}
+
void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
struct mlx5_ib_pfault *pfault)
{
u8 event_subtype = pfault->mpfault.event_subtype;
switch (event_subtype) {
+ case MLX5_PFAULT_SUBTYPE_WQE:
+ mlx5_ib_mr_wqe_pfault_handler(qp, pfault);
+ break;
default:
pr_warn("Invalid page fault event subtype: 0x%x\n",
event_subtype);
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 52a2ea8..cbf544b1 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -183,7 +183,12 @@ struct mlx5_wqe_ctrl_seg {
};
#define MLX5_WQE_CTRL_DS_MASK 0x3f
+#define MLX5_WQE_CTRL_QPN_MASK 0xffffff00
+#define MLX5_WQE_CTRL_QPN_SHIFT 8
#define MLX5_WQE_DS_UNITS 16
+#define MLX5_WQE_CTRL_OPCODE_MASK 0xff
+#define MLX5_WQE_CTRL_WQE_INDEX_MASK 0x00ffff00
+#define MLX5_WQE_CTRL_WQE_INDEX_SHIFT 8
struct mlx5_wqe_xrc_seg {
__be32 xrc_srqn;
@@ -288,6 +293,8 @@ struct mlx5_wqe_signature_seg {
u8 rsvd1[11];
};
+#define MLX5_WQE_INLINE_SEG_BYTE_COUNT_MASK 0x3ff
+
struct mlx5_wqe_inline_seg {
__be32 byte_count;
};
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 19/20] IB/mlx5: Add support for RDMA write responder page faults
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (17 preceding siblings ...)
2014-03-02 10:49 ` [RFC 18/20] IB/mlx5: Handle page faults Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-02 10:49 ` [RFC 20/20] IB/mlx5: Implement on demand paging by adding support for MMU notifiers Haggai Eran
` (2 subsequent siblings)
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/odp.c | 71 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index c6da238..56f2e3b 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -35,6 +35,8 @@
#include "mlx5_ib.h"
+#define MAX_PREFETCH_LEN (4*1024*1024U)
+
struct workqueue_struct *mlx5_ib_page_fault_wq;
#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do { \
@@ -487,6 +489,72 @@ resolve_page_fault:
free_page((unsigned long)buffer);
}
+static int pages_in_range(u64 address, u32 length)
+{
+ return (ALIGN(address + length, PAGE_SIZE) -
+ (address & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static void mlx5_ib_mr_rdma_pfault_handler(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault)
+{
+ struct mlx5_pagefault *mpfault = &pfault->mpfault;
+ u64 address;
+ u32 length;
+ u32 prefetch_len = mpfault->bytes_committed;
+ int prefetch_activated = 0;
+ u32 rkey = mpfault->rdma.r_key;
+ int ret;
+ struct mlx5_ib_pfault dummy_pfault = {};
+ dummy_pfault.mpfault.bytes_committed = 0;
+
+ mpfault->rdma.rdma_va += mpfault->bytes_committed;
+ mpfault->rdma.rdma_op_len -= min(mpfault->bytes_committed,
+ mpfault->rdma.rdma_op_len);
+ mpfault->bytes_committed = 0;
+
+ address = mpfault->rdma.rdma_va;
+ length = mpfault->rdma.rdma_op_len;
+
+ /* For some operations, the hardware cannot tell the exact message
+ * length, and in those cases it reports zero. Use prefetch
+ * logic. */
+ if (length == 0) {
+ prefetch_activated = 1;
+ length = mpfault->rdma.packet_size;
+ prefetch_len = min(MAX_PREFETCH_LEN, prefetch_len);
+ }
+
+ ret = pagefault_single_data_segment(qp, pfault, rkey, address, length,
+ NULL);
+ if (ret == -EAGAIN) {
+ /* We're racing with an invalidation, don't prefetch */
+ prefetch_activated = 0;
+ } else if (ret < 0 || pages_in_range(address, length) > ret) {
+ mlx5_ib_page_fault_resume(qp, pfault, 1);
+ return;
+ }
+
+ mlx5_ib_page_fault_resume(qp, pfault, 0);
+
+ /* At this point, there might be a new pagefault already arriving in
+ * the eq, switch to the dummy pagefault for the rest of the
+ * processing. We're still OK with the objects being alive as the
+ * work-queue is being fenced. */
+
+ if (prefetch_activated) {
+ ret = pagefault_single_data_segment(qp, &dummy_pfault, rkey,
+ address,
+ prefetch_len,
+ NULL);
+ if (ret < 0) {
+ pr_warn("Prefetch failed (ret = %d, prefetch_activated = %d) for QPN %d, address: 0x%.16llx, length = 0x%.16x\n",
+ ret, prefetch_activated,
+ qp->ibqp.qp_num, address, prefetch_len);
+ }
+ }
+}
+
void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
struct mlx5_ib_pfault *pfault)
{
@@ -496,6 +564,9 @@ void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
case MLX5_PFAULT_SUBTYPE_WQE:
mlx5_ib_mr_wqe_pfault_handler(qp, pfault);
break;
+ case MLX5_PFAULT_SUBTYPE_RDMA:
+ mlx5_ib_mr_rdma_pfault_handler(qp, pfault);
+ break;
default:
pr_warn("Invalid page fault event subtype: 0x%x\n",
event_subtype);
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [RFC 20/20] IB/mlx5: Implement on demand paging by adding support for MMU notifiers
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (18 preceding siblings ...)
2014-03-02 10:49 ` [RFC 19/20] IB/mlx5: Add support for RDMA write responder " Haggai Eran
@ 2014-03-02 10:49 ` Haggai Eran
2014-03-13 12:00 ` [RFC 00/20] On demand paging Haggai Eran
2014-04-24 14:10 ` Or Gerlitz
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-02 10:49 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Cc: Roland Dreier, Andrea Arcangeli, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss, Haggai Eran
From: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
* Implement the relevant invalidation functions (zap MTTs as needed)
* Implement interlocking (and rollback in the page fault handlers) for cases of a racing notifier and fault.
* With this patch we can now enable the capability bits for supporting RC
send/receive and UD send.
Signed-off-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/hw/mlx5/main.c | 4 ++
drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +
drivers/infiniband/hw/mlx5/mr.c | 79 +++++++++++++++++++++--
drivers/infiniband/hw/mlx5/odp.c | 118 +++++++++++++++++++++++++++++++----
4 files changed, 188 insertions(+), 16 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 535ccac..f7941c8 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -644,6 +644,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
goto out_count;
}
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
+#endif
+
INIT_LIST_HEAD(&context->db_page_list);
mutex_init(&context->db_page_mutex);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f4240cd..a6ef427 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -327,6 +327,7 @@ struct mlx5_ib_mr {
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx *sig;
+ int live;
};
struct mlx5_ib_fast_reg_page_list {
@@ -642,6 +643,8 @@ int __init mlx5_ib_odp_init(void);
void mlx5_ib_odp_cleanup(void);
void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
+ unsigned long end);
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 4462ca8..c596507 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
#include <linux/export.h>
#include <linux/delay.h>
#include <rdma/ib_umem.h>
+#include <rdma/ib_umem_odp.h>
#include <rdma/ib_verbs.h>
#include "mlx5_ib.h"
@@ -53,6 +54,18 @@ static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex);
static int clean_mr(struct mlx5_ib_mr *mr);
+static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
+{
+ int err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ /* Wait until all page fault handlers using the mr complete. */
+ synchronize_srcu(&dev->mr_srcu);
+#endif
+
+ return err;
+}
+
static int order2idx(struct mlx5_ib_dev *dev, int order)
{
struct mlx5_mr_cache *cache = &dev->cache;
@@ -187,7 +200,7 @@ static void remove_keys(struct mlx5_ib_dev *dev, int c, int num)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
- err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+ err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -478,7 +491,7 @@ static void clean_keys(struct mlx5_ib_dev *dev, int c)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
- err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+ err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -804,6 +817,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
mr->mmr.size = len;
mr->mmr.pd = to_mpd(pd)->pdn;
+ mr->live = 1;
+
unmap_dma:
up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
@@ -987,6 +1002,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
goto err_2;
}
mr->umem = umem;
+ mr->live = 1;
mlx5_vfree(in);
mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmr.key);
@@ -1064,10 +1080,47 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+ if (umem->odp_data) {
+ /*
+ * This barrier prevents the compiler from moving the
+ * setting of umem->odp_data->private to point to our
+ * MR, before reg_umr finished, to ensure that the MR
+ * initialization have finished before starting to
+ * handle invalidations.
+ */
+ smp_wmb();
+ mr->umem->odp_data->private = mr;
+ /*
+ * Make sure we will see the new
+ * umem->odp_data->private value in the invalidation
+ * routines, before we can get page faults on the
+ * MR. Page faults can happen once we put the MR in
+ * the tree, below this line. Without the barrier,
+ * there can be a fault handling and an invalidation
+ * before umem->odp_data->private == mr is visible to
+ * the invalidation handler.
+ */
+ smp_wmb();
+ }
+#endif
+
return &mr->ibmr;
error:
+ /*
+ * Destroy the umem *before* destroying the MR, to ensure we
+ * will not have any in-flight notifiers when destroying the
+ * MR.
+ *
+ * As the MR is completely invalid to begin with, and this
+ * error path is only taken if we can't push the mr entry into
+ * the pagefault tree, this is safe.
+ */
+
ib_umem_release(umem);
+ /* Kill the MR, and return an error code. */
+ clean_mr(mr);
return ERR_PTR(err);
}
@@ -1111,7 +1164,7 @@ static int clean_mr(struct mlx5_ib_mr *mr)
int err;
if (!umred) {
- err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+ err = destroy_mkey(dev, mr);
if (err) {
mlx5_ib_warn(dev, "failed to destroy mkey 0x%x (%d)\n",
mr->mmr.key, err);
@@ -1140,9 +1193,25 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
struct ib_umem *umem = mr->umem;
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
- if (umem)
+ if (umem && umem->odp_data) {
+ /* Prevent new page faults from succeeding */
+ mr->live = 0;
/* Wait for all running page-fault handlers to finish. */
synchronize_srcu(&dev->mr_srcu);
+ /* Destroy all page mappings */
+ mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
+ ib_umem_end(umem));
+ /*
+ * We kill the umem before the MR for ODP,
+ * so that there will not be any invalidations in
+ * flight, looking at the *mr struct.
+ */
+ ib_umem_release(umem);
+ atomic_sub(npages, &dev->mdev.priv.reg_pages);
+
+ /* Avoid double-freeing the umem. */
+ umem = NULL;
+ }
#endif
clean_mr(mr);
@@ -1259,7 +1328,7 @@ int mlx5_ib_destroy_mr(struct ib_mr *ibmr)
kfree(mr->sig);
}
- err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+ err = destroy_mkey(dev, mr);
if (err) {
mlx5_ib_warn(dev, "failed to destroy mkey 0x%x (%d)\n",
mr->mmr.key, err);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 56f2e3b..b8868ea 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -39,6 +39,71 @@
struct workqueue_struct *mlx5_ib_page_fault_wq;
+void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
+ unsigned long end)
+{
+ struct mlx5_ib_mr *mr;
+ const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT / sizeof(u64)) - 1;
+ u64 idx = 0, blk_start_idx = 0;
+ int in_block = 0;
+ u64 addr;
+
+ if (!umem || !umem->odp_data) {
+ pr_err("invalidation called on NULL umem or non-ODP umem\n");
+ return;
+ }
+
+ mr = umem->odp_data->private;
+
+ if (!mr || !mr->ibmr.pd)
+ return;
+
+ start = max_t(u64, ib_umem_start(umem), start);
+ end = min_t(u64, ib_umem_end(umem), end);
+
+ /*
+ * Iteration one - zap the HW's MTTs. The notifiers_count ensures that
+ * while we are doing the invalidation, no page fault will attempt to
+ * overwrite the same MTTs. Concurent invalidations might race us,
+ * but they will write 0s as well, so no difference in the end result.
+ */
+
+ for (addr = start; addr < end; addr += (u64)umem->page_size) {
+ idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
+ /*
+ * Strive to write the MTTs in chunks, but avoid overwriting
+ * non-existing MTTs. The huristic here can be improved to
+ * estimate the cost of another UMR vs. the cost of bigger
+ * UMR.
+ */
+ if (umem->odp_data->dma_list[idx] &
+ (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+ if (!in_block) {
+ blk_start_idx = idx;
+ in_block = 1;
+ }
+ } else {
+ u64 umr_offset = idx & umr_block_mask;
+ if (in_block && umr_offset == 0) {
+ mlx5_ib_update_mtt(mr, blk_start_idx,
+ idx - blk_start_idx, 1);
+ in_block = 0;
+ }
+ }
+ }
+ if (in_block)
+ mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1,
+ 1);
+
+ /*
+ * We are now sure that the device will not access the
+ * memory. We can safely unmap it, and mark it as dirty if
+ * needed.
+ */
+
+ ib_umem_odp_unmap_dma_pages(umem, start, end);
+}
+
#define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do { \
if (be32_to_cpu(reg.field_name) & MLX5_ODP_SUPPORT_##bit_name) \
ib_caps->field_name |= IB_ODP_SUPPORT_##bit_name; \
@@ -59,9 +124,16 @@ int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
if (err)
goto out;
- /* At this point we would copy the capability bits that the driver
- * supports from the hw_caps struct to the caps struct. However, no
- * such capabilities are supported so far. */
+ caps->general_caps = IB_ODP_SUPPORT;
+ COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.ud_odp_caps,
+ SEND);
+ COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+ SEND);
+ COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+ RECV);
+ COPY_ODP_BIT_MLX_TO_IB(hw_caps, caps, per_transport_caps.rc_odp_caps,
+ WRITE);
+
out:
return err;
}
@@ -80,8 +152,9 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
{
u32 base_key = mlx5_base_mkey(key);
struct mlx5_core_mr *mmr = __mlx5_mr_lookup(&dev->mdev, base_key);
+ struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
- if (!mmr || mmr->key != key)
+ if (!mmr || mmr->key != key || !mr->live)
return NULL;
return container_of(mmr, struct mlx5_ib_mr, mmr);
@@ -117,6 +190,7 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
{
struct mlx5_ib_dev *mib_dev = to_mdev(qp->ibqp.pd->device);
int srcu_key;
+ unsigned int current_seq;
u64 start_idx;
int npages = 0, ret = 0;
struct mlx5_ib_mr *mr;
@@ -147,6 +221,13 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
goto srcu_unlock;
}
+ current_seq = atomic_read(&mr->umem->odp_data->notifiers_seq);
+ /*
+ * Ensure the sequence number is valid for some time before we call
+ * gup.
+ */
+ smp_rmb();
+
/*
* Avoid branches - this code will perform correctly
* in all iterations (in iteration 2 and above,
@@ -161,7 +242,7 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
mr->umem->writable ?
(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT) :
ODP_READ_ALLOWED_BIT,
- atomic_read(&mr->umem->odp_data->notifiers_seq));
+ current_seq);
if (npages < 0) {
ret = npages;
goto srcu_unlock;
@@ -169,13 +250,19 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
if (npages > 0) {
mutex_lock(&mr->umem->odp_data->umem_mutex);
- /*
- * No need to check whether the MTTs really belong to
- * this MR, since ib_umem_odp_map_dma_pages already
- * checks this.
- */
- ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+ if (!ib_umem_mmu_notifier_retry(mr->umem, current_seq)) {
+ /*
+ * No need to check whether the MTTs really belong to
+ * this MR, since ib_umem_odp_map_dma_pages already
+ * checks this.
+ */
+ ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+ } else {
+ ret = -EAGAIN;
+ }
mutex_unlock(&mr->umem->odp_data->umem_mutex);
+ if (ret < 0)
+ goto srcu_unlock;
if (bytes_mapped) {
u32 new_mappings = npages * PAGE_SIZE -
@@ -190,6 +277,15 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
}
srcu_unlock:
+ if (ret == -EAGAIN) {
+ if (!mr->umem->odp_data->dying) {
+ struct ib_umem_odp *odp_data = mr->umem->odp_data;
+ wait_for_completion(&odp_data->notifier_completion);
+ } else {
+ /* The MR is being killed, kill the QP as well. */
+ ret = -EFAULT;
+ }
+ }
srcu_read_unlock(&mib_dev->mr_srcu, srcu_key);
pfault->mpfault.bytes_committed = 0;
return ret ? ret : npages;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RFC 00/20] On demand paging
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (19 preceding siblings ...)
2014-03-02 10:49 ` [RFC 20/20] IB/mlx5: Implement on demand paging by adding support for MMU notifiers Haggai Eran
@ 2014-03-13 12:00 ` Haggai Eran
2014-04-24 14:10 ` Or Gerlitz
21 siblings, 0 replies; 23+ messages in thread
From: Haggai Eran @ 2014-03-13 12:00 UTC (permalink / raw)
To: Roland Dreier
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz, Sagi Grimberg,
Shachar Raindel, Liran Liss
On 02/03/2014 12:49, Haggai Eran wrote:
> The following set of patches implements on-demand paging (ODP) support
> in the RDMA stack and in the mlx5_ib Infiniband driver.
Hi Roland,
Did you get a chance to look at this patchset? I'll be happy to hear
your comments or answers any questions you may have.
I also wanted to point out that some of the patches in the set may be
picked up separately as fixes, and perhaps that would make the set
shorter and easier to review.
Specifically, I think the following patches can be taken independently
today:
> IB/mlx5: Fix error handling in reg_umr
> IB/mlx5: Add MR to radix tree in reg_mr_callback
> mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR
> IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs
> IB/mlx5: Refactor UMR to have its own context struct
Best regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC 00/20] On demand paging
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
` (20 preceding siblings ...)
2014-03-13 12:00 ` [RFC 00/20] On demand paging Haggai Eran
@ 2014-04-24 14:10 ` Or Gerlitz
21 siblings, 0 replies; 23+ messages in thread
From: Or Gerlitz @ 2014-04-24 14:10 UTC (permalink / raw)
To: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Christoph Lameter
Cc: Roland Dreier, Andrea Arcangeli, Sagi Grimberg, Shachar Raindel,
Liran Liss
On 02/03/2014 12:49, Haggai Eran wrote:
> The following set of patches implements on-demand paging (ODP) support
> in the RDMA stack and in the mlx5_ib Infiniband driver.
>
I've placed the latest cut of the ODP patches on my public git tree @
git://beany.openfabrics.org/~ogerlitz/linux-2.6.git odp
this is actually V0.1 with the following changes:
- Rebase against v3.15-rc2
- Removed dependency on patches that were accepted upstream
- Changed use of compound_trans_head to compound_head as the latter was
removed in 3.14
f5d7fc1 IB/mlx5: Implement on demand paging by adding support for MMU
notifiers
09eae22 IB/mlx5: Add support for RDMA write responder page faults
ca84a78 IB/mlx5: Handle page faults
302a6ea IB/mlx5: Page faults handling infrastructure
df792a8 IB/mlx5: Add function to read WQE from user-space
1b4f69b IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
bc5a6b0 IB/mlx5: Changes in memory region creation to support on-demand
paging
335c8ef IB/mlx5: Implement the ODP capability query verb
5a67390 net/mlx5_core: Add support for page faults events and low level
handling
11450c4 IB/mlx5: Refactor UMR to have its own context struct
51deb37 IB/mlx5: Enhance UMR support to allow partial page table update
107bc64 IB/mlx5: Set QP offsets and parameters for user QPs and not just
for kernel QPs
e91a314 mlx5: Store MR attributes in mlx5_mr_core during creation and
after UMR
894f946 IB/mlx5: Add MR to radix tree in reg_mr_callback
9283891 IB/mlx5: Fix error handling in reg_umr
467f4e7 IB/core: Implement support for MMU notifiers regarding on demand
paging regions
c98a42e IB/core: Add support for on demand paging regions
8fb5241 IB/core: Add umem function to read data from user-space
16c9cf0 IB/core: Replace ib_umem's offset field with a full address
9f0d8b5 IB/core: Add flags for on demand paging support
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2014-04-24 14:10 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-02 10:49 [RFC 00/20] On demand paging Haggai Eran
[not found] ` <1393757378-16412-1-git-send-email-haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-03-02 10:49 ` [RFC 01/20] IB/core: Add flags for on demand paging support Haggai Eran
2014-03-02 10:49 ` [RFC 02/20] IB/core: Replace ib_umem's offset field with a full address Haggai Eran
2014-03-02 10:49 ` [RFC 03/20] IB/core: Add umem function to read data from user-space Haggai Eran
2014-03-02 10:49 ` [RFC 04/20] IB/core: Add support for on demand paging regions Haggai Eran
2014-03-02 10:49 ` [RFC 05/20] IB/core: Implement support for MMU notifiers regarding " Haggai Eran
2014-03-02 10:49 ` [RFC 06/20] IB/mlx5: Fix error handling in reg_umr Haggai Eran
2014-03-02 10:49 ` [RFC 07/20] IB/mlx5: Add MR to radix tree in reg_mr_callback Haggai Eran
2014-03-02 10:49 ` [RFC 08/20] mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR Haggai Eran
2014-03-02 10:49 ` [RFC 09/20] IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs Haggai Eran
2014-03-02 10:49 ` [RFC 10/20] IB/mlx5: Enhance UMR support to allow partial page table update Haggai Eran
2014-03-02 10:49 ` [RFC 11/20] IB/mlx5: Refactor UMR to have its own context struct Haggai Eran
2014-03-02 10:49 ` [RFC 12/20] net/mlx5_core: Add support for page faults events and low level handling Haggai Eran
2014-03-02 10:49 ` [RFC 13/20] IB/mlx5: Implement the ODP capability query verb Haggai Eran
2014-03-02 10:49 ` [RFC 14/20] IB/mlx5: Changes in memory region creation to support on-demand paging Haggai Eran
2014-03-02 10:49 ` [RFC 15/20] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation Haggai Eran
2014-03-02 10:49 ` [RFC 16/20] IB/mlx5: Add function to read WQE from user-space Haggai Eran
2014-03-02 10:49 ` [RFC 17/20] IB/mlx5: Page faults handling infrastructure Haggai Eran
2014-03-02 10:49 ` [RFC 18/20] IB/mlx5: Handle page faults Haggai Eran
2014-03-02 10:49 ` [RFC 19/20] IB/mlx5: Add support for RDMA write responder " Haggai Eran
2014-03-02 10:49 ` [RFC 20/20] IB/mlx5: Implement on demand paging by adding support for MMU notifiers Haggai Eran
2014-03-13 12:00 ` [RFC 00/20] On demand paging Haggai Eran
2014-04-24 14:10 ` Or Gerlitz
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox