* [RFC PATCH 0/4] ibverbs: new verbs for registering I/O memory
@ 2010-07-29 16:24 Tom Tucker
From: Tom Tucker @ 2010-07-29 16:24 UTC (permalink / raw)
To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, brandt-4OHPYypu0djtX7QSmKvirg,
tom-/Yg/VP3ZvrM, swise-/Yg/VP3ZvrM
The following patches add verbs for registering I/O memory from user-space.
This capability allows device memory to be registered. More specifically,
any VM_PFNMAP vma can be registered.
The mmap service is used to obtain the address of this memory and provide it
to user space. This is where any security policy would be implemented.
The ib_iomem_get service requires that any address provided to the service
be in a VMA owned by the calling process. This precludes passing 'random'
addresses to the service to gain access to arbitrary memory locations.
---
Tom Tucker (4):
mthca: Add support for reg_io_mr and unreg_io_mr
uverbs_cmd: Add uverbs command definitions for reg_io_mr
uverbs: Add common ib_iomem_get service
ibverbs: Add new provider verb for I/O memory registration
drivers/infiniband/core/umem.c | 248 +++++++++++++++++++++++++-
drivers/infiniband/core/uverbs.h | 2
drivers/infiniband/core/uverbs_cmd.c | 140 +++++++++++++++
drivers/infiniband/core/uverbs_main.c | 2
drivers/infiniband/hw/mthca/mthca_provider.c | 111 ++++++++++++
include/rdma/ib_umem.h | 14 +
include/rdma/ib_user_verbs.h | 24 ++-
include/rdma/ib_verbs.h | 5 +
8 files changed, 534 insertions(+), 12 deletions(-)
--
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [RFC PATCH 1/4] ibverbs: Add new provider verb for I/O memory registration
From: Tom Tucker @ 2010-07-29 16:25 UTC (permalink / raw)
To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, brandt-4OHPYypu0djtX7QSmKvirg,
tom-/Yg/VP3ZvrM, swise-/Yg/VP3ZvrM
From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Add a function pointer for the provider's reg_io_mr
method.
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---
include/rdma/ib_verbs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f3e8f3c..5034ac9 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1096,6 +1096,11 @@ struct ib_device {
u64 virt_addr,
int mr_access_flags,
struct ib_udata *udata);
+ struct ib_mr * (*reg_io_mr)(struct ib_pd *pd,
+ u64 start, u64 length,
+ u64 virt_addr,
+ int mr_access_flags,
+ struct ib_udata *udata);
int (*query_mr)(struct ib_mr *mr,
struct ib_mr_attr *mr_attr);
int (*dereg_mr)(struct ib_mr *mr);
--
* [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
From: Tom Tucker @ 2010-07-29 16:25 UTC (permalink / raw)
To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, brandt-4OHPYypu0djtX7QSmKvirg,
tom-/Yg/VP3ZvrM, swise-/Yg/VP3ZvrM
From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Add an ib_iomem_get service that converts a vma to an array of
physical addresses. This makes it easier for each device driver to
add support for the reg_io_mr provider method.
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---
drivers/infiniband/core/umem.c | 248 ++++++++++++++++++++++++++++++++++++++--
include/rdma/ib_umem.h | 14 ++
2 files changed, 251 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 415e186..f103956 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -52,16 +52,18 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
int i;
list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
- ib_dma_unmap_sg(dev, chunk->page_list,
- chunk->nents, DMA_BIDIRECTIONAL);
- for (i = 0; i < chunk->nents; ++i) {
- struct page *page = sg_page(&chunk->page_list[i]);
-
- if (umem->writable && dirty)
- set_page_dirty_lock(page);
- put_page(page);
- }
+ if (umem->type == IB_UMEM_MEM_MAP) {
+ ib_dma_unmap_sg(dev, chunk->page_list,
+ chunk->nents, DMA_BIDIRECTIONAL);
+ for (i = 0; i < chunk->nents; ++i) {
+ struct page *page =
+ sg_page(&chunk->page_list[i]);
+ if (umem->writable && dirty)
+ set_page_dirty_lock(page);
+ put_page(page);
+ }
+ }
kfree(chunk);
}
}
@@ -150,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
while (npages) {
ret = get_user_pages(current, current->mm, cur_base,
min_t(unsigned long, npages,
- PAGE_SIZE / sizeof (struct page *)),
+ PAGE_SIZE / sizeof(struct page *)),
1, !umem->writable, page_list, vma_list);
if (ret < 0)
@@ -162,7 +164,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
off = 0;
while (ret) {
- chunk = kmalloc(sizeof *chunk + sizeof (struct scatterlist) *
+ chunk = kmalloc(sizeof *chunk +
+ sizeof(struct scatterlist) *
min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
GFP_KERNEL);
if (!chunk) {
@@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem)
return n;
}
EXPORT_SYMBOL(ib_umem_page_count);
+/*
+ * Return the PFN for the specified address in the vma. This only
+ * works for a vma that is VM_PFNMAP.
+ */
+static unsigned long follow_io_pfn(struct vm_area_struct *vma,
+ unsigned long address, int write)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
+ spinlock_t *ptl;
+ unsigned long pfn;
+ struct mm_struct *mm = vma->vm_mm;
+
+ BUG_ON(0 == (vma->vm_flags & VM_PFNMAP));
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ return 0;
+
+ pud = pud_offset(pgd, address);
+ if (pud_none(*pud))
+ return 0;
+ if (unlikely(pud_bad(*pud)))
+ return 0;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ return 0;
+ if (unlikely(pmd_bad(*pmd)))
+ return 0;
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ pte = *ptep;
+ if (!pte_present(pte))
+ goto bad;
+ if (write && !pte_write(pte))
+ goto bad;
+
+ pfn = pte_pfn(pte);
+ pte_unmap_unlock(ptep, ptl);
+ return pfn;
+ bad:
+ pte_unmap_unlock(ptep, ptl);
+ return 0;
+}
+
+int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, int len, int write, int force,
+ unsigned long *pfn_list, struct vm_area_struct **vmas)
+{
+ unsigned long pfn;
+ int i;
+ if (len <= 0)
+ return 0;
+
+ i = 0;
+ do {
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, start);
+ if (0 == (vma->vm_flags & VM_PFNMAP))
+ return -EINVAL;
+
+ if (0 == (vma->vm_flags & VM_IO))
+ return -EFAULT;
+
+ if (is_vm_hugetlb_page(vma))
+ return -EFAULT;
+
+ do {
+ cond_resched();
+ pfn = follow_io_pfn(vma, start, write);
+ if (!pfn)
+ return -EFAULT;
+ if (pfn_list)
+ pfn_list[i] = pfn;
+ if (vmas)
+ vmas[i] = vma;
+ i++;
+ start += PAGE_SIZE;
+ len--;
+ } while (len && start < vma->vm_end);
+ } while (len);
+ return i;
+}
+
+/**
+ * ib_iomem_get - DMA map a userspace map of IO memory.
+ * @context: userspace context to map memory for
+ * @addr: userspace virtual address to start at
+ * @size: length of region to map
+ * @access: IB_ACCESS_xxx flags for memory being mapped
+ * @dmasync: flush in-flight DMA when the memory region is written
+ */
+struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
+ size_t size, int access, int dmasync)
+{
+ struct ib_umem *umem;
+ unsigned long *pfn_list;
+ struct ib_umem_chunk *chunk;
+ unsigned long locked;
+ unsigned long lock_limit;
+ unsigned long cur_base;
+ unsigned long npages;
+ int ret;
+ int off;
+ int i;
+ DEFINE_DMA_ATTRS(attrs);
+
+ if (dmasync)
+ dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
+
+ if (!can_do_mlock())
+ return ERR_PTR(-EPERM);
+
+ umem = kmalloc(sizeof *umem, GFP_KERNEL);
+ if (!umem)
+ return ERR_PTR(-ENOMEM);
+
+ umem->type = IB_UMEM_IO_MAP;
+ umem->context = context;
+ umem->length = size;
+ umem->offset = addr & ~PAGE_MASK;
+ umem->page_size = PAGE_SIZE;
+ /*
+ * We ask for writable memory if any access flags other than
+ * "remote read" are set. "Local write" and "remote write"
+ * obviously require write access. "Remote atomic" can do
+ * things like fetch and add, which will modify memory, and
+ * "MW bind" can change permissions by binding a window.
+ */
+ umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
+
+ /* IO memory is not hugetlb memory */
+ umem->hugetlb = 0;
+
+ INIT_LIST_HEAD(&umem->chunk_list);
+
+ pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
+ if (!pfn_list) {
+ kfree(umem);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
+
+ down_write(&current->mm->mmap_sem);
+
+ locked = npages + current->mm->locked_vm;
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+ >> PAGE_SHIFT;
+
+ if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ cur_base = addr & PAGE_MASK;
+
+ ret = 0;
+ while (npages) {
+ ret = ib_get_io_pfn(current, current->mm, cur_base,
+ min_t(unsigned long, npages,
+ PAGE_SIZE / sizeof(unsigned long *)),
+ umem->writable,
+ !umem->writable, pfn_list, NULL);
+
+ if (ret < 0)
+ goto out;
+
+ cur_base += ret * PAGE_SIZE;
+ npages -= ret;
+
+ off = 0;
+
+ while (ret) {
+ chunk = kmalloc(sizeof *chunk +
+ sizeof(struct scatterlist) *
+ min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
+ GFP_KERNEL);
+ if (!chunk) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
+ sg_init_table(chunk->page_list, chunk->nents);
+ /* The pfn_list we built is a set of Page
+ * Frame Numbers (PFN) whose physical address
+ * is PFN << PAGE_SHIFT. The SG DMA mapping
+ * services expect page addresses, not PFN,
+ * therefore, we have to do the dma mapping
+ * ourselves here. */
+ for (i = 0; i < chunk->nents; ++i) {
+ sg_set_page(&chunk->page_list[i], 0,
+ PAGE_SIZE, 0);
+ chunk->page_list[i].dma_address =
+ (pfn_list[i] << PAGE_SHIFT);
+ chunk->page_list[i].dma_length = PAGE_SIZE;
+ }
+ chunk->nmap = chunk->nents;
+ ret -= chunk->nents;
+ off += chunk->nents;
+ list_add_tail(&chunk->list, &umem->chunk_list);
+ }
+
+ ret = 0;
+ }
+
+out:
+ if (ret < 0) {
+ __ib_umem_release(context->device, umem, 0);
+ kfree(umem);
+ } else
+ current->mm->locked_vm = locked;
+ up_write(&current->mm->mmap_sem);
+ free_page((unsigned long) pfn_list);
+
+ return ret < 0 ? ERR_PTR(ret) : umem;
+}
+EXPORT_SYMBOL(ib_iomem_get);
+
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 9ee0d2e..2c64d82 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -39,8 +39,14 @@
struct ib_ucontext;
+enum ib_umem_type {
+ IB_UMEM_MEM_MAP = 0,
+ IB_UMEM_IO_MAP = 1
+};
+
struct ib_umem {
struct ib_ucontext *context;
+ enum ib_umem_type type;
size_t length;
int offset;
int page_size;
@@ -61,6 +67,8 @@ struct ib_umem_chunk {
#ifdef CONFIG_INFINIBAND_USER_MEM
+struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
+ size_t size, int access, int dmasync);
struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
size_t size, int access, int dmasync);
void ib_umem_release(struct ib_umem *umem);
@@ -70,6 +78,12 @@ int ib_umem_page_count(struct ib_umem *umem);
#include <linux/err.h>
+static inline struct ib_umem *ib_iomem_get(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, int dmasync) {
+ return ERR_PTR(-EINVAL);
+}
+
+
static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
unsigned long addr, size_t size,
int access, int dmasync) {
--
* [RFC PATCH 3/4] uverbs_cmd: Add uverbs command definitions for reg_io_mr
From: Tom Tucker @ 2010-07-29 16:25 UTC (permalink / raw)
To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, brandt-4OHPYypu0djtX7QSmKvirg,
tom-/Yg/VP3ZvrM, swise-/Yg/VP3ZvrM
From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Add the uverbs command definitions, along with the marshalling and
unmarshalling routines that handle requests from user space.
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---
drivers/infiniband/core/uverbs.h | 2
drivers/infiniband/core/uverbs_cmd.c | 140 +++++++++++++++++++++++++++++++++
drivers/infiniband/core/uverbs_main.c | 2
include/rdma/ib_user_verbs.h | 24 +++++-
4 files changed, 167 insertions(+), 1 deletions(-)
diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index a078e56..17c0828 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -195,5 +195,7 @@ IB_UVERBS_DECLARE_CMD(create_srq);
IB_UVERBS_DECLARE_CMD(modify_srq);
IB_UVERBS_DECLARE_CMD(query_srq);
IB_UVERBS_DECLARE_CMD(destroy_srq);
+IB_UVERBS_DECLARE_CMD(reg_io_mr);
+IB_UVERBS_DECLARE_CMD(dereg_io_mr);
#endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6fcfbeb..8e72a50 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -716,6 +716,146 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
return in_len;
}
+ssize_t ib_uverbs_reg_io_mr(struct ib_uverbs_file *file,
+ const char __user *buf, int in_len,
+ int out_len)
+{
+ struct ib_uverbs_reg_io_mr cmd;
+ struct ib_uverbs_reg_io_mr_resp resp;
+ struct ib_udata udata;
+ struct ib_uobject *uobj;
+ struct ib_pd *pd;
+ struct ib_mr *mr;
+ int ret;
+
+ if (out_len < sizeof resp)
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, buf, sizeof cmd))
+ return -EFAULT;
+
+ INIT_UDATA(&udata, buf + sizeof cmd,
+ (unsigned long) cmd.response + sizeof resp,
+ in_len - sizeof cmd, out_len - sizeof resp);
+
+ if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
+ return -EINVAL;
+
+ /*
+ * Local write permission is required if remote write or
+ * remote atomic permission is also requested.
+ */
+ if ((cmd.access_flags &
+ (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE)) &&
+ !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE))
+ return -EINVAL;
+
+ uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+ if (!uobj)
+ return -ENOMEM;
+
+ init_uobj(uobj, 0, file->ucontext, &mr_lock_key);
+ down_write(&uobj->mutex);
+
+ pd = idr_read_pd(cmd.pd_handle, file->ucontext);
+ if (!pd) {
+ ret = -EINVAL;
+ goto err_free;
+ }
+
+ mr = pd->device->reg_io_mr(pd, cmd.start, cmd.length, cmd.hca_va,
+ cmd.access_flags, &udata);
+ if (IS_ERR(mr)) {
+ ret = PTR_ERR(mr);
+ goto err_put;
+ }
+
+ mr->device = pd->device;
+ mr->pd = pd;
+ mr->uobject = uobj;
+ atomic_inc(&pd->usecnt);
+ atomic_set(&mr->usecnt, 0);
+
+ uobj->object = mr;
+ ret = idr_add_uobj(&ib_uverbs_mr_idr, uobj);
+ if (ret)
+ goto err_unreg;
+
+ memset(&resp, 0, sizeof resp);
+ resp.lkey = mr->lkey;
+ resp.rkey = mr->rkey;
+ resp.mr_handle = uobj->id;
+
+ if (copy_to_user((void __user *) (unsigned long) cmd.response,
+ &resp, sizeof resp)) {
+ ret = -EFAULT;
+ goto err_copy;
+ }
+
+ put_pd_read(pd);
+
+ mutex_lock(&file->mutex);
+ list_add_tail(&uobj->list, &file->ucontext->mr_list);
+ mutex_unlock(&file->mutex);
+
+ uobj->live = 1;
+
+ up_write(&uobj->mutex);
+
+ return in_len;
+
+err_copy:
+ idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
+
+err_unreg:
+ ib_dereg_mr(mr);
+
+err_put:
+ put_pd_read(pd);
+
+err_free:
+ put_uobj_write(uobj);
+ return ret;
+}
+
+ssize_t ib_uverbs_dereg_io_mr(struct ib_uverbs_file *file,
+ const char __user *buf, int in_len,
+ int out_len)
+{
+ struct ib_uverbs_dereg_io_mr cmd;
+ struct ib_mr *mr;
+ struct ib_uobject *uobj;
+ int ret = -EINVAL;
+
+ if (copy_from_user(&cmd, buf, sizeof cmd))
+ return -EFAULT;
+
+ uobj = idr_write_uobj(&ib_uverbs_mr_idr, cmd.mr_handle, file->ucontext);
+ if (!uobj)
+ return -EINVAL;
+
+ mr = uobj->object;
+
+ ret = ib_dereg_mr(mr);
+ if (!ret)
+ uobj->live = 0;
+
+ put_uobj_write(uobj);
+
+ if (ret)
+ return ret;
+
+ idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
+
+ mutex_lock(&file->mutex);
+ list_del(&uobj->list);
+ mutex_unlock(&file->mutex);
+
+ put_uobj(uobj);
+
+ return in_len;
+}
+
ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file,
const char __user *buf, int in_len,
int out_len)
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index ec83e9f..2b4e110 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -86,6 +86,8 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
[IB_USER_VERBS_CMD_DEALLOC_PD] = ib_uverbs_dealloc_pd,
[IB_USER_VERBS_CMD_REG_MR] = ib_uverbs_reg_mr,
[IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr,
+ [IB_USER_VERBS_CMD_REG_IO_MR] = ib_uverbs_reg_io_mr,
+ [IB_USER_VERBS_CMD_DEREG_IO_MR] = ib_uverbs_dereg_io_mr,
[IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL] = ib_uverbs_create_comp_channel,
[IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq,
[IB_USER_VERBS_CMD_RESIZE_CQ] = ib_uverbs_resize_cq,
diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h
index a17f771..2e1f7a0 100644
--- a/include/rdma/ib_user_verbs.h
+++ b/include/rdma/ib_user_verbs.h
@@ -81,7 +81,9 @@ enum {
IB_USER_VERBS_CMD_MODIFY_SRQ,
IB_USER_VERBS_CMD_QUERY_SRQ,
IB_USER_VERBS_CMD_DESTROY_SRQ,
- IB_USER_VERBS_CMD_POST_SRQ_RECV
+ IB_USER_VERBS_CMD_POST_SRQ_RECV,
+ IB_USER_VERBS_CMD_REG_IO_MR,
+ IB_USER_VERBS_CMD_DEREG_IO_MR,
};
/*
@@ -241,6 +243,26 @@ struct ib_uverbs_dereg_mr {
__u32 mr_handle;
};
+struct ib_uverbs_reg_io_mr {
+ __u64 response;
+ __u64 start;
+ __u64 length;
+ __u64 hca_va;
+ __u32 pd_handle;
+ __u32 access_flags;
+ __u64 driver_data[0];
+};
+
+struct ib_uverbs_reg_io_mr_resp {
+ __u32 mr_handle;
+ __u32 lkey;
+ __u32 rkey;
+};
+
+struct ib_uverbs_dereg_io_mr {
+ __u32 mr_handle;
+};
+
struct ib_uverbs_create_comp_channel {
__u64 response;
};
--
* [RFC PATCH 4/4] mthca: Add support for reg_io_mr and unreg_io_mr
From: Tom Tucker @ 2010-07-29 16:25 UTC (permalink / raw)
To: rdreier-FYB4Gu1CFyUAvxtiuMwx3w
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, brandt-4OHPYypu0djtX7QSmKvirg,
tom-/Yg/VP3ZvrM, swise-/Yg/VP3ZvrM
From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Add support for the new I/O memory registration verbs to
the mthca driver.
Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
---
drivers/infiniband/hw/mthca/mthca_provider.c | 111 ++++++++++++++++++++++++++
1 files changed, 111 insertions(+), 0 deletions(-)
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 1e0b4b6..3d3e871 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -900,6 +900,114 @@ static inline u32 convert_access(int acc)
MTHCA_MPT_FLAG_LOCAL_READ;
}
+static struct ib_mr *mthca_reg_io_mr(struct ib_pd *pd, u64 start, u64 length,
+ u64 virt, int acc, struct ib_udata *udata)
+{
+ struct mthca_dev *dev = to_mdev(pd->device);
+ struct ib_umem_chunk *chunk;
+ struct mthca_mr *mr;
+ struct mthca_reg_mr ucmd;
+ u64 *pages;
+ int shift, n, len;
+ int i, j, k;
+ int err = 0;
+ int write_mtt_size;
+
+ if (udata->inlen - sizeof(struct ib_uverbs_cmd_hdr) < sizeof ucmd) {
+ if (!to_mucontext(pd->uobject->context)->reg_mr_warned) {
+ mthca_warn(dev,
+ "Process '%s' did not pass in MR attrs.\n",
+ current->comm);
+ mthca_warn(dev, " Update libmthca to fix this.\n");
+ }
+ ++to_mucontext(pd->uobject->context)->reg_mr_warned;
+ ucmd.mr_attrs = 0;
+ } else if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+ return ERR_PTR(-EFAULT);
+
+ mr = kmalloc(sizeof *mr, GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ mr->umem = ib_iomem_get(pd->uobject->context, start, length, acc,
+ ucmd.mr_attrs & MTHCA_MR_DMASYNC);
+
+ if (IS_ERR(mr->umem)) {
+ err = PTR_ERR(mr->umem);
+ goto err;
+ }
+
+ shift = ffs(mr->umem->page_size) - 1;
+
+ n = 0;
+ list_for_each_entry(chunk, &mr->umem->chunk_list, list)
+ n += chunk->nents;
+
+ mr->mtt = mthca_alloc_mtt(dev, n);
+ if (IS_ERR(mr->mtt)) {
+ err = PTR_ERR(mr->mtt);
+ goto err_umem;
+ }
+
+ pages = (u64 *) __get_free_page(GFP_KERNEL);
+ if (!pages) {
+ err = -ENOMEM;
+ goto err_mtt;
+ }
+
+ i = n = 0;
+
+ write_mtt_size = min(mthca_write_mtt_size(dev),
+ (int) (PAGE_SIZE / sizeof *pages));
+
+ list_for_each_entry(chunk, &mr->umem->chunk_list, list)
+ for (j = 0; j < chunk->nmap; ++j) {
+ len = sg_dma_len(&chunk->page_list[j]) >> shift;
+ for (k = 0; k < len; ++k) {
+ pages[i++] =
+ sg_dma_address(&chunk->page_list[j]) +
+ mr->umem->page_size * k;
+ /*
+ * Be friendly to write_mtt and pass it chunks
+ * of appropriate size.
+ */
+ if (i == write_mtt_size) {
+ err = mthca_write_mtt(dev, mr->mtt, n,
+ pages, i);
+ if (err)
+ goto mtt_done;
+ n += i;
+ i = 0;
+ }
+ }
+ }
+
+ if (i)
+ err = mthca_write_mtt(dev, mr->mtt, n, pages, i);
+mtt_done:
+ free_page((unsigned long) pages);
+ if (err)
+ goto err_mtt;
+
+ err = mthca_mr_alloc(dev, to_mpd(pd)->pd_num, shift, virt, length,
+ convert_access(acc), mr);
+
+ if (err)
+ goto err_mtt;
+
+ return &mr->ibmr;
+
+err_mtt:
+ mthca_free_mtt(dev, mr->mtt);
+
+err_umem:
+ ib_umem_release(mr->umem);
+
+err:
+ kfree(mr);
+ return ERR_PTR(err);
+}
+
static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc)
{
struct mthca_mr *mr;
@@ -1318,6 +1426,8 @@ int mthca_register_device(struct mthca_dev *dev)
(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
(1ull << IB_USER_VERBS_CMD_REG_MR) |
(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+ (1ull << IB_USER_VERBS_CMD_REG_IO_MR) |
+ (1ull << IB_USER_VERBS_CMD_DEREG_IO_MR) |
(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
(1ull << IB_USER_VERBS_CMD_RESIZE_CQ) |
@@ -1376,6 +1486,7 @@ int mthca_register_device(struct mthca_dev *dev)
dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr;
dev->ib_dev.reg_user_mr = mthca_reg_user_mr;
dev->ib_dev.dereg_mr = mthca_dereg_mr;
+ dev->ib_dev.reg_io_mr = mthca_reg_io_mr;
if (dev->mthca_flags & MTHCA_FLAG_FMR) {
dev->ib_dev.alloc_fmr = mthca_alloc_fmr;
--
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
From: Ralph Campbell @ 2010-07-29 18:22 UTC (permalink / raw)
To: Tom Tucker
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On Thu, 2010-07-29 at 09:25 -0700, Tom Tucker wrote:
> From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>
> Add an ib_iomem_get service that converts a vma to an array of
> physical addresses. This makes it easier for each device driver to
> add support for the reg_io_mr provider method.
>
> Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> ---
>
> drivers/infiniband/core/umem.c | 248 ++++++++++++++++++++++++++++++++++++++--
> include/rdma/ib_umem.h | 14 ++
> 2 files changed, 251 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 415e186..f103956 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
...
> @@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem)
> return n;
> }
> EXPORT_SYMBOL(ib_umem_page_count);
> +/*
> + * Return the PFN for the specified address in the vma. This only
> + * works for a vma that is VM_PFNMAP.
> + */
> +static unsigned long follow_io_pfn(struct vm_area_struct *vma,
> + unsigned long address, int write)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep, pte;
> + spinlock_t *ptl;
> + unsigned long pfn;
> + struct mm_struct *mm = vma->vm_mm;
> +
> + BUG_ON(0 == (vma->vm_flags & VM_PFNMAP));
Why use BUG_ON?
WARN_ON is more appropriate but
if (!(vma->vm_flags & VM_PFNMAP))
return 0;
seems better.
In fact, move it outside the inner do loop in ib_get_io_pfn().
> + pgd = pgd_offset(mm, address);
> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> + return 0;
> +
> + pud = pud_offset(pgd, address);
> + if (pud_none(*pud))
> + return 0;
> + if (unlikely(pud_bad(*pud)))
> + return 0;
> +
> + pmd = pmd_offset(pud, address);
> + if (pmd_none(*pmd))
> + return 0;
> + if (unlikely(pmd_bad(*pmd)))
> + return 0;
> +
> + ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> + pte = *ptep;
> + if (!pte_present(pte))
> + goto bad;
> + if (write && !pte_write(pte))
> + goto bad;
> +
> + pfn = pte_pfn(pte);
> + pte_unmap_unlock(ptep, ptl);
> + return pfn;
> + bad:
> + pte_unmap_unlock(ptep, ptl);
> + return 0;
> +}
> +
> +int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, int len, int write, int force,
> + unsigned long *pfn_list, struct vm_area_struct **vmas)
> +{
> + unsigned long pfn;
> + int i;
> + if (len <= 0)
> + return 0;
> +
> + i = 0;
> + do {
> + struct vm_area_struct *vma;
> +
> + vma = find_vma(mm, start);
> + if (0 == (vma->vm_flags & VM_PFNMAP))
> + return -EINVAL;
Style nit: I would use ! instead of 0 ==
> + if (0 == (vma->vm_flags & VM_IO))
> + return -EFAULT;
> +
> + if (is_vm_hugetlb_page(vma))
> + return -EFAULT;
> +
> + do {
> + cond_resched();
> + pfn = follow_io_pfn(vma, start, write);
> + if (!pfn)
> + return -EFAULT;
> + if (pfn_list)
> + pfn_list[i] = pfn;
> + if (vmas)
> + vmas[i] = vma;
> + i++;
> + start += PAGE_SIZE;
> + len--;
> + } while (len && start < vma->vm_end);
> + } while (len);
> + return i;
> +}
> +
> +/**
> + * ib_iomem_get - DMA map a userspace map of IO memory.
> + * @context: userspace context to map memory for
> + * @addr: userspace virtual address to start at
> + * @size: length of region to map
> + * @access: IB_ACCESS_xxx flags for memory being mapped
> + * @dmasync: flush in-flight DMA when the memory region is written
> + */
> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
> + size_t size, int access, int dmasync)
> +{
> + struct ib_umem *umem;
> + unsigned long *pfn_list;
> + struct ib_umem_chunk *chunk;
> + unsigned long locked;
> + unsigned long lock_limit;
> + unsigned long cur_base;
> + unsigned long npages;
> + int ret;
> + int off;
> + int i;
> + DEFINE_DMA_ATTRS(attrs);
> +
> + if (dmasync)
> + dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
> +
> + if (!can_do_mlock())
> + return ERR_PTR(-EPERM);
> +
> + umem = kmalloc(sizeof *umem, GFP_KERNEL);
> + if (!umem)
> + return ERR_PTR(-ENOMEM);
> +
> + umem->type = IB_UMEM_IO_MAP;
> + umem->context = context;
> + umem->length = size;
> + umem->offset = addr & ~PAGE_MASK;
> + umem->page_size = PAGE_SIZE;
> + /*
> + * We ask for writable memory if any access flags other than
> + * "remote read" are set. "Local write" and "remote write"
> + * obviously require write access. "Remote atomic" can do
> + * things like fetch and add, which will modify memory, and
> + * "MW bind" can change permissions by binding a window.
> + */
> + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
> +
> + /* IO memory is not hugetlb memory */
> + umem->hugetlb = 0;
> +
> + INIT_LIST_HEAD(&umem->chunk_list);
> +
> + pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
> + if (!pfn_list) {
> + kfree(umem);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
> +
> + down_write(&current->mm->mmap_sem);
> +
> + locked = npages + current->mm->locked_vm;
> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> + >> PAGE_SHIFT;
I think the current kernel code is supposed to look like:
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + cur_base = addr & PAGE_MASK;
> +
> + ret = 0;
> + while (npages) {
> + ret = ib_get_io_pfn(current, current->mm, cur_base,
> + min_t(unsigned long, npages,
> + PAGE_SIZE / sizeof(unsigned long *)),
> + umem->writable,
> + !umem->writable, pfn_list, NULL);
> +
> + if (ret < 0)
> + goto out;
> +
> + cur_base += ret * PAGE_SIZE;
> + npages -= ret;
> +
> + off = 0;
> +
> + while (ret) {
> + chunk = kmalloc(sizeof *chunk +
> + sizeof(struct scatterlist) *
> + min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
> + GFP_KERNEL);
> + if (!chunk) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
> + sg_init_table(chunk->page_list, chunk->nents);
> + /* The pfn_list we built is a set of Page
> + * Frame Numbers (PFN) whose physical address
> + * is PFN << PAGE_SHIFT. The SG DMA mapping
> + * services expect page addresses, not PFN,
> + * therefore, we have to do the dma mapping
> + * ourselves here. */
> + for (i = 0; i < chunk->nents; ++i) {
> + sg_set_page(&chunk->page_list[i], 0,
> + PAGE_SIZE, 0);
> + chunk->page_list[i].dma_address =
> + (pfn_list[i] << PAGE_SHIFT);
> + chunk->page_list[i].dma_length = PAGE_SIZE;
dma_length isn't present if CONFIG_NEED_SG_DMA_LENGTH is undefined.
In fact, the kmalloc() should probably be kzalloc since the other
struct scatterlist entries are uninitialized.
> + }
> + chunk->nmap = chunk->nents;
> + ret -= chunk->nents;
> + off += chunk->nents;
> + list_add_tail(&chunk->list, &umem->chunk_list);
> + }
> +
> + ret = 0;
This is redundant since ret == 0 at the end of the loop.
> + }
> +
> +out:
> + if (ret < 0) {
> + __ib_umem_release(context->device, umem, 0);
> + kfree(umem);
> + } else
> + current->mm->locked_vm = locked;
> + up_write(&current->mm->mmap_sem);
> + free_page((unsigned long) pfn_list);
> +
> + return ret < 0 ? ERR_PTR(ret) : umem;
> +}
> +EXPORT_SYMBOL(ib_iomem_get);
> +
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 9ee0d2e..2c64d82 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -39,8 +39,14 @@
>
> struct ib_ucontext;
>
> +enum ib_umem_type {
> + IB_UMEM_MEM_MAP = 0,
> + IB_UMEM_IO_MAP = 1
> +};
> +
> struct ib_umem {
> struct ib_ucontext *context;
> + enum ib_umem_type type;
> size_t length;
> int offset;
> int page_size;
> @@ -61,6 +67,8 @@ struct ib_umem_chunk {
>
> #ifdef CONFIG_INFINIBAND_USER_MEM
>
> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
> + size_t size, int access, int dmasync);
> struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> size_t size, int access, int dmasync);
> void ib_umem_release(struct ib_umem *umem);
> @@ -70,6 +78,12 @@ int ib_umem_page_count(struct ib_umem *umem);
>
> #include <linux/err.h>
>
> +static struct ib_umem *ib_iomem_get(struct ib_ucontext *context,
> + unsigned long addr, size_t size,
> + int access, int dmasync) {
> + return ERR_PTR(-EINVAL);
> +}
> +
> static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
> unsigned long addr, size_t size,
> int access, int dmasync) {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <1280427723.6820.56.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-07-29 19:07 ` Tom Tucker
[not found] ` <4C51D15F.8060708-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Tom Tucker @ 2010-07-29 19:07 UTC (permalink / raw)
To: Ralph Campbell
Cc: Tom Tucker, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On 7/29/10 1:22 PM, Ralph Campbell wrote:
> On Thu, 2010-07-29 at 09:25 -0700, Tom Tucker wrote:
>
>> From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>>
>> Add an ib_iomem_get service that converts a vma to an array of
>> physical addresses. This makes it easier for each device driver to
>> add support for the reg_io_mr provider method.
>>
>> Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
>> ---
>>
>> drivers/infiniband/core/umem.c | 248 ++++++++++++++++++++++++++++++++++++++--
>> include/rdma/ib_umem.h | 14 ++
>> 2 files changed, 251 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 415e186..f103956 100644
>> --- a/drivers/infiniband/core/umem.c
>> +++ b/drivers/infiniband/core/umem.c
>>
> ...
>
>> @@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem)
>> return n;
>> }
>> EXPORT_SYMBOL(ib_umem_page_count);
>> +/*
>> + * Return the PFN for the specified address in the vma. This only
>> + * works for a vma that is VM_PFNMAP.
>> + */
>> +static unsigned long follow_io_pfn(struct vm_area_struct *vma,
>> + unsigned long address, int write)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *ptep, pte;
>> + spinlock_t *ptl;
>> + unsigned long pfn;
>> + struct mm_struct *mm = vma->vm_mm;
>> +
>> + BUG_ON(0 == (vma->vm_flags & VM_PFNMAP));
>>
> Why use BUG_ON?
> WARN_ON is more appropriate but
> if (!(vma->vm_flags & VM_PFNMAP))
> return 0;
> seems better.
> In fact, move it outside the inner do loop in ib_get_io_pfn().
>
>
It's paranoia from the debug phase. It's already in the 'outer loop'. I
should just delete it I think.
>> + pgd = pgd_offset(mm, address);
>> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> + return 0;
>> +
>> + pud = pud_offset(pgd, address);
>> + if (pud_none(*pud))
>> + return 0;
>> + if (unlikely(pud_bad(*pud)))
>> + return 0;
>> +
>> + pmd = pmd_offset(pud, address);
>> + if (pmd_none(*pmd))
>> + return 0;
>> + if (unlikely(pmd_bad(*pmd)))
>> + return 0;
>> +
>> + ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>> + pte = *ptep;
>> + if (!pte_present(pte))
>> + goto bad;
>> + if (write && !pte_write(pte))
>> + goto bad;
>> +
>> + pfn = pte_pfn(pte);
>> + pte_unmap_unlock(ptep, ptl);
>> + return pfn;
>> + bad:
>> + pte_unmap_unlock(ptep, ptl);
>> + return 0;
>> +}
>> +
>> +int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
>> + unsigned long start, int len, int write, int force,
>> + unsigned long *pfn_list, struct vm_area_struct **vmas)
>> +{
>> + unsigned long pfn;
>> + int i;
>> + if (len <= 0)
>> + return 0;
>> +
>> + i = 0;
>> + do {
>> + struct vm_area_struct *vma;
>> +
>> + vma = find_vma(mm, start);
>> + if (0 == (vma->vm_flags & VM_PFNMAP))
>> + return -EINVAL;
>>
> Style nit: I would use ! instead of 0 ==
>
>
ok.
>> + if (0 == (vma->vm_flags & VM_IO))
>> + return -EFAULT;
>> +
>> + if (is_vm_hugetlb_page(vma))
>> + return -EFAULT;
>> +
>> + do {
>> + cond_resched();
>> + pfn = follow_io_pfn(vma, start, write);
>> + if (!pfn)
>> + return -EFAULT;
>> + if (pfn_list)
>> + pfn_list[i] = pfn;
>> + if (vmas)
>> + vmas[i] = vma;
>> + i++;
>> + start += PAGE_SIZE;
>> + len--;
>> + } while (len && start < vma->vm_end);
>> + } while (len);
>> + return i;
>> +}
>> +
>> +/**
>> + * ib_iomem_get - DMA map a userspace map of IO memory.
>> + * @context: userspace context to map memory for
>> + * @addr: userspace virtual address to start at
>> + * @size: length of region to map
>> + * @access: IB_ACCESS_xxx flags for memory being mapped
>> + * @dmasync: flush in-flight DMA when the memory region is written
>> + */
>> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
>> + size_t size, int access, int dmasync)
>> +{
>> + struct ib_umem *umem;
>> + unsigned long *pfn_list;
>> + struct ib_umem_chunk *chunk;
>> + unsigned long locked;
>> + unsigned long lock_limit;
>> + unsigned long cur_base;
>> + unsigned long npages;
>> + int ret;
>> + int off;
>> + int i;
>> + DEFINE_DMA_ATTRS(attrs);
>> +
>> + if (dmasync)
>> + dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
>> +
>> + if (!can_do_mlock())
>> + return ERR_PTR(-EPERM);
>> +
>> + umem = kmalloc(sizeof *umem, GFP_KERNEL);
>> + if (!umem)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + umem->type = IB_UMEM_IO_MAP;
>> + umem->context = context;
>> + umem->length = size;
>> + umem->offset = addr & ~PAGE_MASK;
>> + umem->page_size = PAGE_SIZE;
>> + /*
>> + * We ask for writable memory if any access flags other than
>> + * "remote read" are set. "Local write" and "remote write"
>> + * obviously require write access. "Remote atomic" can do
>> + * things like fetch and add, which will modify memory, and
>> + * "MW bind" can change permissions by binding a window.
>> + */
>> + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
>> +
>> + /* IO memory is not hugetlb memory */
>> + umem->hugetlb = 0;
>> +
>> + INIT_LIST_HEAD(&umem->chunk_list);
>> +
>> + pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
>> + if (!pfn_list) {
>> + kfree(umem);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
>> +
>> + down_write(&current->mm->mmap_sem);
>> +
>> + locked = npages + current->mm->locked_vm;
>> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
>> + >> PAGE_SHIFT;
>>
> I think the current kernel code is supposed to look like:
> lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
>
agreed.
>> + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + cur_base = addr & PAGE_MASK;
>> +
>> + ret = 0;
>> + while (npages) {
>> + ret = ib_get_io_pfn(current, current->mm, cur_base,
>> + min_t(unsigned long, npages,
>> + PAGE_SIZE / sizeof(unsigned long *)),
>> + umem->writable,
>> + !umem->writable, pfn_list, NULL);
>> +
>> + if (ret < 0)
>> + goto out;
>> +
>> + cur_base += ret * PAGE_SIZE;
>> + npages -= ret;
>> +
>> + off = 0;
>> +
>> + while (ret) {
>> + chunk = kmalloc(sizeof *chunk +
>> + sizeof(struct scatterlist) *
>> + min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
>> + GFP_KERNEL);
>> + if (!chunk) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> + chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
>> + sg_init_table(chunk->page_list, chunk->nents);
>> + /* The pfn_list we built is a set of Page
>> + * Frame Numbers (PFN) whose physical address
>> + * is PFN << PAGE_SHIFT. The SG DMA mapping
>> + * services expect page addresses, not PFN,
>> + * therefore, we have to do the dma mapping
>> + * ourselves here. */
>> + for (i = 0; i < chunk->nents; ++i) {
>> + sg_set_page(&chunk->page_list[i], 0,
>> + PAGE_SIZE, 0);
>> + chunk->page_list[i].dma_address =
>> + (pfn_list[i] << PAGE_SHIFT);
>> + chunk->page_list[i].dma_length = PAGE_SIZE;
>>
> dma_length isn't present if CONFIG_NEED_SG_DMA_LENGTH is undefined.
>
Right.
> In fact, the kmalloc() should probably be kzalloc since the other
> struct scatterlist entries are uninitialized.
>
>
I think chunk->nents handles this.
>> + }
>> + chunk->nmap = chunk->nents;
>> + ret -= chunk->nents;
>> + off += chunk->nents;
>> + list_add_tail(&chunk->list, &umem->chunk_list);
>> + }
>> +
>> + ret = 0;
>>
> This is redundant since ret == 0 at the end of the loop.
>
>
ok.
>> + }
>> +
>> +out:
>> + if (ret < 0) {
>> + __ib_umem_release(context->device, umem, 0);
>> + kfree(umem);
>> + } else
>> + current->mm->locked_vm = locked;
>> + up_write(&current->mm->mmap_sem);
>> + free_page((unsigned long) pfn_list);
>> +
>> + return ret < 0 ? ERR_PTR(ret) : umem;
>> +}
>> +EXPORT_SYMBOL(ib_iomem_get);
>> +
>> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
>> index 9ee0d2e..2c64d82 100644
>> --- a/include/rdma/ib_umem.h
>> +++ b/include/rdma/ib_umem.h
>> @@ -39,8 +39,14 @@
>>
>> struct ib_ucontext;
>>
>> +enum ib_umem_type {
>> + IB_UMEM_MEM_MAP = 0,
>> + IB_UMEM_IO_MAP = 1
>> +};
>> +
>> struct ib_umem {
>> struct ib_ucontext *context;
>> + enum ib_umem_type type;
>> size_t length;
>> int offset;
>> int page_size;
>> @@ -61,6 +67,8 @@ struct ib_umem_chunk {
>>
>> #ifdef CONFIG_INFINIBAND_USER_MEM
>>
>> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
>> + size_t size, int access, int dmasync);
>> struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>> size_t size, int access, int dmasync);
>> void ib_umem_release(struct ib_umem *umem);
>> @@ -70,6 +78,12 @@ int ib_umem_page_count(struct ib_umem *umem);
>>
>> #include <linux/err.h>
>>
>> +static struct ib_umem *ib_iomem_get(struct ib_ucontext *context,
>> + unsigned long addr, size_t size,
>> + int access, int dmasync) {
>> + return ERR_PTR(-EINVAL);
>> +}
>> +
>> static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
>> unsigned long addr, size_t size,
>> int access, int dmasync) {
>>
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <4C51D15F.8060708-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-29 19:33 ` Ralph Campbell
2010-07-29 20:13 ` Jason Gunthorpe
1 sibling, 0 replies; 17+ messages in thread
From: Ralph Campbell @ 2010-07-29 19:33 UTC (permalink / raw)
To: Tom Tucker
Cc: Tom Tucker, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
I forgot to ask how pages are marked as being locked.
I see that the user process's locked-memory accounting is adjusted,
but the actual pages themselves aren't converted to struct page
with their refcounts incremented.
Presumably, the device which created the
vma->vm_flags & VM_PFNMAP mapping "owns" the pages and IB
is sharing that mapping.
I'm worried about what happens if the vma mapping is modified or
unmapped.
I guess the ummunotify code could be used to tell the
application of this event but that might be after the
page is reallocated and an IB DMA to the page corrupts it.
A few more comments below...
On Thu, 2010-07-29 at 12:07 -0700, Tom Tucker wrote:
> On 7/29/10 1:22 PM, Ralph Campbell wrote:
> > On Thu, 2010-07-29 at 09:25 -0700, Tom Tucker wrote:
> >
> >> From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
> >>
> >> Add an ib_iomem_get service that converts a vma to an array of
> >> physical addresses. This makes it easier for each device driver to
> >> add support for the reg_io_mr provider method.
> >>
> >> Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> >> ---
> >>
> >> drivers/infiniband/core/umem.c | 248 ++++++++++++++++++++++++++++++++++++++--
> >> include/rdma/ib_umem.h | 14 ++
> >> 2 files changed, 251 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >> index 415e186..f103956 100644
> >> --- a/drivers/infiniband/core/umem.c
> >> +++ b/drivers/infiniband/core/umem.c
> >>
> > ...
> >
> >> @@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem)
> >> return n;
> >> }
> >> EXPORT_SYMBOL(ib_umem_page_count);
> >> +/*
> >> + * Return the PFN for the specified address in the vma. This only
> >> + * works for a vma that is VM_PFNMAP.
> >> + */
> >> +static unsigned long follow_io_pfn(struct vm_area_struct *vma,
> >> + unsigned long address, int write)
> >> +{
> >> + pgd_t *pgd;
> >> + pud_t *pud;
> >> + pmd_t *pmd;
> >> + pte_t *ptep, pte;
> >> + spinlock_t *ptl;
> >> + unsigned long pfn;
> >> + struct mm_struct *mm = vma->vm_mm;
> >> +
> >> + BUG_ON(0 == (vma->vm_flags & VM_PFNMAP));
> >>
> > Why use BUG_ON?
> > WARN_ON is more appropriate but
> > if (!(vma->vm_flags & VM_PFNMAP))
> > return 0;
> > seems better.
> > In fact, move it outside the inner do loop in ib_get_io_pfn().
> >
> >
> It's paranoia from the debug phase. It's already in the 'outer loop'. I
> should just delete it I think.
> >> + pgd = pgd_offset(mm, address);
> >> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> >> + return 0;
> >> +
> >> + pud = pud_offset(pgd, address);
> >> + if (pud_none(*pud))
> >> + return 0;
> >> + if (unlikely(pud_bad(*pud)))
> >> + return 0;
> >> +
> >> + pmd = pmd_offset(pud, address);
> >> + if (pmd_none(*pmd))
> >> + return 0;
> >> + if (unlikely(pmd_bad(*pmd)))
> >> + return 0;
> >> +
> >> + ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> >> + pte = *ptep;
> >> + if (!pte_present(pte))
> >> + goto bad;
> >> + if (write && !pte_write(pte))
> >> + goto bad;
> >> +
> >> + pfn = pte_pfn(pte);
> >> + pte_unmap_unlock(ptep, ptl);
> >> + return pfn;
> >> + bad:
> >> + pte_unmap_unlock(ptep, ptl);
> >> + return 0;
> >> +}
> >> +
> >> +int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
> >> + unsigned long start, int len, int write, int force,
> >> + unsigned long *pfn_list, struct vm_area_struct **vmas)
> >> +{
> >> + unsigned long pfn;
> >> + int i;
> >> + if (len <= 0)
> >> + return 0;
> >> +
> >> + i = 0;
> >> + do {
> >> + struct vm_area_struct *vma;
> >> +
> >> + vma = find_vma(mm, start);
> >> + if (0 == (vma->vm_flags & VM_PFNMAP))
> >> + return -EINVAL;
> >>
> > Style nit: I would use ! instead of 0 ==
> >
> >
>
> ok.
>
> >> + if (0 == (vma->vm_flags & VM_IO))
> >> + return -EFAULT;
> >> +
> >> + if (is_vm_hugetlb_page(vma))
> >> + return -EFAULT;
> >> +
> >> + do {
> >> + cond_resched();
> >> + pfn = follow_io_pfn(vma, start, write);
> >> + if (!pfn)
> >> + return -EFAULT;
> >> + if (pfn_list)
> >> + pfn_list[i] = pfn;
> >> + if (vmas)
> >> + vmas[i] = vma;
> >> + i++;
> >> + start += PAGE_SIZE;
> >> + len--;
> >> + } while (len && start < vma->vm_end);
> >> + } while (len);
> >> + return i;
> >> +}
> >> +
> >> +/**
> >> + * ib_iomem_get - DMA map a userspace map of IO memory.
> >> + * @context: userspace context to map memory for
> >> + * @addr: userspace virtual address to start at
> >> + * @size: length of region to map
> >> + * @access: IB_ACCESS_xxx flags for memory being mapped
> >> + * @dmasync: flush in-flight DMA when the memory region is written
> >> + */
> >> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
> >> + size_t size, int access, int dmasync)
> >> +{
> >> + struct ib_umem *umem;
> >> + unsigned long *pfn_list;
> >> + struct ib_umem_chunk *chunk;
> >> + unsigned long locked;
> >> + unsigned long lock_limit;
> >> + unsigned long cur_base;
> >> + unsigned long npages;
> >> + int ret;
> >> + int off;
> >> + int i;
> >> + DEFINE_DMA_ATTRS(attrs);
> >> +
> >> + if (dmasync)
> >> + dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
attrs appears to be unused.
> >> + if (!can_do_mlock())
> >> + return ERR_PTR(-EPERM);
> >> +
> >> + umem = kmalloc(sizeof *umem, GFP_KERNEL);
> >> + if (!umem)
> >> + return ERR_PTR(-ENOMEM);
> >> +
> >> + umem->type = IB_UMEM_IO_MAP;
> >> + umem->context = context;
> >> + umem->length = size;
> >> + umem->offset = addr & ~PAGE_MASK;
> >> + umem->page_size = PAGE_SIZE;
> >> + /*
> >> + * We ask for writable memory if any access flags other than
> >> + * "remote read" are set. "Local write" and "remote write"
> >> + * obviously require write access. "Remote atomic" can do
> >> + * things like fetch and add, which will modify memory, and
> >> + * "MW bind" can change permissions by binding a window.
> >> + */
> >> + umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ);
> >> +
> >> + /* IO memory is not hugetlb memory */
> >> + umem->hugetlb = 0;
> >> +
> >> + INIT_LIST_HEAD(&umem->chunk_list);
> >> +
> >> + pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
> >> + if (!pfn_list) {
> >> + kfree(umem);
> >> + return ERR_PTR(-ENOMEM);
> >> + }
> >> +
> >> + npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
> >> +
> >> + down_write(&current->mm->mmap_sem);
> >> +
> >> + locked = npages + current->mm->locked_vm;
> >> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> >> + >> PAGE_SHIFT;
> >>
> > I think the current kernel code is supposed to look like:
> > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> >
>
> agreed.
>
> >> + if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> >> + ret = -ENOMEM;
> >> + goto out;
> >> + }
> >> +
> >> + cur_base = addr & PAGE_MASK;
> >> +
> >> + ret = 0;
> >> + while (npages) {
> >> + ret = ib_get_io_pfn(current, current->mm, cur_base,
> >> + min_t(unsigned long, npages,
> >> + PAGE_SIZE / sizeof(unsigned long *)),
> >> + umem->writable,
> >> + !umem->writable, pfn_list, NULL);
> >> +
> >> + if (ret < 0)
> >> + goto out;
> >> +
> >> + cur_base += ret * PAGE_SIZE;
> >> + npages -= ret;
> >> +
> >> + off = 0;
> >> +
> >> + while (ret) {
> >> + chunk = kmalloc(sizeof *chunk +
> >> + sizeof(struct scatterlist) *
> >> + min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
> >> + GFP_KERNEL);
> >> + if (!chunk) {
> >> + ret = -ENOMEM;
> >> + goto out;
> >> + }
> >> + chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
> >> + sg_init_table(chunk->page_list, chunk->nents);
> >> + /* The pfn_list we built is a set of Page
> >> + * Frame Numbers (PFN) whose physical address
> >> + * is PFN << PAGE_SHIFT. The SG DMA mapping
> >> + * services expect page addresses, not PFN,
> >> + * therefore, we have to do the dma mapping
> >> + * ourselves here. */
> >> + for (i = 0; i < chunk->nents; ++i) {
> >> + sg_set_page(&chunk->page_list[i], 0,
> >> + PAGE_SIZE, 0);
> >> + chunk->page_list[i].dma_address =
> >> + (pfn_list[i] << PAGE_SHIFT);
> >> + chunk->page_list[i].dma_length = PAGE_SIZE;
> >>
> > dma_length isn't present if CONFIG_NEED_SG_DMA_LENGTH is undefined.
> >
> Right.
I forgot to mention that access to dma_address is usually through
sg_dma_address() and is supposed to be architecture specific
although it looks like all the architectures use the asm-generic
version of scatterlist.h now.
I'm not sure how the core kernel folks would like IB doing
sg_dma_length(sg) = N;
> > In fact, the kmalloc() should probably be kzalloc since the other
> > struct scatterlist entries are uninitialized.
> >
> >
> I think chunk->nents handles this.
Not really.
My point is there are other fields besides .dma_address
in struct scatterlist which are not initialized and perhaps
not used by mthca/mlx4 for implementing ibv_reg_io_mr()
but shouldn't just be uninitialized either.
> >> + }
> >> + chunk->nmap = chunk->nents;
> >> + ret -= chunk->nents;
> >> + off += chunk->nents;
> >> + list_add_tail(&chunk->list, &umem->chunk_list);
> >> + }
> >> +
> >> + ret = 0;
> >>
> > This is redundant since ret == 0 at the end of the loop.
> >
> >
> ok.
> >> + }
> >> +
> >> +out:
> >> + if (ret < 0) {
> >> + __ib_umem_release(context->device, umem, 0);
> >> + kfree(umem);
> >> + } else
> >> + current->mm->locked_vm = locked;
> >> + up_write(&current->mm->mmap_sem);
> >> + free_page((unsigned long) pfn_list);
> >> +
> >> + return ret < 0 ? ERR_PTR(ret) : umem;
> >> +}
> >> +EXPORT_SYMBOL(ib_iomem_get);
> >> +
> >> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> >> index 9ee0d2e..2c64d82 100644
> >> --- a/include/rdma/ib_umem.h
> >> +++ b/include/rdma/ib_umem.h
> >> @@ -39,8 +39,14 @@
> >>
> >> struct ib_ucontext;
> >>
> >> +enum ib_umem_type {
> >> + IB_UMEM_MEM_MAP = 0,
> >> + IB_UMEM_IO_MAP = 1
> >> +};
> >> +
> >> struct ib_umem {
> >> struct ib_ucontext *context;
> >> + enum ib_umem_type type;
> >> size_t length;
> >> int offset;
> >> int page_size;
> >> @@ -61,6 +67,8 @@ struct ib_umem_chunk {
> >>
> >> #ifdef CONFIG_INFINIBAND_USER_MEM
> >>
> >> +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
> >> + size_t size, int access, int dmasync);
> >> struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >> size_t size, int access, int dmasync);
> >> void ib_umem_release(struct ib_umem *umem);
> >> @@ -70,6 +78,12 @@ int ib_umem_page_count(struct ib_umem *umem);
> >>
> >> #include <linux/err.h>
> >>
> >> +static struct ib_umem *ib_iomem_get(struct ib_ucontext *context,
> >> + unsigned long addr, size_t size,
> >> + int access, int dmasync) {
> >> + return ERR_PTR(-EINVAL);
> >> +}
> >> +
> >> static inline struct ib_umem *ib_umem_get(struct ib_ucontext *context,
> >> unsigned long addr, size_t size,
> >> int access, int dmasync) {
> >>
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <4C51D15F.8060708-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-29 19:33 ` Ralph Campbell
@ 2010-07-29 20:13 ` Jason Gunthorpe
[not found] ` <20100729201317.GA7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
1 sibling, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2010-07-29 20:13 UTC (permalink / raw)
To: Tom Tucker
Cc: Ralph Campbell, Tom Tucker,
rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On Thu, Jul 29, 2010 at 02:07:11PM -0500, Tom Tucker wrote:
>>> Add an ib_iomem_get service that converts a vma to an array of
>>> physical addresses. This makes it easier for each device driver to
>>> add support for the reg_io_mr provider method.
Tom, could you say a few words on what this is for?
Also, I'd like to see a strong defence of this new user space API
particularly:
1) Why can't this be done with the existing ibv_reg_mr, like huge
pages are.
2) How is it possible for userspace to know when it should use
ibv_reg_mr vs ibv_reg_io_mr?
3) Why can't ibv_dereg_io_mr be entirely handled by ibv_dereg_mr?
On first glance, this seems like a hugely bad API to me :)
Also, if new user space APIs are really needed, they do need man pages
too ...
Jason
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729201317.GA7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-29 20:29 ` Tom Tucker
[not found] ` <4C51E4B1.6040800-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Tom Tucker @ 2010-07-29 20:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ralph Campbell, Tom Tucker,
rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On 7/29/10 3:13 PM, Jason Gunthorpe wrote:
> On Thu, Jul 29, 2010 at 02:07:11PM -0500, Tom Tucker wrote:
>
>>>> Add an ib_iomem_get service that converts a vma to an array of
>>>> physical addresses. This makes it easier for each device driver to
>>>> add support for the reg_io_mr provider method.
>>>>
> Tom, could you say a few words on what this is for?
>
>
The proposed interface is there to allow a user-mode application to mmap
kernel memory and/or device memory as a data source or sink for RDMA.
This is useful for a number of applications, for example:
- An application that wishes to provide kernel statistics to a
monitoring host with zero overhead. For example, a set of compute nodes
that are being monitored by a statistics gathering application (Ovis),
but don't want to introduce scheduler jitter into the compute node. The
stats are provided in a kernel allocated buffer and the monitoring app
'gathers' those statistics with RDMA_READ at zero cost to the monitored
host.
- An application that wishes to provide network (RDMA) access to a large
static memory hung off of its PCI-e bus, thereby creating a distributed
and persistent memory store.
Those are the only two that I've written so far that use these verbs.
> Also, I'd like to see a strong defence of this new user space API
> particularly:
> 1) Why can't this be done with the existing ibv_reg_mr, like huge
> pages are.
>
The ibv_reg_mr API assumes that the memory being registered was
allocated in user mode and is part of the current->mm VMA. It uses
get_user_pages which will scoff and jeer at kernel memory.
> 2) How is it possible for userspace to know when it should use
> ibv_reg_mr vs ibv_reg_io_mr?
>
By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver,
I know that the memory I am mapping is a kernel buffer.
> 3) Why can't ibv_dereg_io_mr be entirely handled by ibv_dereg_mr?
>
>
Probably could.
> On first glance, this seems like a hugely bad API to me :)
>
>
Well, hopefully now that its purpose is revealed you will change your
view and we can collaboratively make it better :-)
> Also, if new user space APIs are really needed, they do need man pages
> too ...
>
>
Of course.
> Jason
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <4C51E4B1.6040800-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2010-07-29 20:41 ` Jason Gunthorpe
[not found] ` <20100729204150.GC7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2010-07-29 20:41 UTC (permalink / raw)
To: Tom Tucker
Cc: Ralph Campbell, Tom Tucker,
rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On Thu, Jul 29, 2010 at 03:29:37PM -0500, Tom Tucker wrote:
>> Also, I'd like to see a strong defence of this new user space API
>> particularly:
>> 1) Why can't this be done with the existing ibv_reg_mr, like huge
>> pages are.
> The ibv_reg_mr API assumes that the memory being registered was
> allocated in user mode and is part of the current->mm VMA. It uses
> get_user_pages which will scoff and jeer at kernel memory.
I'm confused? What is the vaddr input then? How does userspace get
that value? Isn't it created by mmap or the like?
Ie for the PCI-E example you gave I assume the flow is that userspace
mmaps devices/pci0000:00/0000:00:XX.X/resourceX to get the IO memory
and then passes that through to ibv_reg_mr?
IMHO, ibv_reg_mr API should accept any valid vaddr available to the
process and if it bombs for certain kinds of vaddrs then it is just a
bug..
>> 2) How is it possible for userspace to know when it should use
>> ibv_reg_mr vs ibv_reg_io_mr?
> By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver,
> I know that the memory I am mapping is a kernel buffer.
Yah, but what if the next version of your vmstat driver changes the
kind of memory it returns?
>> On first glance, this seems like a hugely bad API to me :)
>> Well, hopefully now that its purpose is revealed you will change your
> view and we can collaboratively make it better :-)
I don't object to the idea, just to the notion that user space is
supposed to somehow know that one vaddr is different from another
vaddr and call the right API - seems impossible to use correctly to
me.
What would you have to do to implement this using scheme using
ibv_reg_mr as the entry point?
Jason
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729204150.GC7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-29 20:54 ` Tom Tucker
2010-07-29 22:49 ` [Suspected SPAM] " Ralph Campbell
1 sibling, 0 replies; 17+ messages in thread
From: Tom Tucker @ 2010-07-29 20:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ralph Campbell, Tom Tucker,
rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On 7/29/10 3:41 PM, Jason Gunthorpe wrote:
> On Thu, Jul 29, 2010 at 03:29:37PM -0500, Tom Tucker wrote:
>
>
>>> Also, I'd like to see a strong defence of this new user space API
>>> particularly:
>>> 1) Why can't this be done with the existing ibv_reg_mr, like huge
>>> pages are.
>>>
>
>> The ibv_reg_mr API assumes that the memory being registered was
>> allocated in user mode and is part of the current->mm VMA. It uses
>> get_user_pages which will scoff and jeer at kernel memory.
>>
> I'm confused? What is the vaddr input then? How does userspace get
> that value? Isn't it created by mmap or the like?
>
Yes.
> Ie for the PCI-E example you gave I assume the flow is that userspace
> mmaps devices/pci0000:00/0000:00:XX.X/resourceX to get the IO memory
> and then passes that through to ibv_reg_mr?
>
>
Not exactly. It would mmap the device that manages the adapter hosting
the memory.
> IMHO, ibv_reg_mr API should accept any valid vaddr available to the
> process and if it bombs for certain kinds of vaddrs then it is just a
> bug..
>
>
Perhaps.
>>> 2) How is it possible for userspace to know when it should use
>>> ibv_reg_mr vs ibv_reg_io_mr?
>>>
>
>> By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver,
>> I know that the memory I am mapping is a kernel buffer.
>>
> Yah, but what if the next version of your vmstat driver changes the
> kind of memory it returns?
>
>
It's a general service for a class of memory, not an enabler for a
particular application's peculiarities.
>>> On first glance, this seems like a hugely bad API to me :)
>>>
>
>> Well hopefully now that its purpose is revealed you will change your
>> view and we can collaboratively make it better :-)
>>
> I don't object to the idea, just to the notion that user space is
> supposed to somehow know that one vaddr is different from another
> vaddr and call the right API - seems impossible to use correctly to
> me.
>
> What would you have to do to implement this scheme using
> ibv_reg_mr as the entry point?
>
>
The kernel service on the other side of the ibv_reg_mr verb could divine the
necessary information by searching the VMAs owned by current and looking
at vm_flags to decide what type of memory it was.
> Jason
>
* Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729162509.14674.34237.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
2010-07-29 18:22 ` Ralph Campbell
@ 2010-07-29 21:20 ` Tom Tucker
1 sibling, 0 replies; 17+ messages in thread
From: Tom Tucker @ 2010-07-29 21:20 UTC (permalink / raw)
To: Tom Tucker
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
brandt-4OHPYypu0djtX7QSmKvirg, swise-/Yg/VP3ZvrM
On 7/29/10 11:25 AM, Tom Tucker wrote:
> From: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>
> Add an ib_iomem_get service that converts a vma to an array of
> physical addresses. This makes it easier for each device driver to
> add support for the reg_io_mr provider method.
>
> Signed-off-by: Tom Tucker <tom-/Yg/VP3ZvrM@public.gmane.org>
> ---
>
> drivers/infiniband/core/umem.c | 248 ++++++++++++++++++++++++++++++++++++++--
> include/rdma/ib_umem.h | 14 ++
> 2 files changed, 251 insertions(+), 11 deletions(-)
>
> [...snip...]
>
> + /* The pfn_list we built is a set of Page
> + * Frame Numbers (PFN) whose physical address
> + * is PFN << PAGE_SHIFT. The SG DMA mapping
> + * services expect page addresses, not PFN,
> + * therefore, we have to do the dma mapping
> + * ourselves here. */
> + for (i = 0; i < chunk->nents; ++i) {
> + sg_set_page(&chunk->page_list[i], 0,
> + PAGE_SIZE, 0);
> + chunk->page_list[i].dma_address =
> + (pfn_list[i] << PAGE_SHIFT);
>
This is not architecture independent. Does anyone have any thoughts on
how this ought to be done?
> + chunk->page_list[i].dma_length = PAGE_SIZE;
> + }
> + chunk->nmap = chunk->nents;
> + ret -= chunk->nents;
> + off += chunk->nents;
> + list_add_tail(&chunk->list, &umem->chunk_list);
> + }
> +
> + ret = 0;
> + }
> +
> +
> + ret = 0;
> + }
> +
>
[...snip...]
* Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729204150.GC7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-29 20:54 ` Tom Tucker
@ 2010-07-29 22:49 ` Ralph Campbell
[not found] ` <1280443789.6820.153.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
1 sibling, 1 reply; 17+ messages in thread
From: Ralph Campbell @ 2010-07-29 22:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tom Tucker, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On Thu, 2010-07-29 at 13:41 -0700, Jason Gunthorpe wrote:
> On Thu, Jul 29, 2010 at 03:29:37PM -0500, Tom Tucker wrote:
>
> >> Also, I'd like to see a strong defence of this new user space API
> >> particularly:
> >> 1) Why can't this be done with the existing ibv_reg_mr, like huge
> >> pages are.
>
> > The ibv_reg_mr API assumes that the memory being registered was
> > allocated in user mode and is part of the current->mm VMA. It uses
> > get_user_pages which will scoff and jeer at kernel memory.
>
> I'm confused? What is the vaddr input then? How does userspace get
> that value? Isn't it created by mmap or the like?
>
> Ie for the PCI-E example you gave I assume the flow is that userspace
> mmaps devices/pci0000:00/0000:00:XX.X/resourceX to get the IO memory
> and then passes that through to ibv_reg_mr?
>
> IMHO, ibv_reg_mr API should accept any valid vaddr available to the
> process and if it bombs for certain kinds of vaddrs then it is just a
> bug..
The reason that get_user_pages() returns an error for
VM_IO and VM_PFNMAP mappings is that there may not be a struct page
associated with the physical address, so there is no general way
for the VM layer to allow sharing the page and to know when the page
(not the mapping into user space) is no longer being used.
I looked at this issue for supporting the Nvidia CUDA driver, which
sets the VM_IO flag on its memory mmapped into user space.
Nvidia wanted to add a set of callback functions and a new
get_driver_pages() function to be called instead of
get_user_pages(), very similar to what is being proposed here for
VM_PFNMAP.
My solution to the problem was to change the Nvidia driver by
not setting VM_IO (since the memory was normal kernel memory with
a struct page) but Nvidia didn't like that solution.
> >> 2) How is it possible for userspace to know when it should use
> >> ibv_reg_mr vs ibv_reg_io_mr?
>
> > By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver,
> > I know that the memory I am mapping is a kernel buffer.
>
> Yah, but what if the next version of your vmstat driver changes the
> kind of memory it returns?
>
> >> On first glance, this seems like a hugely bad API to me :)
>
> > Well hopefully now that its purpose is revealed you will change your
> > view and we can collaboratively make it better :-)
>
> I don't object to the idea, just to the notion that user space is
> supposed to somehow know that one vaddr is different from another
> vaddr and call the right API - seems impossible to use correctly to
> me.
I agree unless there is some function that can be called which
calls into the kernel and returns an enum or something to
indicate which call to use.
> What would you have to do to implement this scheme using
> ibv_reg_mr as the entry point?
>
> Jason
You would need to modify ib_umem_get() to check for the VM_PFNMAP
flag and build the struct ib_umem similar to the proposed
ib_iomem_get(). However, the page reference counting/sharing issue
would need to be solved. I think there are kernel level callbacks
for this that could be used.
* Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <1280443789.6820.153.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-07-29 22:57 ` Jason Gunthorpe
[not found] ` <20100729225705.GC26390-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2010-07-29 22:57 UTC (permalink / raw)
To: Ralph Campbell
Cc: Tom Tucker, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
> You would need to modify ib_umem_get() to check for the VM_PFNMAP
> flag and build the struct ib_umem similar to the proposed
> ib_iomem_get(). However, the page reference counting/sharing issue
> would need to be solved. I think there are kernel level callbacks
> for this that could be used.
But in this case the pages are already mmaped into a user process,
there must be some mechanism to ensure they don't get pulled away?!
Though, I guess, what happens if you hot un-plug the PCI-E card that
has a process mmaping its memory?!
What happens if you RDMA READ from PCI-E address space that does not
have any device responding?
Jason
* Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729225705.GC26390-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-07-29 23:13 ` Ralph Campbell
2010-07-30 2:16 ` Tom Tucker
1 sibling, 0 replies; 17+ messages in thread
From: Ralph Campbell @ 2010-07-29 23:13 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tom Tucker, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On Thu, 2010-07-29 at 15:57 -0700, Jason Gunthorpe wrote:
> > You would need to modify ib_umem_get() to check for the VM_PFNMAP
> > flag and build the struct ib_umem similar to the proposed
> > ib_iomem_get(). However, the page reference counting/sharing issue
> > would need to be solved. I think there are kernel level callbacks
> > for this that could be used.
>
> But in this case the pages are already mmaped into a user process,
> there must be some mechanism to ensure they don't get pulled away?!
Yes. The VM_PFNMAP flag means the device is the owner of the memory
and has to handle unmap. Since the rest of the kernel won't allow
that memory to be shared, swapped out, etc., only the driver has to
be notified, or handle it, when the memory goes away.
> Though, I guess, what happens if you hot un-plug the PCI-E card that
> has a process mmaping its memory?!
The hot un-plug would have to close the device first, which would
unmap the memory, or the device would not allow the hot un-plug
to happen by saying it is busy.
> What happens if you RDMA READ from PCI-E address space that does not
> have any device responding?
Well, RDMA reads with a Mellanox-style HCA may cause problems
because I don't know if they support PCIe-to-PCIe DMAs.
If the device isn't responding to PCIe reads of its memory,
the root complex will time out the read transaction and signal
the initiator (CPU), which will cause a bus error.
* Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
[not found] ` <20100729225705.GC26390-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-29 23:13 ` Ralph Campbell
@ 2010-07-30 2:16 ` Tom Tucker
1 sibling, 0 replies; 17+ messages in thread
From: Tom Tucker @ 2010-07-30 2:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ralph Campbell, rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
brandt-4OHPYypu0djtX7QSmKvirg@public.gmane.org,
swise-/Yg/VP3ZvrM@public.gmane.org
On 7/29/10 5:57 PM, Jason Gunthorpe wrote:
>> You would need to modify ib_umem_get() to check for the VM_PFNMAP
>> flag and build the struct ib_umem similar to the proposed
>> ib_iomem_get(). However, the page reference counting/sharing issue
>> would need to be solved. I think there are kernel level callbacks
>> for this that could be used.
>>
> But in this case the pages are already mmaped into a user process,
> there must be some mechanism to ensure they don't get pulled away?!
>
>
This is not virtual memory. It's real memory.
> Though, I guess, what happens if you hot un-plug the PCI-E card that
> has a process mmaping its memory?!
>
>
Exactly. The memory would have to be physically detached for it to get
'pulled away'.
> What happens if you RDMA READ from PCI-E address space that does not
> have any device responding?
>
>
bus error.
> Jason
>
end of thread, other threads:[~2010-07-30 2:16 UTC | newest]
Thread overview: 17+ messages
2010-07-29 16:24 [RFC PATCH 0/4] ibverbs: new verbs for registering I/O memory Tom Tucker
[not found] ` <20100729162339.14674.15788.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
2010-07-29 16:25 ` [RFC PATCH 1/4] ibverbs: Add new provider verb for I/O memory registration Tom Tucker
2010-07-29 16:25 ` [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service Tom Tucker
[not found] ` <20100729162509.14674.34237.stgit-T4OLL4TyM9aNDNWfRnPdfg@public.gmane.org>
2010-07-29 18:22 ` Ralph Campbell
[not found] ` <1280427723.6820.56.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-29 19:07 ` Tom Tucker
[not found] ` <4C51D15F.8060708-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-29 19:33 ` Ralph Campbell
2010-07-29 20:13 ` Jason Gunthorpe
[not found] ` <20100729201317.GA7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-29 20:29 ` Tom Tucker
[not found] ` <4C51E4B1.6040800-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2010-07-29 20:41 ` Jason Gunthorpe
[not found] ` <20100729204150.GC7920-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-29 20:54 ` Tom Tucker
2010-07-29 22:49 ` [Suspected SPAM] " Ralph Campbell
[not found] ` <1280443789.6820.153.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
2010-07-29 22:57 ` Jason Gunthorpe
[not found] ` <20100729225705.GC26390-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2010-07-29 23:13 ` Ralph Campbell
2010-07-30 2:16 ` Tom Tucker
2010-07-29 21:20 ` Tom Tucker
2010-07-29 16:25 ` [RFC PATCH 3/4] uverbs_cmd: Add uverbs command definitions for reg_io_mr Tom Tucker
2010-07-29 16:25 ` [RFC PATCH 4/4] mthca: Add support for reg_io_mr and unreg_io_mr Tom Tucker