* [PATCH rdma-next 4/4] IB/mlx5: Use u64 for UMR length
From: Leon Romanovsky @ 2016-11-27 13:18 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Maor Gottlieb
In-Reply-To: <1480252702-8005-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
From: Maor Gottlieb <maorg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
The fast_registration length is used to convey length for memory
registrations through UMR which can be of any size up to 2^64.
Change the length type to be u64.
Fixes: 968e78dd9644 ('IB/mlx5: Enhance UMR support to allow partial page table update')
Signed-off-by: Maor Gottlieb <maorg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7d68990..0b938ad 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -418,7 +418,7 @@ struct mlx5_umr_wr {
struct ib_pd *pd;
unsigned int page_shift;
unsigned int npages;
- u32 length;
+ u64 length;
int access_flags;
u32 mkey;
};
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH rdma-next 3/4] IB/mlx5: Avoid system crash when enabling many VFs
From: Leon Romanovsky @ 2016-11-27 13:18 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Eli Cohen, Maor Gottlieb
In-Reply-To: <1480252702-8005-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
From: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
When enabling many VFs, the total amount of DMA mappings increase
significantly. This causes DMA allocations to take a lot of time
since they are serialized in the kernel.
As a result the driver enters into fatal condition due to
timeout and the system hangs. To recover from this we disable
MR cache for VFs.
PFs will still have a full cache and VFs cache can be manipulated
as usual after driver load.
Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Maor Gottlieb <maorg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 4e90124..501af9e 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -628,7 +628,8 @@ int mlx5_mr_cache_init(struct mlx5_ib_dev *dev)
ent->order = i + 2;
ent->dev = dev;
- if (dev->mdev->profile->mask & MLX5_PROF_MASK_MR_CACHE)
+ if ((dev->mdev->profile->mask & MLX5_PROF_MASK_MR_CACHE) &&
+ (mlx5_core_is_pf(dev->mdev)))
limit = dev->mdev->profile->mr_cache[i].limit;
else
limit = 0;
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH rdma-next 2/4] IB/mlx5: Assign SRQ type earlier
From: Leon Romanovsky @ 2016-11-27 13:18 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Maor Gottlieb
In-Reply-To: <1480252702-8005-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
From: Maor Gottlieb <maorg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Move the SRQ type assignment to be before actually using it
in create_srq_user() and in create_srq_kernel() functions.
Fixes: af1ba291c5e4 ('{net, IB}/mlx5: Refactor internal SRQ API')
Signed-off-by: Maor Gottlieb <maorg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Reviewed-by: Majd Dibbiny <majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
drivers/infiniband/hw/mlx5/srq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 3857dbd..729b069 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -282,6 +282,7 @@ struct ib_srq *mlx5_ib_create_srq(struct ib_pd *pd,
mlx5_ib_dbg(dev, "desc_size 0x%x, req wr 0x%x, srq size 0x%x, max_gs 0x%x, max_avail_gather 0x%x\n",
desc_size, init_attr->attr.max_wr, srq->msrq.max, srq->msrq.max_gs,
srq->msrq.max_avail_gather);
+ in.type = init_attr->srq_type;
if (pd->uobject)
err = create_srq_user(pd, srq, &in, udata, buf_size);
@@ -294,7 +295,6 @@ struct ib_srq *mlx5_ib_create_srq(struct ib_pd *pd,
goto err_srq;
}
- in.type = init_attr->srq_type;
in.log_size = ilog2(srq->msrq.max);
in.wqe_shift = srq->msrq.wqe_shift - 4;
if (srq->wq_sig)
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH rdma-next 1/4] IB/mlx4: Fix out-of-range array index in destroy qp flow
From: Leon Romanovsky @ 2016-11-27 13:18 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jack Morgenstein
In-Reply-To: <1480252702-8005-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
From: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
For non-special QPs, the port value becomes non-zero only at the
RESET-to-INIT transition. If the QP has not undergone that transition,
its port number value is still zero.
If such a QP is destroyed before being moved out of the RESET state,
subtracting one from the qp port number results in a negative value.
Using that negative value as an index into the qp1_proxy array
results in an out-of-bounds array reference.
Fix this by testing that the QP type is one that uses qp1_proxy before
using the port number. For special QPs of all types, the port number is
specified at QP creation time.
Fixes: 9433c188915c ("IB/mlx4: Invoke UPDATE_QP for proxy QP1 on MAC changes")
Signed-off-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
drivers/infiniband/hw/mlx4/qp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 570bc86..6973947 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -1280,7 +1280,8 @@ static int _mlx4_ib_destroy_qp(struct ib_qp *qp)
if (is_qp0(dev, mqp))
mlx4_CLOSE_PORT(dev->dev, mqp->port);
- if (dev->qp1_proxy[mqp->port - 1] == mqp) {
+ if (mqp->mlx4_ib_qp_type == MLX4_IB_QPT_PROXY_GSI &&
+ dev->qp1_proxy[mqp->port - 1] == mqp) {
mutex_lock(&dev->qp1_proxy_lock[mqp->port - 1]);
dev->qp1_proxy[mqp->port - 1] = NULL;
mutex_unlock(&dev->qp1_proxy_lock[mqp->port - 1]);
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH rdma-next 0/4] Mellanox ConnectX-3/ConnectX-4 fixes for 4.10
From: Leon Romanovsky @ 2016-11-27 13:18 UTC (permalink / raw)
To: dledford-H+wXaHxf7aLQT0dZR+AlfA; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Doug,
Please find below four different fixes to ConnectX-3/ConnectX-4.
This patchset was generated against v4.9-rc6 and targeted for 4.10.
Available in the "topic/mlx-fixes-4.10" topic branch of this git repo:
git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git
Or for browsing:
https://git.kernel.org/cgit/linux/kernel/git/leon/linux-rdma.git/log/?h=topic/mlx-fixes-4.10
Thanks
Eli Cohen (1):
IB/mlx5: Avoid system crash when enabling many VFs
Jack Morgenstein (1):
IB/mlx4: Fix out-of-range array index in destroy qp flow
Maor Gottlieb (2):
IB/mlx5: Assign SRQ type earlier
IB/mlx5: Use u64 for UMR length
drivers/infiniband/hw/mlx4/qp.c | 3 ++-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
drivers/infiniband/hw/mlx5/mr.c | 3 ++-
drivers/infiniband/hw/mlx5/srq.c | 2 +-
4 files changed, 6 insertions(+), 4 deletions(-)
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Haggai Eran @ 2016-11-27 8:16 UTC (permalink / raw)
To: Jason Gunthorpe, Christian König
Cc: Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161125193252.GC16504-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
>
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.
On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.
...
> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaing piece for RDMA is some kind of DMA
> core support for p2p address translation..
Yes, this is definitely something we need. I think Will Davis's patches
are a good start.
Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.
Haggai
^ permalink raw reply
* Re: [PATCH] net/mlx5: drop duplicate header delay.h
From: David Miller @ 2016-11-26 1:33 UTC (permalink / raw)
To: geliangtang; +Cc: saeedm, matanb, leonro, netdev, linux-rdma, linux-kernel
In-Reply-To: <03d5a2a0f03458cdb4f2b139cfabc11b5c334f95.1479990943.git.geliangtang@gmail.com>
From: Geliang Tang <geliangtang@gmail.com>
Date: Thu, 24 Nov 2016 21:58:33 +0800
> Drop duplicate header delay.h from mlx5/core/main.c.
>
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Applied.
^ permalink raw reply
* Re: [patch] net/mlx5: remove a duplicate condition
From: David Miller @ 2016-11-26 1:30 UTC (permalink / raw)
To: dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA
Cc: saeedm-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
leonro-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161124110345.GC17225@mwanda>
From: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date: Thu, 24 Nov 2016 14:03:45 +0300
> We verified that MLX5_FLOW_CONTEXT_ACTION_COUNT was set on the first
> line of the function so we don't need to check again here.
>
> Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Applied.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Alex Deucher @ 2016-11-25 23:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Christian König,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161125193410.GD16504-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On Fri, Nov 25, 2016 at 2:34 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not have CPU address at all - only GPU one.
>
> But you don't expect RDMA to work in the case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
>
Blame 32 bit systems and GPUs with tons of vram :)
I think resizable bars are finally coming in a useful way so this
should go away soon.
Alex
^ permalink raw reply
* Re: [PATCH 1/2] SRP transport: Move queuecommand() wait code to SCSI core
From: Max Gurtovoy @ 2016-11-25 23:21 UTC (permalink / raw)
To: Bart Van Assche, James Bottomley, Martin K. Petersen
Cc: Doug Ledford, Christoph Hellwig, Sagi Grimberg,
linux-scsi@vger.kernel.org, linux-rdma@vger.kernel.org
In-Reply-To: <649cb037-3196-df4d-0d31-47f040ce44a2@sandisk.com>
On 11/23/2016 2:17 AM, Bart Van Assche wrote:
> Additionally, rename srp_wait_for_queuecommand() into
> scsi_wait_for_queuecommand() and add a comment about the
> queuecommand() call from scsi_send_eh_cmnd().
>
> Note: this patch changes scsi_internal_device_block from a function
> that did not sleep into a function that may sleep. This is fine for
> all callers of this function:
> * scsi_internal_device_block() is called from the mpt3sas device while
> that driver holds the ioc->dm_cmds.mutex. This means that the mpt3sas
> driver calls this function from thread context.
> * scsi_target_block() is called by __iscsi_block_session() from
> kernel thread context and with IRQs enabled.
> * The SRP transport code also calls scsi_target_block() from kernel
> thread context while sleeping is allowed.
> * The snic driver also calls scsi_target_block() from a context from
> which sleeping is allowed. The scsi_target_block() call namely occurs
> immediately after a scsi_flush_work() call.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: James Bottomley <jejb@linux.vnet.ibm.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Doug Ledford <dledford@redhat.com>
> ---
> drivers/scsi/scsi_lib.c | 41 +++++++++++++++++++++++++++++++++++++--
> drivers/scsi/scsi_transport_srp.c | 41 ++++++---------------------------------
> 2 files changed, 45 insertions(+), 37 deletions(-)
>
> +
> +/**
> + * scsi_wait_for_queuecommand() - wait for ongoing queuecommand() calls
> + * @shost: SCSI host pointer.
the arg is sdev.
> + *
> + * Wait until the ongoing shost->hostt->queuecommand() calls that are
> + * invoked from scsi_request_fn() have finished.
> + */
> +static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
> +{
> + WARN_ON_ONCE(sdev->host->use_blk_mq);
> +
> + while (scsi_request_fn_active(sdev))
> + msleep(20);
> +}
> +
> +/**
> * scsi_device_quiesce - Block user issued commands.
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-25 21:18 UTC (permalink / raw)
To: Christian König
Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Christian König,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <a98185d9-ffb1-6469-4272-2d1222600825-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
On Fri, Nov 25, 2016 at 09:40:10PM +0100, Christian König wrote:
> We call this "userptr" and it's just a combination of get_user_pages() on
> command submission and making sure the returned list of pages stays valid
> using a MMU notifier.
Doesn't that still pin the page?
> The "big" problem with this approach is that it is horrible slow. I mean
> seriously horrible slow so that we actually can't use it for some of the
> purposes we wanted to use it.
>
> >The code moving the page will move it and the next GPU command that
> >needs it will refault it in the usual way, just like the CPU would.
>
> And here comes the problem. CPU do this on a page by page basis, so they
> fault only what needed and everything else gets filled in on demand. This
> results that faulting a page is relatively light weight operation.
>
> But for GPU command submission we don't know which pages might be accessed
> beforehand, so what we do is walking all possible pages and make sure all of
> them are present.
Little confused why this is slow? So you fault the entire user MM into
your page tables at start of day and keep track of it with mmu
notifiers?
> >This might be much more efficient since it optimizes for the common
> >case of unchanging translation tables.
>
> Yeah, completely agree. It works perfectly fine as long as you don't have
> two drivers trying to mess with the same page.
Well, the idea would be to not have the GPU block the other driver
beyond hinting that the page shouldn't be swapped out.
> >This assumes the commands are fairly short lived of course, the
> >expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually don't take
> longer than a few milliseconds, but compute command submission can easily
> take multiple hours.
So, that won't work - you have the same issue as RDMA with work loads
like that.
If you can't somehow fence the hardware then pinning is the only
solution. Felix has the right kind of suggestion for what is needed -
globally stop the GPU, fence the DMA, fix the page tables, and start
it up again. :\
> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)
Right. The advantage of pinning is it tells the other stuff not to
touch the page and doesn't block it, MMU notifiers have to be able to
block&fence quickly.
> I'm thinking on this problem for about a year now and going in circles for
> quite a while. So if you have ideas on this even if they sound totally
> crazy, feel free to come up.
Well, it isn't a software problem. From what I've seen in this thread
the GPU application requires coherent page table mirroring, so the
only full & complete solution is going to be to actually implement
that somehow in GPU hardware.
Everything else is going to be deeply flawed somehow. Linux just
doesn't have the support for this kind of stuff - and I'm honestly not
sure something better is even possible considering the hardware
constraints....
This doesn't have to be faulting, but really anything that lets you
pause the GPU DMA and reload the page tables.
You might look at trying to use the IOMMU and/or PCI ATS in very new
hardware. IIRC the physical IOMMU hardware can do the fault and fence
and block stuff, but I'm not sure about software support for using the
IOMMU to create coherent user page table mirrors - that is something
Linux doesn't do today. But there is demand for this kind of capability..
Jason
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Felix Kuehling @ 2016-11-25 20:51 UTC (permalink / raw)
To: Christian König, Jason Gunthorpe, Christian König
Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Blinzer, Paul,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <a98185d9-ffb1-6469-4272-2d1222600825-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
On 16-11-25 03:40 PM, Christian König wrote:
> Am 25.11.2016 um 20:32 schrieb Jason Gunthorpe:
>> This assumes the commands are fairly short lived of course, the
>> expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually
> don't take longer than a few milliseconds, but compute command
> submission can easily take multiple hours.
>
> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)
>
> I'm thinking on this problem for about a year now and going in circles
> for quite a while. So if you have ideas on this even if they sound
> totally crazy, feel free to come up.
Our GPUs (at least starting with VI) support compute-wave-save-restore
and can swap out compute queues with fairly low latency. Yes, there is
some overhead (both memory usage and time), but it's a fairly regular
thing with our hardware scheduler (firmware, actually) when we need to
preempt running compute queues to update runlists or we overcommit the
hardware queue resources.
Regards,
Felix
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-25 20:48 UTC (permalink / raw)
To: Felix Kuehling, Logan Gunthorpe, Christian König,
Jason Gunthorpe, Dan Williams
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <fa44af1c-0c8a-7726-5100-a1f74f824a21-5C7GfCeVMHo@public.gmane.org>
On 2016-11-25 03:26 PM, Felix Kuehling wrote:
> On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>>> A white list may end up being rather complicated if it has to cover
>>> different CPU generations and system architectures. I feel this is a
>>> decision user space could easily make.
>>>
>>> Logan
>> I agreed that it is better to leave up to user space to check what is
>> working
>> and what is not. I found that write is practically always working but
>> read very
>> often not. Also sometimes system BIOS update could fix the issue.
>>
> But is user mode always aware that P2P is going on or even possible? For
> example you may have a library reading a buffer from a file, but it
> doesn't necessarily know where that buffer is located (system memory,
> VRAM, ...) and it may not know what kind of the device the file is on
> (SATA drive, NVMe SSD, ...). The library will never know if all it gets
> is a pointer and a file descriptor.
>
> The library ends up calling a read system call. Then it would be up to
> the kernel to figure out the most efficient way to read the buffer from
> the file. If supported, it could use P2P between a GPU and NVMe where
> the NVMe device performs a DMA write to VRAM.
>
> If you put the burden of figuring out the P2P details on user mode code,
> I think it will severely limit the use cases that actually take
> advantage of it. You also risk a bunch of different implementations that
> get it wrong half the time on half the systems out there.
>
> Regards,
> Felix
>
>
I agreed in theory with you but I must admit that I do not know how
kernel could effectively collect all informations without running
pretty complicated tests each time on boot-up (if any configuration
changed including BIOS settings) and on pnp events. Also for efficient
way kernel needs to know performance results (and it could also
depends on clock / power mode) for read/write of each pair devices, for
double-buffering it needs to know / detect on which NUMA node
to allocate, etc. etc. Also device could be fully configured only
on the first request for access so it may be needed to change initialization
sequences.
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Christian König @ 2016-11-25 20:40 UTC (permalink / raw)
To: Jason Gunthorpe, Christian König
Cc: Haggai Eran, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Blinzer, Paul,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161125193252.GC16504-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Am 25.11.2016 um 20:32 schrieb Jason Gunthorpe:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
Yeah, we have a function to "import" anonymous pages from a CPU pointer
which works exactly that way as well.
We call this "userptr" and it's just a combination of get_user_pages()
on command submission and making sure the returned list of pages stays
valid using a MMU notifier.
The "big" problem with this approach is that it is horrible slow. I mean
seriously horrible slow so that we actually can't use it for some of the
purposes we wanted to use it.
> The code moving the page will move it and the next GPU command that
> needs it will refault it in the usual way, just like the CPU would.
And here comes the problem. CPU do this on a page by page basis, so they
fault only what needed and everything else gets filled in on demand.
This results that faulting a page is relatively light weight operation.
But for GPU command submission we don't know which pages might be
accessed beforehand, so what we do is walking all possible pages and
make sure all of them are present.
Now as far as I understand it the I/O subsystem for example assumes that
it can easily change the CPU page tables without much overhead. So for
example when a page can't modified it is temporary marked as readonly
AFAIK (you are probably way deeper into this than me, so please confirm).
That absolutely kills any performance for GPU command submissions. We
have use cases where we practically ended up playing ping/pong between
the GPU driver trying to grab the page with get_user_pages() and sombody
else in the kernel marking it readonly.
> This might be much more efficient since it optimizes for the common
> case of unchanging translation tables.
Yeah, completely agree. It works perfectly fine as long as you don't
have two drivers trying to mess with the same page.
> This assumes the commands are fairly short lived of course, the
> expectation of the mmu notifiers is that a flush is reasonably prompt
Correct, this is another problem. GFX command submissions usually don't
take longer than a few milliseconds, but compute command submission can
easily take multiple hours.
I can easily imagine what would happen when kswapd is blocked by a GPU
command submission for an hour or so while the system is under memory
pressure :)
I'm thinking on this problem for about a year now and going in circles
for quite a while. So if you have ideas on this even if they sound
totally crazy, feel free to come up.
Cheers,
Christian.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Felix Kuehling @ 2016-11-25 20:26 UTC (permalink / raw)
To: Serguei Sagalovitch, Logan Gunthorpe, Christian König,
Jason Gunthorpe, Dan Williams
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ea52d962-6c8e-92d9-eb1b-3ace4bf56126-5C7GfCeVMHo@public.gmane.org>
On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>
>> A white list may end up being rather complicated if it has to cover
>> different CPU generations and system architectures. I feel this is a
>> decision user space could easily make.
>>
>> Logan
> I agreed that it is better to leave up to user space to check what is
> working
> and what is not. I found that write is practically always working but
> read very
> often not. Also sometimes system BIOS update could fix the issue.
>
But is user mode always aware that P2P is going on or even possible? For
example you may have a library reading a buffer from a file, but it
doesn't necessarily know where that buffer is located (system memory,
VRAM, ...) and it may not know what kind of the device the file is on
(SATA drive, NVMe SSD, ...). The library will never know if all it gets
is a pointer and a file descriptor.
The library ends up calling a read system call. Then it would be up to
the kernel to figure out the most efficient way to read the buffer from
the file. If supported, it could use P2P between a GPU and NVMe where
the NVMe device performs a DMA write to VRAM.
If you put the burden of figuring out the P2P details on user mode code,
I think it will severely limit the use cases that actually take
advantage of it. You also risk a bunch of different implementations that
get it wrong half the time on half the systems out there.
Regards,
Felix
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-25 20:19 UTC (permalink / raw)
To: Serguei Sagalovitch
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Christian König,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <80aae3e6-278e-49af-7d09-bea87ffd19e8-5C7GfCeVMHo@public.gmane.org>
On Fri, Nov 25, 2016 at 02:49:50PM -0500, Serguei Sagalovitch wrote:
> GPU could perfectly access all VRAM. It is only issue for p2p without
> special interconnect and CPU access. Strictly speaking as long as we
> have "bus address" we could have RDMA but I agreed that for
> RDMA we could/should(?) always "request" CPU address (I hope that we
> could forget about 32-bit application :-)).
At least on x86 if you have a bus address you have a CPU address. All
RDMAable VRAM has to be visible in the BAR.
> BTW/FYI: About CPU access: Some user-level API is mainly handle based
> so there is no need for CPU access by default.
You mean no need for the memory to be virtually mapped into the
process?
Do you expect to RDMA from this kind of API? How will that work?
Jason
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-25 19:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Christian König,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161125193410.GD16504-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 2016-11-25 02:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not have CPU address at all - only GPU one.
> But you don't expect RDMA to work in the case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
GPU could perfectly access all VRAM. It is only issue for p2p without
special interconnect and CPU access. Strictly speaking as long as we
have "bus address" we could have RDMA but I agreed that for
RDMA we could/should(?) always "request" CPU address (I hope that we
could forget about 32-bit application :-)).
BTW/FYI: About CPU access: Some user-level API is mainly handle based
so there is no need for CPU access by default.
About "visible" / non-visible VRAM parts: I assume that going
forward we will be able to get rid from it completely as soon as support
for resizable PCI BAR will be implemented and/or old/current h/w
will become obsolete.
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-25 19:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Koenig, Christian,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161125075817.GA18428-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
On Thu, Nov 24, 2016 at 11:58:17PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> > * Regular DAX in the FS doesn't work at this time because the FS can
> > move the file you think your transfer to out from under you. Though I
> > understand there's been some work with XFS to solve that issue.
>
> The file system will never move anything under locked down pages,
> locking down pages is used exactly to protect against that.
.. And ODP style mmu notifiers work correctly as well, I'd assume.
So this should work with ZONE_DEVICE, if it doesn't it is a filesystem
bug?
> really want a notification to the consumer if the file systems wants
> to remove the mapping. We have implemented that using FL_LAYOUTS locks
> for NFSD, but only XFS supports it so far. Without that a long term
> locked down region of memory (e.g. a kernel MR) would prevent various
> file operations that would simply hang.
So you imagine a signal back to user space asking user space to drop
any RDMA MRS so the FS can relocate things?
Do we need that, or should we encourage people to use either short
lived MRs or ODP MRs when working with scenarios that need FS
relocation?
Jason
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-25 19:34 UTC (permalink / raw)
To: Serguei Sagalovitch
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Christian König,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <7ff3cf70-b0c3-028e-fea8-c370a1185b65-5C7GfCeVMHo@public.gmane.org>
On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
> b) Allocation may not have CPU address at all - only GPU one.
But you don't expect RDMA to work in the case, right?
GPU people need to stop doing this windowed memory stuff :)
Jason
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-25 19:32 UTC (permalink / raw)
To: Christian König
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <3f2d2db3-fb75-2422-2a18-a8497fd5d70e-5C7GfCeVMHo@public.gmane.org>
On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
> >Like you say below we have to handle short lived in the usual way, and
> >that covers basically every device except IB MRs, including the
> >command queue on a NVMe drive.
>
> Well a problem which wasn't mentioned so far is that while GPUs do have a
> page table to mirror the CPU page table, they usually can't recover from
> page faults.
> So what we do is making sure that all memory accessed by the GPU Jobs stays
> in place while those jobs run (pretty much the same pinning you do for the
> DMA).
Yes, it is DMA, so this is a valid approach.
But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).
Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.
The code moving the page will move it and the next GPU command that
needs it will refault it in the usual way, just like the CPU would.
This might be much more efficient since it optimizes for the common
case of unchanging translation tables.
This assumes the commands are fairly short lived of course, the
expectation of the mmu notifiers is that a flush is reasonably prompt
..
> >Serguei, what is your plan in GPU land for migration? Ie if I have a
> >CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> >- do you still allow the CPU to access it? Or do you swap it back to
> >cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way around
> most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading
> because that is slow, the GPU in turn can access it with full speed.
>
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders running on
> the GPU.
That makes sense to me, but the objection that came back for
non-cachable CPU mappings is that it basically breaks too much stuff
subtly, eg atomics, unaligned accesses, the CPU threading memory
model, all change on various architectures and break when caching is
disabled.
IMHO that is OK for specialty things like the GPU where the mmap comes
in via drm or something and apps know to handle that buffer specially.
But it is certainly not OK for DAX where the application is coded for
normal file open()/mmap() is not prepared for a mmap where (eg)
unaligned read accesses or atomics don't work depending on how the
filesystem is setup.
Which is why I think iopmem is still problematic..
At the very least I think a mmap flag or open flag should be needed to
opt into this behavior and by default non-cachebale DAX mmaps should
be paged into system ram when the CPU accesses them.
I'm hearing most people say ZONE_DEVICE is the way to handle this,
which means the missing remaing piece for RDMA is some kind of DMA
core support for p2p address translation..
Jason
^ permalink raw reply
* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-25 19:05 UTC (permalink / raw)
To: Vishwanathapura, Niranjana
Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161125021350.GA74946-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
On Thu, Nov 24, 2016 at 06:13:50PM -0800, Vishwanathapura, Niranjana wrote:
> In order to be truely device independent the hfi_vnic ULP should not depend
> on a device exported symbol. Instead device should register its functions
> with the ULP. Hence the approaches a) and b).
It is not device independent, it is hard linked to hfi1, just like our
other multi-component drivers.. So don't worry about that.
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-25 17:59 UTC (permalink / raw)
To: Logan Gunthorpe, Jason Gunthorpe
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Blinzer, Paul,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Sander, Ben, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Koenig, Christian,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <9cc22068-ede8-c1bc-5d8b-cf6224a7ce05-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> Well, I guess there's some consensus building to do. The existing
> options are:
>
> * Device DAX: which could work but the problem I see with it is that it
> only allows one application to do these transfers. Or there would have
> to be some user-space coordination to figure which application gets what
> memeroy.
About one application restriction: so it is per memory mapping? I assume
that
it should not be problem for one application to do transfer to the
several devices
simultaneously? Am I right?
May be we should follow RDMA MR design and register memory for p2p
transfer from user
space?
What about the following:
a) Device DAX is created
b) "Normal" (movable, etc.) allocation will be done for PCIe memory and
CPU pointer/access will
be requested.
c) p2p_mr_register() will be called and CPU pointer (mmap( on DAX
Device)) will be returned.
Accordingly such memory will be marked as "unmovable" by e.g. graphics
driver.
d) When p2p is not needed then p2p_mr_unregister() will be called.
What do you think? Will it work?
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-25 17:20 UTC (permalink / raw)
To: Logan Gunthorpe, Christian König, Jason Gunthorpe,
Dan Williams
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <209107c7-3098-ca70-7d62-b55021d01faa-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
> A white list may end up being rather complicated if it has to cover
> different CPU generations and system architectures. I feel this is a
> decision user space could easily make.
>
> Logan
I agreed that it is better to leave up to user space to check what is
working
and what is not. I found that write is practically always working but
read very
often not. Also sometimes system BIOS update could fix the issue.
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-25 17:16 UTC (permalink / raw)
To: Christian König, Jason Gunthorpe, Logan Gunthorpe
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <3f2d2db3-fb75-2422-2a18-a8497fd5d70e-5C7GfCeVMHo@public.gmane.org>
On 2016-11-25 08:22 AM, Christian König wrote:
>
>> Serguei, what is your plan in GPU land for migration? Ie if I have a
>> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
>> - do you still allow the CPU to access it? Or do you swap it back to
>> cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way
> around most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids
> reading because that is slow, the GPU in turn can access it with full
> speed.
>
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders
> running on the GPU.
I would like to add more in relation to CPU access :
a) we could have CPU-accessible part of VRAM ("inside" of PCIe BAR register)
and non-CPU accessible part. As the result if user needs to have
CPU access than memory should be located in CPU-accessible part
of VRAM or in system memory.
Application/user mode driver could specify preference/hints of
locations based on their assumption / knowledge about access
patterns requirements, game resolution, knowledge
about size of VRAM memory, etc. So if CPU access performance
is critical then such memory should be allocated in system memory
as the first (and may be only) choice.
b) Allocation may not have CPU address at all - only GPU one.
Also we may not be able to have CPU address/accesses for all VRAM
memory but memory may still be migrated in any case unrelated
if we have CPU address or not.
c) " VRAM, it becomes non-cachable "
Strictly speaking VRAM is configured as WC (write-combined memory) to
provide fast CPU write access. Also it was found that sometimes if CPU
access is not critical from performance perspective it may be useful
to allocate/program system memory also as WC to avoid needs for
extra "snooping" to synchronize with CPU caches during GPU access.
So potentially system memory could be WC too.
^ permalink raw reply
* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-25 16:45 UTC (permalink / raw)
To: Christian König, Jason Gunthorpe, Dan Williams
Cc: Haggai Eran, Bridgman, John,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
Kuehling, Felix, Serguei Sagalovitch,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
Blinzer, Paul, Suthikulpanit, Suravee,
linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Deucher, Alexander, Sander, Ben,
Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <c60815a1-aaac-52eb-1714-66abb28bdc01-5C7GfCeVMHo@public.gmane.org>
On 25/11/16 06:06 AM, Christian König wrote:
> Well Serguei send me a couple of documents about QPI when we started to
> discuss this internally as well and that's exactly one of the cases I
> had in mind when writing this.
>
> If I understood it correctly for such systems P2P is technical possible,
> but not necessary a good idea. Usually it is faster to just use a
> bouncing buffer when the peers are a bit "father" apart.
>
> That this problem is solved on newer hardware is good, but doesn't helps
> us at all if we at want to support at least systems from the last five
> years or so.
Well we have been testing with Sandy Bridge, I think the problem was
supposed to be fixed in Ivy but we never tested it so I can't say what
the performance turned out to be. Ivy is nearly 5 years old. I expect
this is something that will be improved more and more with subsequent
generations.
A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.
Logan
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox