From: Thomas Hellstrom
Subject: Re: [PATCH 10/13] drm/ttm: provide dma aware ttm page pool code V7
Date: Fri, 11 Nov 2011 09:06:37 +0100
Message-ID: <4EBCD78D.2010304@vmware.com>
References: <1320975417-13871-1-git-send-email-j.glisse@gmail.com> <1320975417-13871-11-git-send-email-j.glisse@gmail.com>
In-Reply-To: <1320975417-13871-11-git-send-email-j.glisse@gmail.com>
To: j.glisse@gmail.com
Cc: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

On 11/11/2011 02:36 AM, j.glisse@gmail.com wrote:
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

In the TTM world, the pages for the graphics drivers are kept in three
different pools: write combined, uncached, and cached (write-back). When
the pages are used by the graphics driver, the graphics adapter programs
these pages in via its built-in MMU (or AGP). The programming requires the
virtual address (from the graphics adapter's perspective) and the physical
address (either System RAM or the memory on the card), which is obtained
using the pci_map_* calls (which do the virtual to physical, or bus
address, translation). During the graphics application's "life" those
pages can be shuffled around, swapped out to disk, or moved from VRAM to
System RAM or vice-versa. This all works with the existing TTM pool code,
except when we want to use the software IOTLB (SWIOTLB) code to "map" the
physical addresses to the graphics adapter MMU. We end up programming the
bounce buffer's physical address instead of the TTM pool memory's and get
a non-working driver.
There are two solutions:
1) using the DMA API to allocate pages that are screened by the DMA API, or
2) using the pci_sync_* calls to copy the pages from the bounce-buffer and back.

This patch fixes the issue by allocating pages using the DMA API. The
second is a viable option, but it has performance drawbacks and potential
correctness issues: think of the write cache page being bounced
(SWIOTLB->TTM), the WC being set on the TTM page, and the copy from
SWIOTLB not making it to the TTM page until the page has been recycled in
the pool (and used by another application).

The bounce buffer does not get activated often: only in cases where we
have a 32-bit capable card and we want to use a page that is allocated
above the 4GB limit. The bounce buffer offers the solution of copying the
contents of that page to a location below 4GB and then back when the
operation has been completed (or vice-versa). This is done by using the
'pci_sync_*' calls.
Note: If you look carefully enough in the existing TTM page pool code you
will notice the GFP_DMA32 flag is used, which should guarantee that the
provided page is under 4GB. It certainly is the case, except this gets
ignored in two cases:
 - If the user specifies 'swiotlb=force', which bounces _every_ page.
 - If the user is running a Xen PV Linux guest (which uses the SWIOTLB, and
   the underlying PFNs aren't necessarily under 4GB).
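
As a rough illustration of the bounce conditions described above, here is
a minimal userspace sketch in C. It is not the kernel's SWIOTLB code: the
function name needs_bounce() and its parameters are made up for this
example. The Xen PV case is modelled only in the sense that the caller is
assumed to pass the real bus (machine) address, which may lie above 4GB
even when the guest's view of the page does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: decide whether a buffer would have
 * to go through a bounce buffer for a device with the given DMA mask.
 *
 * bus_addr: address as seen by the device (assumed to be the machine
 *           address under Xen PV)
 * size:     length of the buffer in bytes
 * dma_mask: highest address the device can reach, e.g. 0xFFFFFFFF for a
 *           32-bit capable card
 * force:    models booting with 'swiotlb=force'
 */
bool needs_bounce(uint64_t bus_addr, uint64_t size,
                  uint64_t dma_mask, bool force)
{
    assert(size > 0);

    if (force)
        return true;        /* every page is bounced */

    /* The last byte of the buffer must still be reachable by the device. */
    return bus_addr + size - 1 > dma_mask;
}
```

With a 32-bit mask, a page at 0x1000 is used directly, while a page at
0x100000000 (or any page when force is set) would be bounced.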

To avoid this extra copying, the other option is to allocate the pages
using the DMA API so that there is no need to map the page and perform the
expensive 'pci_sync_*' calls.

For this, the DMA API capable TTM pool requires the 'struct device' to
properly call the DMA API. It also has to track the virtual and bus
address of the page being handed out in case it ends up being swapped out
or de-allocated, to make sure it is de-allocated using the proper
'struct device'.

Implementation wise the code keeps two lists: one that is attached to the
'struct device' (via the dev->dma_pools list) and a global one to be used
when the 'struct device' is unavailable (think shrinker code). The global
list can iterate over all of the 'struct device' and its associated
dma_pool. The list in dev->dma_pools can only iterate the device's
dma_pool.
                                                            /[struct device_pool]\
        /---------------------------------------------------| dev                |
       /                                            +-------| dma_pool           |
 /-----+------\                                    /        \--------------------/
 |struct device|     /-->[struct dma_pool for WC]</         /[struct device_pool]\
 | dma_pools   +----+                                     /-| dev                |
 |  ...        |    \--->[struct dma_pool for uncached]<-/--| dma_pool           |
 \-----+------/                                         /   \--------------------/
        \----------------------------------------------/
[Two pools associated with the device (WC and UC), and the parallel list
containing the 'struct dev' and 'struct dma_pool' entries]
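
The two-list layout in the diagram can be modelled in plain userspace C
roughly as follows. This is only a sketch under assumptions: the kernel
code would use its own list primitives and locking, and every name here
(the stand-in 'struct device', pool_create(), count_dev_pools(),
count_global_pools()) is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct dma_pool;

/* Stand-in for the kernel's 'struct device'; only the pool list is modelled. */
struct device {
    struct dma_pool *dma_pools;      /* head of this device's pool list */
};

/* One pool (e.g. WC or uncached) belonging to exactly one device. */
struct dma_pool {
    const char      *name;
    struct device   *dev;            /* needed later to free via the right device */
    struct dma_pool *next;           /* next pool of the same device */
};

/* Entry in the global list: pairs a device with one of its pools. */
struct device_pool {
    struct device      *dev;
    struct dma_pool    *pool;
    struct device_pool *next;
};

/* Global list, usable when no 'struct device' is at hand (think shrinker). */
static struct device_pool *global_pools;

/* Create a pool and link it into both the per-device and the global list. */
struct dma_pool *pool_create(struct device *dev, const char *name)
{
    struct dma_pool *pool;
    struct device_pool *entry;

    assert(dev != NULL);

    pool = malloc(sizeof(*pool));
    entry = malloc(sizeof(*entry));
    if (!pool || !entry) {
        free(pool);
        free(entry);
        return NULL;
    }

    pool->name = name;
    pool->dev = dev;
    pool->next = dev->dma_pools;
    dev->dma_pools = pool;

    entry->dev = dev;
    entry->pool = pool;
    entry->next = global_pools;
    global_pools = entry;

    return pool;
}

/* Walk only the pools of one device. */
size_t count_dev_pools(const struct device *dev)
{
    size_t n = 0;
    const struct dma_pool *p;

    for (p = dev->dma_pools; p; p = p->next)
        n++;
    return n;
}

/* Walk every (device, pool) pair without needing a device pointer. */
size_t count_global_pools(void)
{
    size_t n = 0;
    const struct device_pool *e;

    for (e = global_pools; e; e = e->next)
        n++;
    return n;
}
```

Creating a WC and a UC pool for one device and a WC pool for a second
device gives per-device counts of two and one, while the global list
sees all three.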

The maximum number of DMA pools a device can have is six: write-combined,
uncached, and cached; then there are the DMA32 variants, which are:
write-combined dma32, uncached dma32, and cached dma32.
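
One way to picture the six pool types is as three caching attributes with
an optional DMA32 bit OR-ed on top. The sketch below assumes that
encoding; the enum values and the helpers pool_type_for() and
count_pool_types() are hypothetical names for this example and are not
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Caching attribute requested for a page (hypothetical names). */
enum caching { CACHING_WC, CACHING_UC, CACHING_CACHED };

/* Pool type bits: one caching bit, optionally combined with POOL_DMA32. */
enum pool_type {
    POOL_WC     = 1 << 0,
    POOL_UC     = 1 << 1,
    POOL_CACHED = 1 << 2,
    POOL_DMA32  = 1 << 3
};

/* Map a caching attribute plus the DMA32 requirement to a pool type. */
unsigned int pool_type_for(enum caching c, bool dma32)
{
    unsigned int type;

    switch (c) {
    case CACHING_WC:
        type = POOL_WC;
        break;
    case CACHING_UC:
        type = POOL_UC;
        break;
    default:
        type = POOL_CACHED;
        break;
    }

    if (dma32)
        type |= POOL_DMA32;
    return type;
}

/* Count how many distinct pool types the mapping above can produce. */
unsigned int count_pool_types(void)
{
    bool seen[16] = { false };
    unsigned int count = 0;
    int c, d;

    for (c = CACHING_WC; c <= CACHING_CACHED; c++) {
        for (d = 0; d < 2; d++) {
            unsigned int type = pool_type_for((enum caching)c, d != 0);

            assert(type < 16);
            if (!seen[type]) {
                seen[type] = true;
                count++;
            }
        }
    }
    return count;
}
```

Three caching attributes times two (with and without DMA32) yields the
six pool types listed above.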

Currently this code only gets activated when any variant of the SWIOTLB
IOMMU code is running (Intel without VT-d, AMD without GART, IBM Calgary
and Xen PV with PCI devices).

Tested-by: Michel Dänzer <michel@daenzer.net>
[v1: Using swiotlb_nr_tbl instead of swiotlb_enabled]
[v2: Major overhaul - added 'inuse_list' to separate used from inuse and
reorder the order of lists to get better performance.]
[v3: Added comments/and some logic based on review, Added Jerome tag]
[v4: rebase on top of ttm_tt & ttm_backend merge]
[v5: rebase on top of ttm memory accounting overhaul]
[v6: New rebase on top of more memory accounting changes]
[v7: well rebase on top of no memory accounting changes]
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
  
Acked-by: Thomas Hellstrom <thellstrom@vmware.com>