CONFIG_DMA_CMA causes ttm performance problems/hangs.

From: Mario Kleiner <mario.kleiner.de@gmail.com>
To: "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
Cc: Thomas Hellstrom <thellstrom@vmware.com>,
	kamal@canonical.com, LKML <linux-kernel@vger.kernel.org>,
	ben@decadent.org.uk, m.szyprowski@samsung.com
Subject: CONFIG_DMA_CMA causes ttm performance problems/hangs.
Date: Fri, 08 Aug 2014 19:42:51 +0200
Message-ID: <53E50C1B.9080507@gmail.com>

Hi all,

There is a rather severe performance problem I accidentally found while
giving Linux 3.16.0 a final test on an x86_64 MacBookPro under
Ubuntu 14.04 LTS with nouveau as the graphics driver.

I was lazy and just installed the Ubuntu precompiled mainline kernel. 
That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA 
(contiguous memory allocator) size of 64 MB. Older Ubuntu kernels 
weren't compiled with CMA, so I only observed this on 3.16, but earlier
kernels built with CMA would likely be affected too.

After a few minutes of regular desktop use (switching workspaces,
scrolling text in a terminal window, Firefox with multiple tabs open,
Thunderbird, etc.; tested with KDE/KWin, with and without desktop
composition), I get chunky desktop updates, then multi-second freezes.
A few minutes later the desktop hangs for over a minute on almost any
GUI action, such as switching windows. --> Unusable.

ftrace'ing shows the culprit being this callchain (typical good/bad 
example ftrace snippets at the end of this mail):

...ttm dma coherent memory allocations, e.g., from
__ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
specific hooks ... --> dma_generic_alloc_coherent() [on x86_64] -->
dma_alloc_from_contiguous()

dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or when
the machine is booted with the kernel cmdline parameter "cma=0". In that
case dma_generic_alloc_coherent() falls through to the fast
alloc_pages_node() path, at least on x86_64.
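
The fallback order described above can be sketched as a small userspace C
model. All model_* names are hypothetical stand-ins, not the kernel
functions or their real signatures, and malloc() merely stands in for both
allocators; the only point is the order of the two attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* Which allocator served the request in this model. */
enum model_source { SRC_NONE, SRC_CMA, SRC_PAGE_ALLOCATOR };

/* true models CONFIG_DMA_CMA=y with a non-zero area; false models "cma=0". */
bool model_cma_enabled;

/* Stand-in for dma_alloc_from_contiguous(): returns NULL at once when CMA is off. */
static void *model_alloc_from_contiguous(size_t pages)
{
    if (!model_cma_enabled)
        return NULL;
    return malloc(pages * MODEL_PAGE_SIZE);
}

/* Stand-in for the alloc_pages_node() fallback. */
static void *model_alloc_pages_node(size_t pages)
{
    return malloc(pages * MODEL_PAGE_SIZE);
}

/* Stand-in for dma_generic_alloc_coherent(): CMA first, then the fallback. */
void *model_generic_alloc_coherent(size_t pages, enum model_source *src)
{
    void *mem;

    assert(src != NULL);
    mem = model_alloc_from_contiguous(pages);
    if (mem) {
        *src = SRC_CMA;
        return mem;
    }
    mem = model_alloc_pages_node(pages);
    *src = mem ? SRC_PAGE_ALLOCATOR : SRC_NONE;
    return mem;
}
```

With model_cma_enabled left false, every request is reported as
SRC_PAGE_ALLOCATOR, which corresponds to the fast path seen with "cma=0".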

With CMA, this function becomes progressively slower with every
minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
hundreds or thousands of usecs per call (before it gives up and the
alloc_pages_node() fallback is used), and this causes the
multi-second/minute hangs of the desktop.

So it seems ttm memory allocations quickly fragment and/or exhaust the 
CMA memory area, and dma_alloc_from_contiguous() tries very hard to find 
a fitting hole big enough to satisfy allocations with a retry loop (see 
http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339) 
that takes forever.

This is bad not only for ttm, but also for other devices which actually
need a non-fragmented CMA area for DMA, so what to do? I doubt most
current GPUs still need physically contiguous DMA memory, maybe with the
exception of some embedded GPUs?

My naive approach would be to add a new gfp_t flag a la ___GFP_AVOIDCMA, 
and make callers of dma_alloc_from_contiguous() refrain from doing so if 
they have some fallback for getting memory. And then add that flag to 
ttm's ttm_dma_populate() gfp_flags, e.g., around here: 
http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
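
The decision such a flag would encode can be sketched in plain C. This is
only a model of the proposal: MODEL_GFP_AVOIDCMA, model_gfp_t and the
functions below are hypothetical names invented for illustration, and no
such flag exists in the kernel discussed here.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_gfp_t;

/* Hypothetical bit standing in for the proposed ___GFP_AVOIDCMA. */
#define MODEL_GFP_AVOIDCMA (1u << 0)

/* A ttm-like caller would OR the bit into the flags it already passes. */
model_gfp_t model_ttm_gfp_flags(model_gfp_t base)
{
    return base | MODEL_GFP_AVOIDCMA;
}

/* True when the caller should try the CMA area before anything else. */
bool model_should_try_cma(model_gfp_t flags, bool caller_has_fallback)
{
    if ((flags & MODEL_GFP_AVOIDCMA) && caller_has_fallback)
        return false;
    return true;
}

/* Usage: decisions for a flagged and an unflagged request. */
void model_demo(void)
{
    model_gfp_t ttm_flags = model_ttm_gfp_flags(0);

    assert(!model_should_try_cma(ttm_flags, true));  /* skip CMA, use fallback */
    assert(model_should_try_cma(ttm_flags, false));  /* no fallback: try CMA */
    assert(model_should_try_cma(0, true));           /* unflagged: unchanged */
}
```

Requests without the flag behave exactly as before, so devices that really
need contiguous memory would be unaffected in this model.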

However, I'm not familiar enough with memory management, so likely
greater minds here have much better ideas on how to deal with this?

thanks,
-mario

Typical snippet from an example trace of a badly stalling desktop with
CMA (the alloc_pages_node() fallback may be missing from this trace
because of its ftrace_filter settings):

1)               |                          ttm_dma_pool_get_pages [ttm]() {
1)               |                            ttm_dma_page_pool_fill_locked [ttm]() {
1)               |                              ttm_dma_pool_alloc_new_pages [ttm]() {
1)               |                                __ttm_dma_alloc_page [ttm]() {
1)               |                                  dma_generic_alloc_coherent() {
1) ! 1873.071 us |                                    dma_alloc_from_contiguous();
1) ! 1874.292 us |                                  }
1) ! 1875.400 us |                                }
1)               |                                __ttm_dma_alloc_page [ttm]() {
1)               |                                  dma_generic_alloc_coherent() {
1) ! 1868.372 us |                                    dma_alloc_from_contiguous();
1) ! 1869.586 us |                                  }
1) ! 1870.053 us |                                }
1)               |                                __ttm_dma_alloc_page [ttm]() {
1)               |                                  dma_generic_alloc_coherent() {
1) ! 1871.085 us |                                    dma_alloc_from_contiguous();
1) ! 1872.240 us |                                  }
1) ! 1872.669 us |                                }
1)               |                                __ttm_dma_alloc_page [ttm]() {
1)               |                                  dma_generic_alloc_coherent() {
1) ! 1888.934 us |                                    dma_alloc_from_contiguous();
1) ! 1890.179 us |                                  }
1) ! 1890.608 us |                                }
1)   0.048 us    |                                ttm_set_pages_caching [ttm]();
1) ! 7511.000 us |                              }
1) ! 7511.306 us |                            }
1) ! 7511.623 us |                          }

The good case (with cma=0 on the kernel cmdline, so
dma_alloc_from_contiguous() is a no-op):

0)               |                          ttm_dma_pool_get_pages [ttm]() {
0)               |                            ttm_dma_page_pool_fill_locked [ttm]() {
0)               |                              ttm_dma_pool_alloc_new_pages [ttm]() {
0)               |                                __ttm_dma_alloc_page [ttm]() {
0)               |                                  dma_generic_alloc_coherent() {
0)   0.171 us    |                                    dma_alloc_from_contiguous();
0)   0.849 us    |                                    __alloc_pages_nodemask();
0)   3.029 us    |                                  }
0)   3.882 us    |                                }
0)               |                                __ttm_dma_alloc_page [ttm]() {
0)               |                                  dma_generic_alloc_coherent() {
0)   0.037 us    |                                    dma_alloc_from_contiguous();
0)   0.163 us    |                                    __alloc_pages_nodemask();
0)   1.408 us    |                                  }
0)   1.719 us    |                                }
0)               |                                __ttm_dma_alloc_page [ttm]() {
0)               |                                  dma_generic_alloc_coherent() {
0)   0.035 us    |                                    dma_alloc_from_contiguous();
0)   0.153 us    |                                    __alloc_pages_nodemask();
0)   1.454 us    |                                  }
0)   1.720 us    |                                }
0)               |                                __ttm_dma_alloc_page [ttm]() {
0)               |                                  dma_generic_alloc_coherent() {
0)   0.036 us    |                                    dma_alloc_from_contiguous();
0)   0.112 us    |                                    __alloc_pages_nodemask();
0)   1.211 us    |                                  }
0)   1.541 us    |                                }
0)   0.035 us    |                                ttm_set_pages_caching [ttm]();
0) + 10.902 us   |                              }
0) + 11.577 us   |                            }
0) + 11.988 us   |                          }
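
To put the two traces side by side, the per-call __ttm_dma_alloc_page()
durations can be summed and compared. The numbers in this small C sketch
are copied from the traces above; the function and array names are made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Per-call __ttm_dma_alloc_page() durations in usecs, from the traces. */
const double bad_alloc_us[]  = { 1875.400, 1870.053, 1872.669, 1890.608 };
const double good_alloc_us[] = { 3.882, 1.719, 1.720, 1.541 };

/* Sum of n durations. */
double sum_us(const double *v, size_t n)
{
    double total = 0.0;

    assert(v != NULL);
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Ratio of time spent in the four allocations: CMA case vs. cma=0 case. */
double slowdown_factor(void)
{
    double bad = sum_us(bad_alloc_us, 4);
    double good = sum_us(good_alloc_us, 4);

    return bad / good;
}
```

For these two samples the four allocations take about 7508.7 usecs with CMA
versus about 8.9 usecs with cma=0, a slowdown on the order of 850x.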
