From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Robin Murphy <robin.murphy@arm.com>,
"iommu@lists.linux-foundation.org"
<iommu@lists.linux-foundation.org>, "hch@lst.de" <hch@lst.de>,
"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>,
"will@kernel.org" <will@kernel.org>,
Linuxarm <linuxarm@huawei.com>,
"linux-kselftest@vger.kernel.org"
<linux-kselftest@vger.kernel.org>
Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Date: Thu, 29 Oct 2020 21:39:01 +0000 [thread overview]
Message-ID: <8fe90795064d4373b4af32959c4e9781@hisilicon.com> (raw)
In-Reply-To: <f9aebee4-5081-da8c-930c-c36617ab57c4@arm.com>
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy@arm.com]
> Sent: Friday, October 30, 2020 8:38 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com
> Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm
> <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
>
> On 2020-10-27 03:53, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap particually while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with
> the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulity for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have different backend of IOMMU or
> > non-IOMMU.
> >
> > So we use the driver_override to bind dma_map_benchmark to a particual
> > device by:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For this moment, it supports platform device only, PCI device will also
> > be supported afterwards.
>
> Neat! This is something I've thought about many times, but never got
> round to attempting :)
I am happy you have the same needs. When I came to IOMMU area a half year ago,
the first thing I've done was writing a rough benchmark. At that time, I hacked kernel
to get a device behind an IOMMU.
Recently, I got some time to think about how to get "device" without ugly hacking and
then clean up code for sending patches out to provide a common benchmark in order
that everybody can use.
>
> I think the basic latency measurement for mapping and unmapping pages is
> enough to start with, but there are definitely some more things that
> would be interesting to look into for future enhancements:
>
> - a choice of mapping sizes, both smaller and larger than one page, to
> help characterise stuff like cache maintenance overhead and bounce
> buffer/IOVA fragmentation.
> - alternative allocation patterns like doing lots of maps first, then
> all their corresponding unmaps (to provoke things like the worst-case
> IOVA rcache behaviour).
> - ways to exercise a range of those parameters at once across
> different threads in a single test.
>
Yes, sure. Once we have a basic framework, we can add more benchmark patterns
by using different parameters in the userspace tool:
testing/selftests/dma/dma_map_benchmark.c
Similar function extensions have been carried out in GUP_BENCHMARK.
> But let's get a basic framework nailed down first...
Sure.
>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> > kernel/dma/Kconfig | 8 ++
> > kernel/dma/Makefile | 1 +
> > kernel/dma/map_benchmark.c | 202
> +++++++++++++++++++++++++++++++++++++
> > 3 files changed, 211 insertions(+)
> > create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> > is technically out-of-spec.
> >
> > If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > + bool "Enable benchmarking of streaming DMA mapping"
> > + help
> > + Provides /sys/kernel/debug/dma_map_benchmark that helps with
> testing
> > + performance of dma_(un)map_page.
> > +
> > + See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile
> > @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG) += debug.o
> > obj-$(CONFIG_SWIOTLB) += swiotlb.o
> > obj-$(CONFIG_DMA_COHERENT_POOL) += pool.o
> > obj-$(CONFIG_DMA_REMAP) += remap.o
> > +obj-$(CONFIG_DMA_MAP_BENCHMARK) += map_benchmark.o
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > new file mode 100644
> > index 000000000000..16a5d7779d67
> > --- /dev/null
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -0,0 +1,202 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Hisilicon Limited.
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/kthread.h>
> > +#include <linux/slab.h>
> > +#include <linux/debugfs.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/timekeeping.h>
> > +#include <linux/delay.h>
> > +#include <linux/platform_device.h>
>
> Nit: alphabetical order is always nice, when there's not an existing
> precedent of a complete mess...
>
> > +
> > +#define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> > +
> > +struct map_benchmark {
> > + __u64 map_nsec;
> > + __u64 unmap_nsec;
> > + __u32 threads; /* how many threads will do map/unmap in parallel */
> > + __u32 seconds; /* how long the test will last */
> > + int node; /* which numa node this benchmark will run on */
> > + __u64 expansion[10]; /* For future use */
> > +};
>
> I'm no expert on userspace ABIs (and what little experience I do have is
> mostly of Win32...), so hopefully someone else will comment if there's
> anything of concern here. One thing I wonder is that there's a fair
> likelihood of functionality evolving here over time, so might it be
> appropriate to have some sort of explicit versioning parameter for
> robustness?
I copied that from gup_benchmark. There is no this kind of code to
compare version.
I believe there is a likelihood that kernel module is changed but users
are still using old userspace tool, this might lead to the incompatible
data structure.
But not sure if it is a big problem :-)
>
> > +struct map_benchmark_data {
> > + struct map_benchmark bparam;
> > + struct device *dev;
> > + struct dentry *debugfs;
> > + atomic64_t total_map_nsecs;
> > + atomic64_t total_map_loops;
> > + atomic64_t total_unmap_nsecs;
> > + atomic64_t total_unmap_loops;
> > +};
> > +
> > +static int map_benchmark_thread(void *data)
> > +{
> > + struct page *page;
> > + dma_addr_t dma_addr;
> > + struct map_benchmark_data *map = data;
> > + int ret = 0;
> > +
> > + page = alloc_page(GFP_KERNEL);
> > + if (!page)
> > + return -ENOMEM;
> > +
> > + while (!kthread_should_stop()) {
> > + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> > +
> > + map_stime = ktime_get();
> > + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> DMA_BIDIRECTIONAL);
>
> Note that for a non-coherent device, this will give an underestimate of
> the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings, since
> the page will never be dirty in the cache (except possibly the very
> first time through).
Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1" after we have
this basic framework.
>
> > + if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> > + dev_err(map->dev, "dma_map_page failed\n");
> > + ret = -ENOMEM;
> > + goto out;
> > + }
> > + map_etime = ktime_get();
> > +
> > + unmap_stime = ktime_get();
> > + dma_unmap_single(map->dev, dma_addr, PAGE_SIZE,
> DMA_BIDIRECTIONAL);
>
> Ahem, dma_map_page() pairs with dma_unmap_page(). Unfortunately
> DMA_API_DEBUG is unable to shout at you for that...
This is a typo. At the beginning, we used dma_map_single and dma_unmap_single.
After that, I changed to alloc_page so moved to dma_map_page. But I missed
the unmap part.
>
> However, maybe changing the other end to dma_map_single() might make
> more sense, since you're not allocating highmem pages or anything wacky,
> and that'll be easier to expand in future.
>
Yes. I can either change both to map_page/unmap_page or both to map_single/
unmap_single.
> > + unmap_etime = ktime_get();
> > +
> > + atomic64_add((long long)ktime_to_ns(ktime_sub(map_etime,
> map_stime)),
> > + &map->total_map_nsecs);
>
> ktime_to_ns() returns s64, which is already long long.
Got it.
>
> > + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
> unmap_stime)),
> > + &map->total_unmap_nsecs);
> > + atomic64_inc(&map->total_map_loops);
> > + atomic64_inc(&map->total_unmap_loops);
>
> I think it would be worth keeping track of the variances as well - it
> can be hard to tell if a reasonable-looking average is hiding terrible
> worst-case behaviour.
This is a sensible requirement. I believe it is better to be handled by the
existing kernel tracing method.
Maybe we need a histogram like:
Delay sample count
1-2us 1000 ***
2-3us 2000 *******
3-4us 100 *
.....
This will be more precise than the maximum latency in the worst
case.
I'd believe this can be handled by:
tracepoint A
Map
Tracepoint B
Tracepoint C
Unmap
Tracepoint D
Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
So I am planning to put this requirement into todo list and write an userspace
ebpf/bcc script for histogram and put in tools/ directory.
Please give your comments on this.
>
> > + }
> > +
> > +out:
> > + __free_page(page);
> > + return ret;
> > +}
> > +
> > +static int do_map_benchmark(struct map_benchmark_data *map)
> > +{
> > + struct task_struct **tsk;
> > + int threads = map->bparam.threads;
>
> I know it's debugfs, but maybe a bit of parameter validation wouldn't go
> amiss? I'm already imaginging that sinking feeling when the SSH
> connection stops responding and you realise you've just inadvertently
> launched INT_MAX threads...
Agreed.
>
> > + int node = map->bparam.node;
> > + const cpumask_t *cpu_mask = cpumask_of_node(node);
> > + int ret = 0;
> > + int i;
> > +
> > + tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
> > + if (!tsk)
> > + return -ENOMEM;
> > +
> > + get_device(map->dev);
> > +
> > + for (i = 0; i < threads; i++) {
> > + tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
> > + map->bparam.node, "dma-map-benchmark/%d", i);
> > + if (IS_ERR(tsk[i])) {
> > + dev_err(map->dev, "create dma_map thread failed\n");
> > + return PTR_ERR(tsk[i]);
> > + }
> > +
> > + if (node != NUMA_NO_NODE && node_online(node))
> > + kthread_bind_mask(tsk[i], cpu_mask);
> > +
> > + wake_up_process(tsk[i]);
>
> Might it be better to create all the threads first, *then* start kicking
> them?
The difficulty is that we don't know how many threads we should create as
the thread number is a parameter to test the contention of IOMMU driver.
In my test case, I'd like to test things like
One thread
Two threads
....
8 threads
12 threads
16 threads...
On the other hand, I think it is better to drop the memory of task_struct of those
test threads while we are not testing dma map.
>
> > + }
> > +
> > + ssleep(map->bparam.seconds);
> > +
> > + /* wait for the completion of benchmark threads */
> > + for (i = 0; i < threads; i++) {
> > + ret = kthread_stop(tsk[i]);
> > + if (ret)
> > + goto out;
> > + }
> > +
> > + /* average map nsec and unmap nsec */
> > + map->bparam.map_nsec = atomic64_read(&map->total_map_nsecs) /
> > + atomic64_read(&map->total_map_loops);
> > + map->bparam.unmap_nsec = atomic64_read(&map->total_unmap_nsecs)
> /
> > + atomic64_read(&map->total_unmap_loops);
> > +
> > +out:
> > + put_device(map->dev);
> > + kfree(tsk);
> > + return ret;
> > +}
> > +
> > +static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
> > + unsigned long arg)
> > +{
> > + struct map_benchmark_data *map = filep->private_data;
> > + int ret;
> > +
> > + if (copy_from_user(&map->bparam, (void __user *)arg,
> sizeof(map->bparam)))
> > + return -EFAULT;
> > +
> > + switch (cmd) {
> > + case DMA_MAP_BENCHMARK:
> > + ret = do_map_benchmark(map);
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + if (copy_to_user((void __user *)arg, &map->bparam,
> sizeof(map->bparam)))
> > + return -EFAULT;
> > +
> > + return ret;
> > +}
> > +
> > +static const struct file_operations map_benchmark_fops = {
> > + .open = simple_open,
> > + .unlocked_ioctl = map_benchmark_ioctl,
> > +};
> > +
> > +static int map_benchmark_probe(struct platform_device *pdev)
> > +{
> > + struct dentry *entry;
> > + struct map_benchmark_data *map;
> > +
> > + map = devm_kzalloc(&pdev->dev, sizeof(*map), GFP_KERNEL);
> > + if (!map)
> > + return -ENOMEM;
> > +
> > + map->dev = &pdev->dev;
> > + platform_set_drvdata(pdev, map);
> > +
> > + /*
> > + * we only permit a device bound with this driver, 2nd probe
> > + * will fail
> > + */
> > + entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
> > + &map_benchmark_fops);
> > + if (IS_ERR(entry))
> > + return PTR_ERR(entry);
> > + map->debugfs = entry;
> > +
> > + return 0;
> > +}
> > +
> > +static int map_benchmark_remove(struct platform_device *pdev)
> > +{
> > + struct map_benchmark_data *map = platform_get_drvdata(pdev);
> > +
> > + debugfs_remove(map->debugfs);
>
> Consider a trivial 3-line wrapper plus a devm_add_action() call ;)
>
Yes, I will.
> Robin.
>
> > +
> > + return 0;
> > +}
> > +
> > +static struct platform_driver map_benchmark_driver = {
> > + .driver = {
> > + .name = "dma_map_benchmark",
> > + },
> > + .probe = map_benchmark_probe,
> > + .remove = map_benchmark_remove,
> > +};
> > +
> > +module_platform_driver(map_benchmark_driver);
> > +
> > +MODULE_AUTHOR("Barry Song <song.bao.hua@hisilicon.com>");
> > +MODULE_DESCRIPTION("dma_map benchmark driver");
> > +MODULE_LICENSE("GPL");
Thanks
Barry
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
next prev parent reply other threads:[~2020-10-29 21:39 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-27 3:53 [PATCH 0/2] dma-mapping: provide a benchmark for streaming DMA mapping Barry Song
2020-10-27 3:53 ` [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs Barry Song
2020-10-29 19:38 ` Robin Murphy
2020-10-29 21:39 ` Song Bao Hua (Barry Song) [this message]
2020-10-30 15:48 ` Robin Murphy
2020-10-31 9:44 ` Song Bao Hua (Barry Song)
2020-10-31 21:42 ` Song Bao Hua (Barry Song)
2020-10-27 3:53 ` [PATCH 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK Barry Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8fe90795064d4373b4af32959c4e9781@hisilicon.com \
--to=song.bao.hua@hisilicon.com \
--cc=hch@lst.de \
--cc=iommu@lists.linux-foundation.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=m.szyprowski@samsung.com \
--cc=robin.murphy@arm.com \
--cc=shuah@kernel.org \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox