From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Robin Murphy <robin.murphy@arm.com>,
"iommu@lists.linux-foundation.org"
<iommu@lists.linux-foundation.org>, "hch@lst.de" <hch@lst.de>,
"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>,
"will@kernel.org" <will@kernel.org>,
Linuxarm <linuxarm@huawei.com>,
"linux-kselftest@vger.kernel.org"
<linux-kselftest@vger.kernel.org>
Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Date: Sat, 31 Oct 2020 09:44:58 +0000 [thread overview]
Message-ID: <ebf48b2c57384f37aaaeb02ea0d3beff@hisilicon.com> (raw)
In-Reply-To: <472cf21a-5196-dbb5-caef-c1c0d982fe1c@arm.com>
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com
> Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm
> <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
>
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel
> */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10]; /* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no this kind of code to
> > compare version.
> > I believe there is a likelihood that kernel module is changed but
> > users are still using old userspace tool, this might lead to the
> > incompatible data structure.
> > But not sure if it is a big problem :-)
>
> Yeah, like I say I don't really have a good feeling for what would be best here,
> I'm just thinking of what I do know and wary of the potential for a "640 bits
> ought to be enough for anyone" issue ;)
>
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop()) {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> >> DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
>
> That wasn't so much about the direction itself, just that if it's anything other
> than FROM_DEVICE, we should probably do something to dirty the buffer by a
> reasonable amount before each map. Otherwise the measured performance is
> going to be unrealistic on many systems.
Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?
>
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
> >> unmap_stime)),
> >>> + &map->total_unmap_nsecs);
> >>> + atomic64_inc(&map->total_map_loops);
> >>> + atomic64_inc(&map->total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay sample count
> > 1-2us 1000 ***
> > 2-3us 2000 *******
> > 3-4us 100 *
> > .....
> > This will be more precise than the maximum latency in the worst case.
> >
> > I'd believe this can be handled by:
> > tracepoint A
> > Map
> > Tracepoint B
> >
> > Tracepoint C
> > Unmap
> > Tracepoint D
> >
> > Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
> >
> > So I am planning to put this requirement into todo list and write an
> > userspace ebpf/bcc script for histogram and put in tools/ directory.
> >
> > Please give your comments on this.
>
> Right, I wasn't suggesting trying to homebrew a full data collection system here
> - I agree there are better tools for that already - just that it's basically free to
> track a sum of squares alongside a sum, so that we can trivially calculate a
> useful variance (or standard
> deviation) figure alongside the mean at the end.
For this case, I am not sure if it is true. Unless we expose more data such as
min, max etc. to userspace, it makes no difference whether total_(un)map_nsecs
and total_(un)map_loops are exposed or not.
As
total loops = seconds / (avg_map_latency + avg_unmap_latency);
total_map_nsecs = total loop count * avg_map_latency
total_unmap_nsecs = total loop count * avg_unmap_latency
all of seconds, avg_unmap_latency, avg_unmap_latency are known by
userspace tool.
>
> [...]
> >>> + for (i = 0; i < threads; i++) {
> >>> + tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
> >>> + map->bparam.node, "dma-map-benchmark/%d", i);
> >>> + if (IS_ERR(tsk[i])) {
> >>> + dev_err(map->dev, "create dma_map thread failed\n");
> >>> + return PTR_ERR(tsk[i]);
> >>> + }
> >>> +
> >>> + if (node != NUMA_NO_NODE && node_online(node))
> >>> + kthread_bind_mask(tsk[i], cpu_mask);
> >>> +
> >>> + wake_up_process(tsk[i]);
> >>
> >> Might it be better to create all the threads first, *then* start
> >> kicking them?
> >
> > The difficulty is that we don't know how many threads we should create
> > as the thread number is a parameter to test the contention of IOMMU driver.
> > In my test case, I'd like to test things like One thread Two threads
> > ....
> > 8 threads
> > 12 threads
> > 16 threads...
> >
> > On the other hand, I think it is better to drop the memory of
> > task_struct of those test threads while we are not testing dma map.
>
> I simply meant splitting the loop here into two - one to create the threads and
> set their affinity, then another to wake them all up - so we don't start
> unnecessarily thrashing the system while we're still trying to set up the rest of
> the test ;)
Agreed.
>
> Robin.
Thanks
Barry
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
WARNING: multiple messages have this Message-ID (diff)
From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Robin Murphy <robin.murphy@arm.com>,
"iommu@lists.linux-foundation.org"
<iommu@lists.linux-foundation.org>, "hch@lst.de" <hch@lst.de>,
"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>
Cc: "joro@8bytes.org" <joro@8bytes.org>,
"will@kernel.org" <will@kernel.org>,
"shuah@kernel.org" <shuah@kernel.org>,
Linuxarm <linuxarm@huawei.com>,
"linux-kselftest@vger.kernel.org"
<linux-kselftest@vger.kernel.org>
Subject: RE: [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs
Date: Sat, 31 Oct 2020 09:44:58 +0000 [thread overview]
Message-ID: <ebf48b2c57384f37aaaeb02ea0d3beff@hisilicon.com> (raw)
In-Reply-To: <472cf21a-5196-dbb5-caef-c1c0d982fe1c@arm.com>
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy@arm.com]
> Sent: Saturday, October 31, 2020 4:48 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; m.szyprowski@samsung.com
> Cc: joro@8bytes.org; will@kernel.org; shuah@kernel.org; Linuxarm
> <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org
> Subject: Re: [PATCH 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
>
> On 2020-10-29 21:39, Song Bao Hua (Barry Song) wrote:
> [...]
> >>> +struct map_benchmark {
> >>> + __u64 map_nsec;
> >>> + __u64 unmap_nsec;
> >>> + __u32 threads; /* how many threads will do map/unmap in parallel
> */
> >>> + __u32 seconds; /* how long the test will last */
> >>> + int node; /* which numa node this benchmark will run on */
> >>> + __u64 expansion[10]; /* For future use */
> >>> +};
> >>
> >> I'm no expert on userspace ABIs (and what little experience I do have
> >> is mostly of Win32...), so hopefully someone else will comment if
> >> there's anything of concern here. One thing I wonder is that there's
> >> a fair likelihood of functionality evolving here over time, so might
> >> it be appropriate to have some sort of explicit versioning parameter
> >> for robustness?
> >
> > I copied that from gup_benchmark. There is no this kind of code to
> > compare version.
> > I believe there is a likelihood that kernel module is changed but
> > users are still using old userspace tool, this might lead to the
> > incompatible data structure.
> > But not sure if it is a big problem :-)
>
> Yeah, like I say I don't really have a good feeling for what would be best here,
> I'm just thinking of what I do know and wary of the potential for a "640 bits
> ought to be enough for anyone" issue ;)
>
> >>> +struct map_benchmark_data {
> >>> + struct map_benchmark bparam;
> >>> + struct device *dev;
> >>> + struct dentry *debugfs;
> >>> + atomic64_t total_map_nsecs;
> >>> + atomic64_t total_map_loops;
> >>> + atomic64_t total_unmap_nsecs;
> >>> + atomic64_t total_unmap_loops;
> >>> +};
> >>> +
> >>> +static int map_benchmark_thread(void *data) {
> >>> + struct page *page;
> >>> + dma_addr_t dma_addr;
> >>> + struct map_benchmark_data *map = data;
> >>> + int ret = 0;
> >>> +
> >>> + page = alloc_page(GFP_KERNEL);
> >>> + if (!page)
> >>> + return -ENOMEM;
> >>> +
> >>> + while (!kthread_should_stop()) {
> >>> + ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> >>> +
> >>> + map_stime = ktime_get();
> >>> + dma_addr = dma_map_page(map->dev, page, 0, PAGE_SIZE,
> >> DMA_BIDIRECTIONAL);
> >>
> >> Note that for a non-coherent device, this will give an underestimate
> >> of the real-world overhead of BIDIRECTIONAL or TO_DEVICE mappings,
> >> since the page will never be dirty in the cache (except possibly the
> >> very first time through).
> >
> > Agreed. I'd like to add a DIRECTION parameter like "-d 0", "-d 1"
> > after we have this basic framework.
>
> That wasn't so much about the direction itself, just that if it's anything other
> than FROM_DEVICE, we should probably do something to dirty the buffer by a
> reasonable amount before each map. Otherwise the measured performance is
> going to be unrealistic on many systems.
Maybe put a memset(buf, 0, PAGE_SIZE) before dma_map will help ?
>
> [...]
> >>> + atomic64_add((long long)ktime_to_ns(ktime_sub(unmap_etime,
> >> unmap_stime)),
> >>> + &map->total_unmap_nsecs);
> >>> + atomic64_inc(&map->total_map_loops);
> >>> + atomic64_inc(&map->total_unmap_loops);
> >>
> >> I think it would be worth keeping track of the variances as well - it
> >> can be hard to tell if a reasonable-looking average is hiding
> >> terrible worst-case behaviour.
> >
> > This is a sensible requirement. I believe it is better to be handled
> > by the existing kernel tracing method.
> >
> > Maybe we need a histogram like:
> > Delay sample count
> > 1-2us 1000 ***
> > 2-3us 2000 *******
> > 3-4us 100 *
> > .....
> > This will be more precise than the maximum latency in the worst case.
> >
> > I'd believe this can be handled by:
> > tracepoint A
> > Map
> > Tracepoint B
> >
> > Tracepoint C
> > Unmap
> > Tracepoint D
> >
> > Let the userspace ebpf to draw the histogram for the delta of B-A and D-C.
> >
> > So I am planning to put this requirement into todo list and write an
> > userspace ebpf/bcc script for histogram and put in tools/ directory.
> >
> > Please give your comments on this.
>
> Right, I wasn't suggesting trying to homebrew a full data collection system here
> - I agree there are better tools for that already - just that it's basically free to
> track a sum of squares alongside a sum, so that we can trivially calculate a
> useful variance (or standard
> deviation) figure alongside the mean at the end.
For this case, I am not sure if it is true. Unless we expose more data such as
min, max etc. to userspace, it makes no difference whether total_(un)map_nsecs
and total_(un)map_loops are exposed or not.
As
total loops = seconds / (avg_map_latency + avg_unmap_latency);
total_map_nsecs = total loop count * avg_map_latency
total_unmap_nsecs = total loop count * avg_unmap_latency
all of seconds, avg_unmap_latency, avg_unmap_latency are known by
userspace tool.
>
> [...]
> >>> + for (i = 0; i < threads; i++) {
> >>> + tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
> >>> + map->bparam.node, "dma-map-benchmark/%d", i);
> >>> + if (IS_ERR(tsk[i])) {
> >>> + dev_err(map->dev, "create dma_map thread failed\n");
> >>> + return PTR_ERR(tsk[i]);
> >>> + }
> >>> +
> >>> + if (node != NUMA_NO_NODE && node_online(node))
> >>> + kthread_bind_mask(tsk[i], cpu_mask);
> >>> +
> >>> + wake_up_process(tsk[i]);
> >>
> >> Might it be better to create all the threads first, *then* start
> >> kicking them?
> >
> > The difficulty is that we don't know how many threads we should create
> > as the thread number is a parameter to test the contention of IOMMU driver.
> > In my test case, I'd like to test things like One thread Two threads
> > ....
> > 8 threads
> > 12 threads
> > 16 threads...
> >
> > On the other hand, I think it is better to drop the memory of
> > task_struct of those test threads while we are not testing dma map.
>
> I simply meant splitting the loop here into two - one to create the threads and
> set their affinity, then another to wake them all up - so we don't start
> unnecessarily thrashing the system while we're still trying to set up the rest of
> the test ;)
Agreed.
>
> Robin.
Thanks
Barry
next prev parent reply other threads:[~2020-10-31 9:45 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-10-27 3:53 [PATCH 0/2] dma-mapping: provide a benchmark for streaming DMA mapping Barry Song
2020-10-27 3:53 ` Barry Song
2020-10-27 3:53 ` [PATCH 1/2] dma-mapping: add benchmark support for streaming DMA APIs Barry Song
2020-10-27 3:53 ` Barry Song
2020-10-29 19:38 ` Robin Murphy
2020-10-29 19:38 ` Robin Murphy
2020-10-29 21:39 ` Song Bao Hua (Barry Song)
2020-10-29 21:39 ` Song Bao Hua (Barry Song)
2020-10-30 15:48 ` Robin Murphy
2020-10-30 15:48 ` Robin Murphy
2020-10-31 9:44 ` Song Bao Hua (Barry Song) [this message]
2020-10-31 9:44 ` Song Bao Hua (Barry Song)
2020-10-31 21:42 ` Song Bao Hua (Barry Song)
2020-10-31 21:42 ` Song Bao Hua (Barry Song)
2020-10-27 3:53 ` [PATCH 2/2] selftests/dma: add test application for DMA_MAP_BENCHMARK Barry Song
2020-10-27 3:53 ` Barry Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ebf48b2c57384f37aaaeb02ea0d3beff@hisilicon.com \
--to=song.bao.hua@hisilicon.com \
--cc=hch@lst.de \
--cc=iommu@lists.linux-foundation.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=m.szyprowski@samsung.com \
--cc=robin.murphy@arm.com \
--cc=shuah@kernel.org \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.