From: Namhyung Kim <namhyung@kernel.org>
To: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
Arnaldo Carvalho de Melo <acme@kernel.org>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Jiri Olsa <jolsa@redhat.com>, Andi Kleen <ak@linux.intel.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
kernel-team@lge.com
Subject: Re: [PATCH v9 2/3]: perf record: enable asynchronous trace writing
Date: Fri, 5 Oct 2018 16:16:15 +0900
Message-ID: <20181005071615.GC3768@sejong>
In-Reply-To: <5ed78452-ff2b-dfe5-ec29-888971cb4b55@linux.intel.com>
On Wed, Oct 03, 2018 at 07:12:10PM +0300, Alexey Budankov wrote:
>
> Trace file offset is calculated and updated linearly prior
> to enqueuing an aio write at record__pushfn().
>
> record__aio_sync() blocks until the started AIO operation
> completes and then proceeds.
>
> record__mmap_read_sync() implements a barrier for all incomplete
> aio write requests.
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
> Changes in v9:
> - enable AIO streaming only when --aio-cblocks option is specified explicitly
> Changes in v8:
> - split AIO completion check into separate record__aio_complete()
> Changes in v6:
> - handled errno == EAGAIN case from aio_write();
> Changes in v5:
> - data loss metrics decreased from 25% to 2x in trialed configuration;
> - avoided nanosleep() prior to calling aio_suspend();
> - switched to per cpu multi record__aio_sync() aio
> - record_mmap_read_sync() now does global barrier just before
> switching trace file or collection stop;
> - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=100000
> Changes in v4:
> - converted void *bf to struct perf_mmap *md in signatures
> - written comment in perf_mmap__push() just before perf_mmap__get();
> - written comment in record__mmap_read_sync() on possible restarting
> of aio_write() operation and releasing perf_mmap object after all;
> - added perf_mmap__put() for the cases of failed aio_write();
> Changes in v3:
> - written comments about the nanosleep(0.5ms) call prior to aio_suspend()
> to cope with intrusiveness of its implementation in glibc;
> - written comments about the rationale behind copying profiling data
> into the mmap->data buffer;
> ---
> tools/perf/builtin-record.c | 154 +++++++++++++++++++++++++++++++++++++++++++-
> tools/perf/perf.h | 3 +
> tools/perf/util/mmap.c | 73 +++++++++++++++++++++
> tools/perf/util/mmap.h | 4 ++
> 4 files changed, 233 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 0980dfe3396b..c63fe8c021a2 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -124,6 +124,118 @@ static int record__write(struct record *rec, struct perf_mmap *map __maybe_unuse
> return 0;
> }
>
> +#ifdef HAVE_AIO_SUPPORT
> +static int record__aio_write(struct aiocb *cblock, int trace_fd,
> + void *buf, size_t size, off_t off)
> +{
> + int rc;
> +
> + cblock->aio_fildes = trace_fd;
> + cblock->aio_buf = buf;
> + cblock->aio_nbytes = size;
> + cblock->aio_offset = off;
> + cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
> +
> + do {
> + rc = aio_write(cblock);
> + if (rc == 0) {
> + break;
> + } else if (errno != EAGAIN) {
> + cblock->aio_fildes = -1;
> + pr_err("failed to queue perf data, error: %m\n");
> + break;
> + }
> + } while (1);
> +
> + return rc;
> +}
> +
> +static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
> +{
> + void *rem_buf;
> + off_t rem_off;
> + size_t rem_size;
> + int rc, aio_errno;
> + ssize_t aio_ret, written;
> +
> + aio_errno = aio_error(cblock);
> + if (aio_errno == EINPROGRESS)
> + return 0;
> +
> + written = aio_ret = aio_return(cblock);
> + if (aio_ret < 0) {
> + if (aio_errno != EINTR)
> + pr_err("failed to write perf data, error: %m\n");
> + written = 0;
> + }
> +
> + rem_size = cblock->aio_nbytes - written;
> +
> + if (rem_size == 0) {
> + cblock->aio_fildes = -1;
> + /*
> + * md->refcount is incremented in perf_mmap__push() for
> + * every enqueued aio write request so decrement it because
> + * the request is now complete.
> + */
> + perf_mmap__put(md);
> + rc = 1;
> + } else {
> + /*
> + * aio write request may require restart with the
> + * remainder if the kernel didn't write the whole
> + * chunk at once.
> + */
> + rem_off = cblock->aio_offset + written;
> + rem_buf = (void *)(cblock->aio_buf + written);
> + record__aio_write(cblock, cblock->aio_fildes,
> + rem_buf, rem_size, rem_off);
> + rc = 0;
> + }
> +
> + return rc;
> +}
> +
> +static void record__aio_sync(struct perf_mmap *md)
> +{
> + struct aiocb *cblock = &md->cblock;
> + struct timespec timeout = { 0, 1000 * 1000 * 1 }; // 1ms
> +
> + do {
> + if (cblock->aio_fildes == -1 || record__aio_complete(md, cblock))
> + return;
> +
> + while (aio_suspend((const struct aiocb**)&cblock, 1, &timeout)) {
> + if (!(errno == EAGAIN || errno == EINTR))
> + pr_err("failed to sync perf data, error: %m\n");
Is there something we can do in this error case? Any chance it gets
stuck in the loop?
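One possibility, just a sketch and not from the patch (`sync_bounded()`, the mock suspend functions, and the error bound are all hypothetical stand-ins): count consecutive hard errors and bail out, so a persistent failure can't spin forever:

```c
#include <errno.h>

/* Hypothetical bounded variant of the record__aio_sync() wait loop:
 * transient errors (EAGAIN/EINTR) are retried as before, but after
 * max_errors consecutive hard errors we give up instead of looping.
 * suspend_fn stands in for aio_suspend() on the pending cblock. */
static int sync_bounded(int (*suspend_fn)(void), int max_errors)
{
	int errors = 0;

	while (suspend_fn()) {
		if (errno == EAGAIN || errno == EINTR)
			continue;	/* transient: keep waiting */
		if (++errors >= max_errors)
			return -1;	/* hard error persists: bail out */
	}
	return 0;			/* the aio request completed */
}

/* mock suspend functions for demonstration only */
static int suspend_ok(void)  { return 0; }
static int suspend_eio(void) { errno = EIO; return 1; }
```

The caller could then treat -1 as a lost chunk and keep the session alive, or tear the session down cleanly, instead of hanging.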
> + }
> + } while (1);
> +}
> +
> +static int record__aio_pushfn(void *to, struct aiocb *cblock, void *bf, size_t size)
> +{
> + off_t off;
> + struct record *rec = to;
> + int ret, trace_fd = rec->session->data->file.fd;
> +
> + rec->samples++;
> +
> + off = lseek(trace_fd, 0, SEEK_CUR);
> + lseek(trace_fd, off + size, SEEK_SET);
It'd be nice if these lseek() calls could be removed and
rec->bytes_written used instead.
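For illustration, the bookkeeping could look roughly like this (a sketch only; `rec_state` and the helpers are made up, and only `bytes_written` mirrors the real field in builtin-record.c):

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical sketch: derive the aio target offset from a running
 * byte counter instead of lseek()ing the trace fd twice per push. */
struct rec_state {
	off_t bytes_written;	/* mirrors rec->bytes_written */
};

/* Reserve 'size' bytes at the current end of the data file and
 * return the offset the aio write should target; no syscalls. */
static off_t rec_reserve(struct rec_state *rec, size_t size)
{
	off_t off = rec->bytes_written;

	rec->bytes_written += size;
	return off;
}

/* If enqueuing the aio write fails, roll the reservation back. */
static void rec_unreserve(struct rec_state *rec, size_t size)
{
	rec->bytes_written -= size;
}
```

That would also drop the compensating lseek() in the error path below.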
> + ret = record__aio_write(cblock, trace_fd, bf, size, off);
> + if (!ret) {
> + rec->bytes_written += size;
> +
> + if (switch_output_size(rec))
> + trigger_hit(&switch_output_trigger);
Doesn't it need the _sync() before the trigger? Maybe it should be
moved to record__mmap_read_evlist() or so...
> + } else {
> + lseek(trace_fd, off, SEEK_SET);
> + }
> +
> + return ret;
> +}
> +#endif
> +
> static int process_synthesized_event(struct perf_tool *tool,
> union perf_event *event,
> struct perf_sample *sample __maybe_unused,
> @@ -136,7 +248,6 @@ static int process_synthesized_event(struct perf_tool *tool,
> static int record__pushfn(struct perf_mmap *map, void *to, void *bf, size_t size)
> {
> struct record *rec = to;
> -
> rec->samples++;
> return record__write(rec, map, bf, size);
> }
> @@ -513,6 +624,24 @@ static struct perf_event_header finished_round_event = {
> .type = PERF_RECORD_FINISHED_ROUND,
> };
>
> +static void record__mmap_read_sync(struct record *rec __maybe_unused)
> +{
> +#ifdef HAVE_AIO_SUPPORT
> + int i;
> + struct perf_evlist *evlist = rec->evlist;
> + struct perf_mmap *maps = evlist->mmap;
> +
> + if (!rec->opts.nr_cblocks)
> + return;
> +
> + for (i = 0; i < evlist->nr_mmaps; i++) {
> + struct perf_mmap *map = &maps[i];
> + if (map->base)
> + record__aio_sync(map);
> + }
> +#endif
> +}
> +
> static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
> bool overwrite)
> {
> @@ -535,10 +664,29 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> struct perf_mmap *map = &maps[i];
>
> if (map->base) {
> +#ifndef HAVE_AIO_SUPPORT
> if (perf_mmap__push(map, rec, record__pushfn) != 0) {
> rc = -1;
> goto out;
> }
> +#else
> + if (!rec->opts.nr_cblocks) {
> + if (perf_mmap__push(map, rec, record__pushfn) != 0) {
> + rc = -1;
> + goto out;
> + }
> + } else {
> + /*
> + * Call record__aio_sync() to wait till map->data buffer
> + * becomes available after previous aio write request.
> + */
> + record__aio_sync(map);
> + if (perf_mmap__aio_push(map, rec, record__aio_pushfn) != 0) {
> + rc = -1;
> + goto out;
> + }
> + }
> +#endif
If dummy aio functions are provided, the #ifdef can be removed and
just the #else part used, assuming opts.nr_cblocks == 0.
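Roughly like this (a sketch, with perf_mmap reduced to a stub and the push return values made up; only the function names mirror the patch):

```c
/* Sketch of the suggestion: provide a no-op fallback when AIO is
 * compiled out, so the call site needs no #ifdef at all. */
struct perf_mmap { int dummy; };

#ifndef HAVE_AIO_SUPPORT
static inline void record__aio_sync(struct perf_mmap *md)
{
	(void)md;		/* nothing can be pending without AIO */
}
#endif

/* Unified read path: with nr_cblocks == 0 (always the case without
 * AIO support) the plain push is taken and the sync is never hit. */
static int push_one(struct perf_mmap *map, int nr_cblocks)
{
	if (!nr_cblocks)
		return 0;	/* perf_mmap__push(map, ...) */
	record__aio_sync(map);
	return 1;		/* perf_mmap__aio_push(map, ...) */
}
```

With that, record__mmap_read_evlist() compiles the same either way and the behavior difference is purely runtime, keyed off nr_cblocks.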
Thanks,
Namhyung
> }
>
> if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
> @@ -1057,6 +1205,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
>
> if (record__mmap_read_all(rec) < 0) {
> + record__mmap_read_sync(rec);
> trigger_error(&auxtrace_snapshot_trigger);
> trigger_error(&switch_output_trigger);
> err = -1;
> @@ -1068,6 +1217,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> if (!trigger_is_error(&auxtrace_snapshot_trigger))
> record__read_auxtrace_snapshot(rec);
> if (trigger_is_error(&auxtrace_snapshot_trigger)) {
> + record__mmap_read_sync(rec);
> pr_err("AUX area tracing snapshot failed\n");
> err = -1;
> goto out_child;
> @@ -1086,6 +1236,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> */
> if (rec->evlist->bkw_mmap_state == BKW_MMAP_RUNNING)
> continue;
> + record__mmap_read_sync(rec);
> trigger_ready(&switch_output_trigger);
>
> /*
> @@ -1139,6 +1290,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
> disabled = true;
> }
> }
> + record__mmap_read_sync(rec);
> trigger_off(&auxtrace_snapshot_trigger);
> trigger_off(&switch_output_trigger);