From: "Wangnan (F)" <wangnan0@huawei.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <acme@kernel.org>, <linux-kernel@vger.kernel.org>,
<pi3orama@163.com>, <lizefan@huawei.com>,
<netdev@vger.kernel.org>, <davem@davemloft.net>,
Adrian Hunter <adrian.hunter@intel.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>,
David Ahern <dsahern@gmail.com>, Ingo Molnar <mingo@kernel.org>,
Yunlong Song <yunlong.song@huawei.com>
Subject: Re: [PATCH 27/53] perf/core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE
Date: Tue, 12 Jan 2016 20:36:23 +0800 [thread overview]
Message-ID: <5694F347.5010700@huawei.com> (raw)
In-Reply-To: <20160112061145.GA31444@ast-mbp.thefacebook.com>
On 2016/1/12 14:11, Alexei Starovoitov wrote:
> On Tue, Jan 12, 2016 at 01:33:28PM +0800, Wangnan (F) wrote:
>>
>> On 2016/1/12 2:09, Alexei Starovoitov wrote:
>>> On Mon, Jan 11, 2016 at 01:48:18PM +0000, Wang Nan wrote:
>>>> This patch introduces a PERF_SAMPLE_TAILSIZE flag which allows a size
>>>> field attached at the end of a sample. The idea comes from [1] that,
>>>> with tie size at tail of an event, it is possible for user program who
>>>> read from the ring buffer parse events backward.
>>>>
>>>> For example:
>>>>
>>>> head
>>>> |
>>>> V
>>>> +--+---+-------+----------+------+---+
>>>> |E6|...| B 8| C 11| D 7|E..|
>>>> +--+---+-------+----------+------+---+
>>>>
>>>> In this case, from the 'head' pointer provided by kernel, user program
>>>> can first see '6' by (*(head - sizeof(u64))), then it can get the start
>>>> pointer of record 'E', then it can read size and find start position
>>>> of record D, C, B in similar way.
>>> adding extra 8 bytes for every sample is quite unfortunate.
>>> How about another idea:
>>> . update data_tail pointer when head is about to overwrite it
>>>
>>> Ex:
>>> head data_tail
>>> | |
>>> V V
>>> +--+-------+-------+---+----+---+
>>> |E | ... | B | C | D | E |
>>> +--+-------+-------+---+----+---+
>>>
>>> if new sample F is about to overwrite B, the kernel would need
>>> to read the size of B from B's header and update data_tail to point C.
>>> Or even further.
>>> Comparing to TAILSIZE approach, now kernel will be doing both reads
>>> and writes into ring-buffer and there is a concern that reads may
>>> be hitting cold data, but if the records are small they may be
>>> actually on the same cache line brought by the previous
>>> read A's header, write E record cycle. So I think we shouldn't see
>>> cache misses.
>> After ring buffer rewind, we need a read before nearly
>> every write operations. The performance penalty depends on
>> configuration of write allocate. In addition, another data
>> dependency is required: we must wait for the size of
>> event B is retrived before overwrite it.
>>
>> Even in the very first try at 2013 in [1], reading from the ring
>> buffer is avoided. I don't think Peter changes his mind now.
>>
>>> Another concern is validity of records stored. If user space messes
>>> with ring-buffer, kernel won't be able to move data_tail properly
>>> and would need to indicate that to userspace somehow.
>>> But memory saving of 8 bytes per record could be sizable
>> Yes. But I have already discussed with Peter on this in [2].
>> Last month I suggested:
>>
>> <quote>
>>
>> 1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event
>> size in header. Which eliminate extra space cost;
>> </quote>
>>
>> However:
>>
>> <quote>
>>
>> That would mandate you always parse the stream backwards. Which seems
>> rather unfortunate. Also, no you cannot recoup the extra space, see the
>> alignment and size requirement.
> hmm, in this kernel patch I see that you're adding 8 bytes for
> every record via this extra TAILSISZE flag and in perf you're
> walking the ring buffer backwards by reading this 8 byte
> sizes, comparing header sizes and so on until reaching beginning,
> where you start dumping it as normal.
> So for this 'signal to perf' approach to work the ring buffer
> will contain tailsizes everywhere just so that user space can
> find the beginning. That's not very pretty. imo if kernel
> can do header read to adjust data_tail it would make user
> space side clean. May be there are other solutions.
> Adding tailsize seems like brute force hack.
> There must be some nicer way.
Hi Peter,
What's your opinion? Should we reconsider moving size field from
header the end?
Or moving whole header to the end of a record?
Thank you.
next prev parent reply other threads:[~2016-01-12 12:36 UTC|newest]
Thread overview: 90+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-11 13:47 [PATCH 00/53] perf tools: Bugfix, BPF improvement and perf record flight record mode Wang Nan
2016-01-11 13:47 ` [PATCH 01/53] perf tools: Add -lutil in python lib list for broken python-config Wang Nan
2016-01-12 9:43 ` Jiri Olsa
2016-01-11 13:47 ` [PATCH 02/53] perf tools: Fix phony build target for build-test Wang Nan
2016-01-11 13:47 ` [PATCH 03/53] perf tools: Set parallel making options build-test Wang Nan
2016-01-11 13:47 ` [PATCH 04/53] perf tools: Pass O option to Makefile.perf in build-test Wang Nan
2016-01-11 13:47 ` [PATCH 05/53] perf tools: Test correct path of perf " Wang Nan
2016-01-11 15:24 ` Arnaldo Carvalho de Melo
2016-01-11 22:06 ` Arnaldo Carvalho de Melo
2016-01-11 22:39 ` Arnaldo Carvalho de Melo
2016-01-11 22:39 ` Arnaldo Carvalho de Melo
2016-01-12 7:16 ` Wangnan (F)
2016-01-12 14:08 ` Arnaldo Carvalho de Melo
2016-01-11 13:47 ` [PATCH 06/53] perf tools: Fix PowerPC native building Wang Nan
2016-01-11 13:47 ` [PATCH 07/53] tools: Move Makefile.arch from perf/config to tools/scripts Wang Nan
2016-01-11 13:52 ` Wangnan (F)
2016-01-11 14:10 ` Arnaldo Carvalho de Melo
2016-01-11 13:47 ` [PATCH 08/53] perf tools: Add missing sources in perf's MANIFEST Wang Nan
2016-01-11 13:48 ` [PATCH 09/53] perf: bpf: Fix build breakage due to libbpf Wang Nan
2016-01-11 13:48 ` [PATCH 10/53] tools build: Add BPF feature check to test-all Wang Nan
2016-01-11 13:48 ` [PATCH 11/53] perf test: Fix false TEST_OK result for 'perf test hist' Wang Nan
2016-01-11 14:25 ` Sergei Shtylyov
2016-01-11 14:58 ` Arnaldo Carvalho de Melo
2016-01-11 15:32 ` Arnaldo Carvalho de Melo
2016-01-11 13:48 ` [PATCH 12/53] perf test: Reset err after using it hold errcode in hist testcases Wang Nan
2016-01-11 13:48 ` [PATCH 13/53] perf tools: Prevent calling machine__delete() on non-allocated machine Wang Nan
2016-01-11 15:42 ` Arnaldo Carvalho de Melo
2016-01-12 7:03 ` Wangnan (F)
2016-01-12 14:07 ` Arnaldo Carvalho de Melo
2016-01-11 13:48 ` [PATCH 14/53] perf test: Check environment before start real BPF test Wang Nan
2016-01-11 21:55 ` Arnaldo Carvalho de Melo
2016-01-12 7:40 ` Wangnan (F)
2016-01-12 14:10 ` Arnaldo Carvalho de Melo
2016-01-11 13:48 ` [PATCH 15/53] perf tools: Fix symbols searching for offline module in buildid-cache Wang Nan
2016-01-11 13:48 ` [PATCH 16/53] perf tools: Fix mmap2 event allocation in synthesize code Wang Nan
2016-01-11 13:48 ` [PATCH 17/53] perf test: Improve bp_signal Wang Nan
2016-01-11 21:37 ` Arnaldo Carvalho de Melo
2016-01-12 4:13 ` Wangnan (F)
2016-01-12 9:21 ` Jiri Olsa
2016-01-12 14:11 ` Arnaldo Carvalho de Melo
2016-01-12 14:17 ` Will Deacon
2016-01-11 13:48 ` [PATCH 18/53] perf tools: Add API to config maps in bpf object Wang Nan
2016-01-11 13:48 ` [PATCH 19/53] perf tools: Enable BPF object configure syntax Wang Nan
2016-01-11 13:48 ` [PATCH 20/53] perf record: Apply config to BPF objects before recording Wang Nan
2016-01-11 13:48 ` [PATCH 21/53] perf tools: Enable passing event to BPF object Wang Nan
2016-01-11 13:48 ` [PATCH 22/53] perf tools: Support perf event alias name Wang Nan
2016-01-11 13:48 ` [PATCH 23/53] perf tools: Support setting different slots in a BPF map separately Wang Nan
2016-01-11 13:48 ` [PATCH 24/53] perf tools: Enable indices setting syntax for BPF maps Wang Nan
2016-01-11 13:48 ` [PATCH 25/53] perf tools: Introduce bpf-output event Wang Nan
2016-01-11 13:48 ` [PATCH 26/53] perf data: Support converting data from bpf_perf_event_output() Wang Nan
2016-01-11 13:48 ` [PATCH 27/53] perf/core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE Wang Nan
2016-01-11 18:09 ` Alexei Starovoitov
2016-01-12 5:33 ` Wangnan (F)
2016-01-12 6:11 ` Alexei Starovoitov
2016-01-12 12:36 ` Wangnan (F) [this message]
2016-01-12 19:56 ` Alexei Starovoitov
2016-01-13 4:34 ` Wangnan (F)
2016-01-13 5:14 ` Alexei Starovoitov
2016-01-12 14:05 ` Peter Zijlstra
2016-01-12 14:14 ` Peter Zijlstra
2016-01-11 13:48 ` [PATCH 28/53] perf tools: Move timestamp creation to util Wang Nan
2016-01-11 13:48 ` [PATCH 29/53] perf tools: Make ordered_events reusable Wang Nan
2016-01-11 21:33 ` Arnaldo Carvalho de Melo
2016-01-11 13:48 ` [PATCH 30/53] perf record: Extract synthesize code to record__synthesize() Wang Nan
2016-01-11 13:48 ` [PATCH 31/53] perf tools: Add perf_data_file__switch() helper Wang Nan
2016-01-11 13:48 ` [PATCH 32/53] perf record: Turns auxtrace_snapshot_enable into 3 states Wang Nan
2016-01-11 13:48 ` [PATCH 33/53] perf record: Introduce record__finish_output() to finish a perf.data Wang Nan
2016-01-11 13:48 ` [PATCH 34/53] perf record: Use OPT_BOOLEAN_SET for buildid cache related options Wang Nan
2016-01-11 13:48 ` [PATCH 35/53] perf record: Add '--timestamp-filename' option to append timestamp to output filename Wang Nan
2016-01-11 13:48 ` [PATCH 36/53] perf record: Split output into multiple files via '--switch-output' Wang Nan
2016-01-11 13:48 ` [PATCH 37/53] perf record: Force enable --timestamp-filename when --switch-output is provided Wang Nan
2016-01-11 13:48 ` [PATCH 38/53] perf record: Disable buildid cache options by default in switch output mode Wang Nan
2016-01-11 13:48 ` [PATCH 39/53] perf record: Re-synthesize tracking events after output switching Wang Nan
2016-01-11 13:48 ` [PATCH 40/53] perf record: Generate tracking events for process forked by perf Wang Nan
2016-01-11 13:48 ` [PATCH 41/53] perf record: Ensure return non-zero rc when mmap fail Wang Nan
2016-01-11 13:48 ` [PATCH 42/53] perf record: Prevent reading invalid data in record__mmap_read Wang Nan
2016-01-11 14:21 ` Sergei Shtylyov
2016-01-11 15:00 ` Arnaldo Carvalho de Melo
2016-01-11 15:01 ` Arnaldo Carvalho de Melo
2016-01-11 13:48 ` [PATCH 43/53] perf tools: Add evlist channel helpers Wang Nan
2016-01-11 13:48 ` [PATCH 44/53] perf tools: Automatically add new channel according to evlist Wang Nan
2016-01-11 13:48 ` [PATCH 45/53] perf tools: Operate multiple channels Wang Nan
2016-01-11 13:48 ` [PATCH 46/53] perf tools: Squash overwrite setting into channel Wang Nan
2016-01-11 13:48 ` [PATCH 47/53] perf record: Don't read from and poll overwrite channel Wang Nan
2016-01-11 13:48 ` [PATCH 48/53] perf tools: Enable overwrite settings Wang Nan
2016-01-11 13:48 ` [PATCH 49/53] perf tools: Consider TAILSIZE bit when caclulate is_pos Wang Nan
2016-01-11 13:48 ` [PATCH 50/53] perf tools: Set tailsize attribut bit for overwrite events Wang Nan
2016-01-11 13:48 ` [PATCH 51/53] perf record: Read from tailsize ring buffer Wang Nan
2016-01-11 13:48 ` [PATCH 52/53] perf record: Toggle tailsize ring buffer for reading Wang Nan
2016-01-11 13:48 ` [PATCH 53/53] perf record: Allow generate tracking events at the end of output Wang Nan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5694F347.5010700@huawei.com \
--to=wangnan0@huawei.com \
--cc=a.p.zijlstra@chello.nl \
--cc=acme@kernel.org \
--cc=acme@redhat.com \
--cc=adrian.hunter@intel.com \
--cc=alexei.starovoitov@gmail.com \
--cc=davem@davemloft.net \
--cc=dsahern@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lizefan@huawei.com \
--cc=mingo@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pi3orama@163.com \
--cc=yunlong.song@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).