From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753012AbcALFdx (ORCPT ); Tue, 12 Jan 2016 00:33:53 -0500 Received: from szxga02-in.huawei.com ([119.145.14.65]:65520 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750701AbcALFdv (ORCPT ); Tue, 12 Jan 2016 00:33:51 -0500 Message-ID: <56949028.2070208@huawei.com> Date: Tue, 12 Jan 2016 13:33:28 +0800 From: "Wangnan (F)" User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Alexei Starovoitov , Peter Zijlstra CC: , , , , , , Adrian Hunter , Arnaldo Carvalho de Melo , David Ahern , Ingo Molnar , Yunlong Song Subject: Re: [PATCH 27/53] perf/core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE References: <1452520124-2073-1-git-send-email-wangnan0@huawei.com> <1452520124-2073-28-git-send-email-wangnan0@huawei.com> <20160111180913.GA25950@ast-mbp.thefacebook.com> In-Reply-To: <20160111180913.GA25950@ast-mbp.thefacebook.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.66.109] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A0B0202.56949038.002E,ss=1,re=0.000,recu=0.000,reip=0.000,cl=1,cld=1,fgs=0, ip=0.0.0.0, so=2013-06-18 04:22:30, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: 88067208c1c26f558470e61c78357c95 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2016/1/12 2:09, Alexei Starovoitov wrote: > On Mon, Jan 11, 2016 at 01:48:18PM +0000, Wang Nan wrote: >> This patch introduces a PERF_SAMPLE_TAILSIZE flag which allows a size >> field attached at the end of a sample. The idea comes from [1] that, >> with tie size at tail of an event, it is possible for user program who >> read from the ring buffer parse events backward. >> >> For example: >> >> head >> | >> V >> +--+---+-------+----------+------+---+ >> |E6|...| B 8| C 11| D 7|E..| >> +--+---+-------+----------+------+---+ >> >> In this case, from the 'head' pointer provided by kernel, user program >> can first see '6' by (*(head - sizeof(u64))), then it can get the start >> pointer of record 'E', then it can read size and find start position >> of record D, C, B in similar way. > adding extra 8 bytes for every sample is quite unfortunate. > How about another idea: > . update data_tail pointer when head is about to overwrite it > > Ex: > head data_tail > | | > V V > +--+-------+-------+---+----+---+ > |E | ... | B | C | D | E | > +--+-------+-------+---+----+---+ > > if new sample F is about to overwrite B, the kernel would need > to read the size of B from B's header and update data_tail to point C. > Or even further. > Comparing to TAILSIZE approach, now kernel will be doing both reads > and writes into ring-buffer and there is a concern that reads may > be hitting cold data, but if the records are small they may be > actually on the same cache line brought by the previous > read A's header, write E record cycle. So I think we shouldn't see > cache misses. After ring buffer rewind, we need a read before nearly every write operations. The performance penalty depends on configuration of write allocate. In addition, another data dependency is required: we must wait for the size of event B is retrived before overwrite it. Even in the very first try at 2013 in [1], reading from the ring buffer is avoided. I don't think Peter changes his mind now. > Another concern is validity of records stored. If user space messes > with ring-buffer, kernel won't be able to move data_tail properly > and would need to indicate that to userspace somehow. > But memory saving of 8 bytes per record could be sizable Yes. But I have already discussed with Peter on this in [2]. Last month I suggested: 1. If PERF_SAMPLE_SIZE is selected, we can avoid outputting the event size in header. Which eliminate extra space cost; However: That would mandate you always parse the stream backwards. Which seems rather unfortunate. Also, no you cannot recoup the extra space, see the alignment and size requirement. > and > user space wouldn't need to walk the whole buffer backwards and > can just start from valid data_tail, so the dumps of overwrite > ring-buffer will be faster too. > Thoughts? > Please also refer to [3]. In that patch we introduced a userspace ring buffer in perf and make it continously collect data from normal ring buffers. Since we have to wake up perf to read data, the cost is even higher. [1] http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net [2] http://lkml.kernel.org/r/20151203100801.GV3816@twins.programming.kicks-ass.net [3] http://lkml.kernel.org/r/1448373632-8806-1-git-send-email-yunlong.song@huawei.com