From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4F16A46F.8070708@linux.vnet.ibm.com>
Date: Wed, 18 Jan 2012 16:22:31 +0530
From: Harsh Bora
MIME-Version: 1.0
References: <1326193179-19677-1-git-send-email-harsh@linux.vnet.ibm.com> <1326193179-19677-3-git-send-email-harsh@linux.vnet.ibm.com> <4F168D80.8000707@linux.vnet.ibm.com> <4F16A1D3.7070205@linux.vnet.ibm.com>
In-Reply-To: <4F16A1D3.7070205@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH v3 2/3] simpletrace-v2: Handle variable number/size of elements per trace record.
To: Stefan Hajnoczi
Cc: vilanova@ac.upc.edu, qemu-devel@nongnu.org, stefanha@linux.vnet.ibm.com,
	aneesh.kumar@linux.vnet.ibm.com

On 01/18/2012 04:11 PM, Harsh Bora wrote:
> On 01/18/2012 04:01 PM, Stefan Hajnoczi wrote:
>> On Wed, Jan 18, 2012 at 9:14 AM, Harsh Bora wrote:
>>> On 01/10/2012 10:11 PM, Stefan Hajnoczi wrote:
>>>>> +        unused = fwrite(&record, ST_V2_REC_HDR_LEN, 1, trace_fp);
>>>>>          writeout_idx += num_available;
>>>>>      }
>>>>>
>>>>>      idx = writeout_idx % TRACE_BUF_LEN;
>>>>> -    while (get_trace_record(idx, &record)) {
>>>>> -        trace_buf[idx].event = 0; /* clear valid bit */
>>>>> -        unused = fwrite(&record, sizeof(record), 1, trace_fp);
>>>>> -        idx = ++writeout_idx % TRACE_BUF_LEN;
>>>>> +    while (get_trace_record(idx, &recordptr)) {
>>>>> +        unused = fwrite(recordptr, recordptr->length, 1, trace_fp);
>>>>> +        writeout_idx += recordptr->length;
>>>>> +        g_free(recordptr);
>>>>> +        recordptr = (TraceRecord *)&trace_buf[idx];
>>>>> +        recordptr->event = 0;
>>>>> +        idx = writeout_idx % TRACE_BUF_LEN;
>>>>>      }
>>>>
>>>> I'm wondering if it's worth using a different approach here.  Writing
>>>> out individual records has bothered me.
>>>>
>>>> If we have a second buffer, as big as trace_buf[], then a function can
>>>> copy out all records and make them available in trace_buf again:
>>>>
>>>> /**
>>>>  * Copy completed trace records out of the ring buffer
>>>>  *
>>>>  * @idx  offset into trace_buf[]
>>>>  * @buf  destination buffer
>>>>  * @len  size of destination buffer
>>>>  * @ret  the number of bytes consumed
>>>>  */
>>>> size_t consume_trace_records(unsigned int idx, void *buf, size_t len);
>>>>
>>>> That means consume gobbles up as many records as it can:
>>>>  * Until it reaches an invalid record which has not been written yet
>>>>  * Until it reaches the end of trace_buf[]; the caller can call again
>>>>    with idx wrapped to 0
>>>>
>>>> After copying into buf[] it clears the valid bit in trace_buf[].
>>>>
>>>> Then the loop which calls consume_trace_records() does a single
>>>> fwrite(3) and increments idx/writeout_idx.
>>>>
>>>> The advantage to this is that we write many records in one go, and I
>>>> think it splits up the writeout steps a little more nicely than what
>>>> we've previously done.
>>>
>>> I think this optimization can be introduced as a separate patch later.
>>> Let me know if you think otherwise.
>>
>> Yes, that could be done later.  However, there is something incorrect
>> here.  Threads will continue to write trace records into the ring
>> buffer while the write-out thread is doing I/O.  Think about what
>> happens when threads overtake the write-out index modulo the ring
>> buffer size.  Since records are variable-length, the write-out
>> thread's next index could point into the middle of an overwritten
>> record.  And that means the ->length field is junk - we may crash if
>> we use it.
>
> In the case of overwritten records, the valid bit of the event id will
> also be overwritten, and therefore we will not consider that record.
> Moreover, the writeout thread will then learn that some events were
> dropped and will restart from the latest trace_idx, right?
>
> However, to handle the extremely rare case of an overwritten value
> whose valid bit happens to be set, we can add a check for
> < NR_TRACE_EVENTS.  Would it be even better to have a magic byte for
> events too?
>
> Harsh

Also, I just realized that we need to add buffer boundary checks while
copying a trace record:

>> -static bool get_trace_record(unsigned int idx, TraceRecord *record)
>> +static bool get_trace_record(unsigned int idx, TraceRecord **recordptr)
>>  {
>> -    if (!(trace_buf[idx].event & TRACE_RECORD_VALID)) {
>> +    TraceRecord *record = (TraceRecord *)&trace_buf[idx];
>> +    if (!(record->event & TRACE_RECORD_VALID)) {
>>          return false;
>>      }
>>
>>      __sync_synchronize(); /* read memory barrier before accessing record */
>>
>> -    *record = trace_buf[idx];
>> -    record->event &= ~TRACE_RECORD_VALID;
>> +    *recordptr = g_malloc(record->length);
>> +    /* make a copy of record to avoid being overwritten */
>> +    memcpy(*recordptr, record, record->length);

So, I need to use a wrapper which copies byte by byte, taking care of
boundary overlaps, and that will also save us from crashing (as we will
never cross buffer boundaries), right?

Harsh

>> +    (*recordptr)->event &= ~TRACE_RECORD_VALID;
>>      return true;
>>  }
>
>>
>>>>>
>>>>>      fflush(trace_fp);
>>>>> @@ -231,7 +196,7 @@ void st_set_trace_file_enabled(bool enable)
>>>>>      static const TraceRecord header = {
>>>>>          .event = HEADER_EVENT_ID,
>>>>>          .timestamp_ns = HEADER_MAGIC,
>>>>> -        .x1 = HEADER_VERSION,
>>>>> +        .length = HEADER_VERSION,
>>>>
>>>> Hmm... this is kind of ugly (see the comment about using .length
>>>> above), but in this case most parsers will have a special case anyway
>>>> to check the magic number.  We need to use the .length field because
>>>> historically that's where the version is located.
>>>
>>> So, let's keep the version here only, right?
>>
>> Yes, it's necessary to do .length = HEADER_VERSION.
>>
>> Stefan