From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:53452)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <harsh@linux.vnet.ibm.com>) id 1RnTOg-0006dg-Cu
	for qemu-devel@nongnu.org; Wed, 18 Jan 2012 06:09:48 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <harsh@linux.vnet.ibm.com>) id 1RnTOR-0007U4-0x
	for qemu-devel@nongnu.org; Wed, 18 Jan 2012 06:09:42 -0500
Received: from e28smtp05.in.ibm.com ([122.248.162.5]:37435)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <harsh@linux.vnet.ibm.com>) id 1RnTOQ-0007T6-4y
	for qemu-devel@nongnu.org; Wed, 18 Jan 2012 06:09:26 -0500
Received: from /spool/local
	by e28smtp05.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use
	Only! Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <harsh@linux.vnet.ibm.com>;
	Wed, 18 Jan 2012 16:39:21 +0530
Received: from d28av03.in.ibm.com (d28av03.in.ibm.com [9.184.220.65])
	by d28relay04.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	q0IB9HVd4292760
	for <qemu-devel@nongnu.org>; Wed, 18 Jan 2012 16:39:17 +0530
Received: from d28av03.in.ibm.com (loopback [127.0.0.1])
	by d28av03.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id
	q0IB9GD7004714
	for <qemu-devel@nongnu.org>; Wed, 18 Jan 2012 22:09:17 +1100
Message-ID: <4F16A85C.8070301@linux.vnet.ibm.com>
Date: Wed, 18 Jan 2012 16:39:16 +0530
From: Harsh Bora <harsh@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <1326193179-19677-1-git-send-email-harsh@linux.vnet.ibm.com>
	<1326193179-19677-3-git-send-email-harsh@linux.vnet.ibm.com>
	<CAJSP0QWVSENstH8HtBhcAf43Wr2PuTrpnf7P7ZWodE8t12uUFA@mail.gmail.com>
	<4F168D80.8000707@linux.vnet.ibm.com>
	<CAJSP0QWi8e8kzJDMKGGnVdo7b8F_GTtR64OBH5wJDAX7nFgB3A@mail.gmail.com>
	<4F16A1D3.7070205@linux.vnet.ibm.com>	<4F16A46F.8070708@linux.vnet.ibm.com>
	<CAJSP0QXHK5Jmf1Zmrq4_A24eJJpKMyv9pwr459PYRwhGBC-8=g@mail.gmail.com>
In-Reply-To: <CAJSP0QXHK5Jmf1Zmrq4_A24eJJpKMyv9pwr459PYRwhGBC-8=g@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH v3 2/3] simpletrace-v2: Handle variable
 number/size of elements per trace record.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: aneesh.kumar@linux.vnet.ibm.com, vilanova@ac.upc.edu, stefanha@linux.vnet.ibm.com, qemu-devel@nongnu.org

On 01/18/2012 04:29 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 18, 2012 at 10:52 AM, Harsh Bora<harsh@linux.vnet.ibm.com>  wrote:
>> On 01/18/2012 04:11 PM, Harsh Bora wrote:
>>>
>>> On 01/18/2012 04:01 PM, Stefan Hajnoczi wrote:
>>>>
>>>> On Wed, Jan 18, 2012 at 9:14 AM, Harsh Bora<harsh@linux.vnet.ibm.com>
>>>> wrote:
>>>>>
>>>>> On 01/10/2012 10:11 PM, Stefan Hajnoczi wrote:
>>>>>>>
>>>>>>> + unused = fwrite(&record, ST_V2_REC_HDR_LEN, 1, trace_fp);
>>>>>>> writeout_idx += num_available;
>>>>>>> }
>>>>>>>
>>>>>>> idx = writeout_idx % TRACE_BUF_LEN;
>>>>>>> - while (get_trace_record(idx,&record)) {
>>>>>>> - trace_buf[idx].event = 0; /* clear valid bit */
>>>>>>> - unused = fwrite(&record, sizeof(record), 1, trace_fp);
>>>>>>> - idx = ++writeout_idx % TRACE_BUF_LEN;
>>>>>>> + while (get_trace_record(idx,&recordptr)) {
>>>>>>> + unused = fwrite(recordptr, recordptr->length, 1, trace_fp);
>>>>>>> + writeout_idx += recordptr->length;
>>>>>>> + g_free(recordptr);
>>>>>>> + recordptr = (TraceRecord *)&trace_buf[idx];
>>>>>>>
>>>>>>> + recordptr->event = 0;
>>>>>>> + idx = writeout_idx % TRACE_BUF_LEN;
>>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm wondering if it's worth using a different approach here. Writing
>>>>>> out individual records has bothered me.
>>>>>>
>>>>>> If we have a second buffer, as big as trace_buf[], then a function can
>>>>>> copy out all records and make them available in trace_buf again:
>>>>>>
>>>>>> /**
>>>>>> * Copy completed trace records out of the ring buffer
>>>>>> *
>>>>>> * @idx offset into trace_buf[]
>>>>>> * @buf destination buffer
>>>>>> * @len size of destination buffer
>>>>>> * @ret the number of bytes consumed
>>>>>> */
>>>>>> size_t consume_trace_records(unsigned int idx, void *buf, size_t len);
>>>>>>
>>>>>> That means consume gobbles up as many records as it can:
>>>>>> * Until it reaches an invalid record which has not been written yet
>>>>>> * Until it reaches the end of trace_buf[], the caller can call again
>>>>>> with idx wrapped to 0
>>>>>>
>>>>>> After copying into buf[] it clears the valid bit in trace_buf[].
>>>>>>
>>>>>> Then the loop which calls consume_trace_records() does a single
>>>>>> fwrite(3) and increments idx/writeout_idx.
>>>>>>
>>>>>> The advantage to this is that we write many records in one go and I
>>>>>> think it splits up the writeout steps a little nicer than what we've
>>>>>> previously done.
>>>>>>
>>>>>
>>>>> I think this optimization can be introduced as a separate patch later.
>>>>> Let me know if you think otherwise.
>>>>
>>>>
>>>> Yes, that could be done later. However there is something incorrect
>>>> here. Threads will continue to write trace records into the
>>>> ringbuffer while the write-out thread is doing I/O. Think about what
>>>> happens when threads overtake the write-out index modulo ringbuffer
>>>> size. Since records are variable-length the write-out thread's next
>>>> index could point into the middle of an overwritten record. And that
>>>> means the ->length field is junk - we may crash if we use it.
>>>
>>>
>>> In case of overwritten records, the valid bit of event id will also be
>>> overwritten, and therefore we will not consider that record. Moreover,
>>> the writeout thread will further get to know that some events were
>>> dropped and will start with the latest trace_idx, right ?
>>>
>>> However, to handle the extreme rare case of having an overwritten value
>>> such that its valid bit appears to be set, we can put a check for<
>>> NR_TRACE_EVENTS. Even better to have a magic byte for events also ?
>>>
>>> Harsh
>>
>>
>> Also, I just realized that we need to put buffer boundary checks while
>> copying a trace record:
>>
>>
>>>> -static bool get_trace_record(unsigned int idx, TraceRecord *record)
>>>> +static bool get_trace_record(unsigned int idx, TraceRecord **recordptr)
>>>>    {
>>>> -    if (!(trace_buf[idx].event&    TRACE_RECORD_VALID)) {
>>>> +    TraceRecord *record = (TraceRecord *)&trace_buf[idx];
>>>> +    if (!(record->event&    TRACE_RECORD_VALID)) {
>>>>           return false;
>>>>       }
>>>>
>>>>       __sync_synchronize(); /* read memory barrier before accessing record
>>>> */
>>>>
>>>> -    *record = trace_buf[idx];
>>>> -    record->event&= ~TRACE_RECORD_VALID;
>>>> +    *recordptr = g_malloc(record->length);
>>>> +    /* make a copy of record to avoid being overwritten */
>>>> +    memcpy(*recordptr, record, record->length);
>>
>> So, I need to use a wrapper which should copy byte by byte taking care of
>> boundary overlaps, and that will also save us from crashing (as we will
>> never be crossing buffer boundaries), right ?
>
> It won't be enough if length has a junk value like 2^32 - 1.  We don't
> want to allocate/copy 4 GB :).

So, whats the max limit that we may want to propose for a record length 
? Obviously less than buffer length, or TRACE_BUF_LEN / 2 ?

Hope, a check for max buffer length with byte-wise copy will help ?

Harsh

>
> Stefan
>