Date: Wed, 24 Sep 2008 12:51:31 -0400
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: Martin Bligh, Steven Rostedt, linux-kernel@vger.kernel.org,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad@linux.vnet.ibm.com,
	Linus Torvalds, "Frank Ch. Eigler", David Wilder, hch@lst.de,
	Tom Zanussi
Subject: Re: [RFC PATCH 1/3] Unified trace buffer
Message-ID: <20080924165131.GA1045@Krystal>
In-Reply-To: <1222274361.16700.182.camel@lappy.programming.kicks-ass.net>
References: <20080924051056.650388887@goodmis.org>
	<20080924051400.195780424@goodmis.org>
	<1222268595.16700.149.camel@lappy.programming.kicks-ass.net>
	<33307c790809240847r31c8b683na15ff5488b60d25b@mail.gmail.com>
	<20080924161347.GA31451@Krystal>
	<1222274361.16700.182.camel@lappy.programming.kicks-ass.net>

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Wed, 2008-09-24 at 12:13 -0400, Mathieu Desnoyers wrote:
> > * Martin Bligh (mbligh@google.com) wrote:
> > Thanks for creating this so quickly ;-)
> >
> > > >> We can record either the fast way of reserving a part of the buffer:
> > > >>
> > > >> event = ring_buffer_lock_reserve(buffer, event_id, length, &flags);
> > > >> event->data = record_this_data;
> > > >> ring_buffer_unlock_commit(buffer, event, flags);
> > > >
> > > > This can, in general, not work, due to the simple fact that we might
> > > > straddle a page boundary. Therefore I think it's best to limit
> > > > ourselves to the write interface below, so that it can handle that.
> > >
> > > I'm not sure why this is any harder to deal with in write than it is
> > > in reserve? We should be able to make reserve handle this just
> > > as well?
> > >
> > > If you use write rather than reserve, you have to copy all the data
> > > twice for every event.
> >
> > I think we all agree that a supplementary copy is not wanted, but I
> > think this question is orthogonal to having a write wrapper. This
> > reserve/commit mechanism deals with synchronization (cli/spinlock or
> > cmpxchg_local scheme...).
>
> Right
>
> > We can then use this offset to see in which page(s) we have to write.
> > This offset + len can in fact cross multiple page boundaries.
>
> Sure
>
> > Doing this elegantly could involve a page array that would represent
> > the buffer data:
> >
> > struct page **buffer;
>
> I really don't like the page array, but we can do without..
>
> > And be given as parameter to the read() and write() methods, which
> > would deal with page-crossing.
> >
> > e.g.
> >
> > size_t write(struct page **buffer, size_t woffset, void *data, size_t len);
> >
> > Therefore, we could have code which writes in the buffers, without
> > extra copy, and without using vmap, in multiple writes for a single
> > event, which would deal with data alignment, e.g.:
> >
> > size_t woffset, evsize = 0;
> >
> > evsize += write(NULL, evsize, &var1, sizeof(var1));
> > evsize += write(NULL, evsize, &var2, sizeof(var2));
> > evsize += write(NULL, evsize, &var3, sizeof(var3));
> >
> > woffset = reserve(..., evsize);
> >
> > woffset += write(buffer, woffset, &var1, sizeof(var1));
> > woffset += write(buffer, woffset, &var2, sizeof(var2));
> > woffset += write(buffer, woffset, &var3, sizeof(var3));
> >
> > commit(..., evsize);
> >
> > Does that make sense?
>
> Yes, we can do the sub-write, how about:
>
> struct ringbuffer_write_state
> ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);
>
> int ringbuffer_write(struct ringbuffer_write_state *state,
>                      const void *buf, unsigned long size);
>
> void ringbuffer_write_finish(struct ringbuffer_write_state *state);
>
> That way write_start() can do the reserve and set a local write
> iterator. write() can then do whatever, either the direct copy or break
> it up - it will error on overflowing the reserved size. write_finish()
> will clean up (sti, preempt_enable, etc.).

Yup, that looks neat.

I don't know if it's worth separating data alignment concerns from this
part of the infrastructure so it stays simple. OTOH, embedding automatic
alignment of data elements would be easy to do here, e.g.:

struct ringbuffer_write_state
ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);

int ringbuffer_write(struct ringbuffer_write_state *state,
                     const void *buf, unsigned long size,
                     unsigned long alignment);

#define ringbuffer_compute_size(size, alignment) \
	ringbuffer_write(NULL, NULL, size, alignment)

void ringbuffer_write_finish(struct ringbuffer_write_state *state);

So ringbuffer_compute_size() could be used to compute the total slot
size needed to write the event before doing the
ringbuffer_write_start(). It would be good to keep ringbuffer_write()
mostly as a static inline so the compiler could statically compile in
much of these operations.
Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68