* [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-14 18:32 Mathieu Desnoyers
From: Mathieu Desnoyers @ 2010-05-14 18:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2010-05-13 at 12:31 -0400, Mathieu Desnoyers wrote:
> >
> > In addition, this would play well with mmap() too: we can simply add a
> > ring_buffer_get_mmap_offset() method to the backend (exported through another
> > ioctl) that would let user-space know the start of the mmap'd buffer range
> > currently owned by the reader. So we can inform user-space of the currently
> > owned page range without even changing the underlying memory map.
>
> I still think keeping refs to splice pages is tricky at best. Suppose
> they're spliced into the pagecache of a file, it could stay there for a
> long time under some conditions.
>
> Also, the splice-client (say the pagecache) and the mmap will both want
> the pageframe to contain different information.
[CCing memory management specialists]
You bring a very interesting point. Let me describe what I want to achieve, and
see what others have to say about it:
I want the ring buffer to allocate pages only at ring buffer creation (never
while tracing). There are a few reasons why I want to do that, ranging from
improved performance to limited system disturbance.
Now let's suppose we have the synchronization mechanism (detailed in the original
thread, but not relevant to this part of the discussion) that lets us hand the
pages to the ring buffer "reader", which sends them to splice() so it can use
them as write buffers. Let's also suppose that the ring buffer reader blocks until
the pages are written to the disk (synchronous write). In my scheme, the reader
still has pointers to these pages.
The point you bring up here is that when the ring buffer "reader" is woken up,
these pages could still be in the page cache. So when the reader gives these
pages back to the ring buffer (so they can be used for writing again), the page
cache may still hold a reference to them. The pages in the page cache and the
version on disk could then be out of sync, and this could possibly lead
to trace file corruption (in the worst case).
So I have three questions here:
1 - could we enforce removal of these pages from the page cache by calling
"page_cache_release()" before giving these pages back to the ring buffer ?
2 - or maybe is there a page flag we could specify when we allocate them to
ask for these pages to never be put in the page cache ? (but they should
still be usable as write buffers)
3 - is there something more we need to do to grab a reference on the pages
before passing them to splice(), so that when we call page_cache_release()
they don't get reclaimed ?
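For concreteness, the user-space side of the handoff I have in mind looks roughly
like the sketch below: the reader feeds data into a pipe and splices it out to the
trace file, blocking until the destination has consumed it. The helper name and fd
plumbing here are illustrative only, not the actual LTTng interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Move 'len' bytes from in_fd to out_fd through a pipe with splice(),
 * blocking until the destination has consumed them -- the same
 * synchronous handoff the ring buffer reader performs with its pages.
 */
static ssize_t splice_through_pipe(int in_fd, int out_fd, size_t len)
{
	int pfd[2];
	size_t left = len;

	if (pipe(pfd) < 0)
		return -1;

	while (left > 0) {
		/* Fill the pipe from the source. */
		ssize_t n = splice(in_fd, NULL, pfd[1], NULL, left,
				   SPLICE_F_MOVE);
		if (n <= 0)
			break;
		/* Drain the pipe into the destination. */
		for (ssize_t done = 0; done < n; ) {
			ssize_t k = splice(pfd[0], NULL, out_fd, NULL,
					   (size_t)(n - done), SPLICE_F_MOVE);
			if (k <= 0)
				goto out;
			done += k;
		}
		left -= (size_t)n;
	}
out:
	close(pfd[0]);
	close(pfd[1]);
	return (ssize_t)(len - left);
}
```

Note that SPLICE_F_MOVE is only a hint; whether pages are actually moved or
copied is decided by the kernel, which is precisely what the questions above are
about.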
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-14 18:49 ` Peter Zijlstra
From: Peter Zijlstra @ 2010-05-14 18:49 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
On Fri, 2010-05-14 at 14:32 -0400, Mathieu Desnoyers wrote:
> [CCing memory management specialists]
And yet you forgot Jens who wrote it ;-)
> So I have three questions here:
>
> 1 - could we enforce removal of these pages from the page cache by calling
> "page_cache_release()" before giving these pages back to the ring buffer ?
>
> 2 - or maybe is there a page flag we could specify when we allocate them to
> ask for these pages to never be put in the page cache ? (but they should
> still be usable as write buffers)
>
> 3 - is there something more we need to do to grab a reference on the pages
> before passing them to splice(), so that when we call page_cache_release()
> they don't get reclaimed ?
There is no guarantee it is the pagecache they end up in, it could be a
network packet queue, a pipe, or anything that implements .splice_write.
From what I understand of splice(), it assumes it passes ownership of the
pages: you're not supposed to touch them again, so none of the above three
are feasible.
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-17 22:42 ` Mathieu Desnoyers
From: Mathieu Desnoyers @ 2010-05-17 22:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Fri, 2010-05-14 at 14:32 -0400, Mathieu Desnoyers wrote:
>
> > [CCing memory management specialists]
>
> And yet you forgot Jens who wrote it ;-)
Oops! Thanks for adding him.
>
> > So I have three questions here:
> >
> > 1 - could we enforce removal of these pages from the page cache by calling
> > "page_cache_release()" before giving these pages back to the ring buffer ?
> >
> > 2 - or maybe is there a page flag we could specify when we allocate them to
> > ask for these pages to never be put in the page cache ? (but they should
> > still be usable as write buffers)
> >
> > 3 - is there something more we need to do to grab a reference on the pages
> > before passing them to splice(), so that when we call page_cache_release()
> > they don't get reclaimed ?
>
> There is no guarantee it is the pagecache they end up in, it could be a
> network packet queue, a pipe, or anything that implements .splice_write.
>
> From what I understand of splice(), it assumes it passes ownership of the
> pages: you're not supposed to touch them again, so none of the above three
> are feasible.
Yup, I've looked more deeply at the splice() code, and I now see why things
don't fall apart in LTTng currently. My implementation seems to be causing
splice() to perform a copy. My ring buffer splice implementation is derived from
kernel/relay.c. I override the pipe_buf_operations release op with:

static void ltt_relay_pipe_buf_release(struct pipe_inode_info *pipe,
				       struct pipe_buffer *pbuf)
{
	/* Intentionally empty: the ring buffer keeps its own reference. */
}

and the splice_pipe_desc spd_release op with:

static void ltt_relay_page_release(struct splice_pipe_desc *spd,
				   unsigned int i)
{
	/* Intentionally empty: pages are never handed over to the pipe. */
}
My understanding is that by keeping 2 references on the pages (the ring buffer +
the pipe), splice safely refuses to move the pages and performs a copy instead.
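As a user-space illustration of the copy-versus-move distinction: vmsplice() only
makes pages eligible for moving (stealing) out of a pipe when the caller gifts
them with SPLICE_F_GIFT; otherwise the kernel must treat the page as borrowed and
consumers fall back to copying. The helper below is my own sketch, not kernel
code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Push one full, page-aligned page into a pipe with vmsplice().
 * With SPLICE_F_GIFT the caller gives the page away, making a later
 * steal (true zero-copy) possible; without it the consumer has to
 * copy -- analogous to how the ring buffer's extra page reference
 * forces splice() to copy instead of move.
 */
static ssize_t gift_page_to_pipe(int pipe_wr, void *page, size_t len,
				 int gift)
{
	struct iovec iov = { .iov_base = page, .iov_len = len };

	return vmsplice(pipe_wr, &iov, 1, gift ? SPLICE_F_GIFT : 0);
}
```

Either way the caller cannot observe whether a move or a copy happened, which is
the point: the fallback is safe.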
I'll continue to look into this. One of the things I noticed is that we could
possibly use the "steal()" operation to steal the pages back from the page cache
to repopulate the ring buffer rather than continuously allocating new pages. If
steal() fails for some reason, we can fall back on page allocation. I'm not
sure it is safe to assume anything about pages being in the page cache though.
Maybe the safest route is to just allocate new pages for now.
Thoughts ?
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-18 12:19 ` Peter Zijlstra
From: Peter Zijlstra @ 2010-05-18 12:19 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
On Mon, 2010-05-17 at 18:42 -0400, Mathieu Desnoyers wrote:
> I'll continue to look into this. One of the things I noticed is that we could
> possibly use the "steal()" operation to steal the pages back from the page cache
> to repopulate the ring buffer rather than continuously allocating new pages. If
> steal() fails for some reason, we can fall back on page allocation. I'm not
> sure it is safe to assume anything about pages being in the page cache
> though.
Also, suppose it was still in the page-cache and still dirty, a steal()
would then punch a hole in the file.
> Maybe the safest route is to just allocate new pages for now.
Yes, that seems to be the only sane approach.
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-18 15:16 ` Mathieu Desnoyers
From: Mathieu Desnoyers @ 2010-05-18 15:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Mon, 2010-05-17 at 18:42 -0400, Mathieu Desnoyers wrote:
> > I'll continue to look into this. One of the things I noticed is that we could
> > possibly use the "steal()" operation to steal the pages back from the page cache
> > to repopulate the ring buffer rather than continuously allocating new pages. If
> > steal() fails for some reason, we can fall back on page allocation. I'm not
> > sure it is safe to assume anything about pages being in the page cache
> > though.
>
> Also, suppose it was still in the page-cache and still dirty, a steal()
> would then punch a hole in the file.
page_cache_pipe_buf_steal starts by doing a wait_on_page_writeback(page), and
then does a try_to_release_page(page, GFP_KERNEL). Only if that succeeds does
the steal succeed.
>
> > Maybe the safest route is to just allocate new pages for now.
>
> Yes, that seems to be the only sane approach.
Yes, a good start anyway.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-18 15:23 ` Peter Zijlstra
From: Peter Zijlstra @ 2010-05-18 15:23 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
On Tue, 2010-05-18 at 11:16 -0400, Mathieu Desnoyers wrote:
> > Also, suppose it was still in the page-cache and still dirty, a steal()
> > would then punch a hole in the file.
>
> page_cache_pipe_buf_steal starts by doing a wait_on_page_writeback(page), and
> then does a try_to_release_page(page, GFP_KERNEL). Only if that succeeds does
> the steal succeed.
If you're going to wait for writeback I don't really see the advantage
of stealing over simply allocating a new page.
* Re: [RFC] Tracer Ring Buffer splice() vs page cache [was: Re: Perf and ftrace [was Re: PyTimechart]]
@ 2010-05-18 15:43 ` Mathieu Desnoyers
From: Mathieu Desnoyers @ 2010-05-18 15:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, Frederic Weisbecker, Pierre Tardy, Ingo Molnar,
Arnaldo Carvalho de Melo, Tom Zanussi, Paul Mackerras,
linux-kernel, arjan, ziga.mahkovec, davem, linux-mm,
Andrew Morton, KOSAKI Motohiro, Christoph Lameter, Tejun Heo,
Jens Axboe
* Peter Zijlstra (peterz@infradead.org) wrote:
> On Tue, 2010-05-18 at 11:16 -0400, Mathieu Desnoyers wrote:
> > > Also, suppose it was still in the page-cache and still dirty, a steal()
> > > would then punch a hole in the file.
> >
> > page_cache_pipe_buf_steal starts by doing a wait_on_page_writeback(page), and
> > then does a try_to_release_page(page, GFP_KERNEL). Only if that succeeds does
> > the steal succeed.
>
> If you're going to wait for writeback I don't really see the advantage
> of stealing over simply allocating a new page.
That would allow the ring buffer to use a bounded amount of memory and avoid
polluting the page cache uselessly. If we allocate new pages as you propose, the
tracer will quickly fill and pollute the page cache with trace file pages, which
will have a large impact on I/O behavior. Yet in 99.9999% of use-cases, we never
need to access them again after they have been saved to disk.
By re-stealing its own pages after waiting for the writeback to complete, the
ring buffer would use a bounded amount of pages. If larger buffers are needed,
the user just has to specify a larger buffer size.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com