public inbox for linux-kernel@vger.kernel.org
From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>, Roland Dreier <rdreier@cisco.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	Paul Mackerras <paulus@samba.org>,
	Anton Blanchard <anton@samba.org>,
	general@lists.openfabrics.org, akpm@linux-foundation.org,
	torvalds@linux-foundation.org, Jeff Squyres <jsquyres@cisco.com>
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify
Date: Wed, 30 Sep 2009 10:02:32 -0600
Message-ID: <20090930160232.GZ22310@obsidianresearch.com>
In-Reply-To: <20090930094456.GD24621@elte.hu>

On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > OK.  It would be nice to tie into something more general, but I 
> > > think I agree -- perf counters are missing the filtering and the "no 
> > > lost events" that ummunotify does have. [...]
> 
> Performance events filtering is being worked on and now with the proper 
> non-DoS limit you've added you can lose events too, dont you? So it's 
> all a question of how much buffering to add - and with perf events too 
> you can buffer arbitrary large amount of events.

No, ummunotify does not lose events; that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory; if either
fails, the entire operation fails and userspace can do whatever it
normally does on memory allocation failure.

After that point the scheme is perfectly lossless.
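The pairing rule above can be sketched in userspace C. This is only an
illustration of the unwind-on-failure pattern, not the real verbs API:
the registration and unregistration functions are passed in as stubs,
and all names here are hypothetical.

```c
#include <stdlib.h>

/* Illustrative pair of handles standing in for an ibv_reg_mr()
 * result and its matching ummunotify watcher. */
struct reg_pair {
    void *mr;
    void *watch;
};

/* Register the MR and its watcher together.  If either allocation
 * fails, undo whatever succeeded so the two structures can never get
 * out of sync -- the caller sees one atomic failure.  After this
 * returns non-NULL, the scheme is lossless: no further allocation is
 * needed to report events for this region. */
static struct reg_pair *register_paired(void *(*reg_mr)(void),
                                        void *(*reg_watch)(void),
                                        void (*unreg)(void *))
{
    struct reg_pair *p = calloc(1, sizeof(*p));
    if (!p)
        return NULL;

    p->mr = reg_mr();
    if (!p->mr) {
        free(p);
        return NULL;
    }

    p->watch = reg_watch();
    if (!p->watch) {
        unreg(p->mr);           /* unwind the MR registration */
        free(p);
        return NULL;
    }
    return p;
}

/* Trivial stubs standing in for the real calls, for illustration. */
static void *fake_reg_mr(void)    { return malloc(1); }
static void *fake_reg_watch(void) { return malloc(1); }
static void fake_unreg(void *h)   { free(h); }
```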

Performance event filtering would use the same kind of kernel memory:
call ibv_reg_mr, then install a filter; both allocate kernel memory,
and if either fails the operation fails. But then, when the ring
buffer overflows, you've lost events.

All the tracing schemes are lossy, since they lose events when the
ring buffer fills up. So to do that we either need a recovery
scheme of some sort, or trace points that block.

So, here is a concrete proposal for how ummunotify could be absorbed
by perf events tracing, with filters:
- The filter expression must be able to trigger on a MMU event,
  triggering on the intersection of the MMU event address range and
  filter expression address range.
- The traces must be chosen so that there is exactly one filter
  expression per ibv_reg_mr region.
- Each filter has a clearable saturating counter that increments every
  time the filter matches an event.
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
     struct perf_filter_tag foo[100];
     int rc = perf_filters_read_and_clear_non_zero_counters(foo, 100);
- Optionally - the mmap ring would contain only 64 bit user space
  filter tags, not trace events.
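The read-and-clear semantics proposed above can be mocked entirely in
userspace to make the contract concrete. Everything here is
hypothetical (the filter table, the counter width, and the function
names are illustrative, not a real perf API): a filter hit saturates
rather than wraps, and reading copies out only the non-zero tags while
clearing them, so no hit can ever be silently dropped.

```c
#include <stdint.h>

#define NFILTERS    8
#define COUNTER_MAX UINT32_MAX

/* Mock filter table: each filter has a saturating hit counter and a
 * 64 bit user space assigned tag, as in the proposal above. */
struct mock_filter {
    uint32_t hits;  /* clearable saturating counter */
    uint64_t tag;   /* user space assigned tag */
};

static struct mock_filter filters[NFILTERS];

/* A filter match increments its counter but never wraps, so the
 * "at least one event happened" information is never lost. */
static void mock_filter_hit(int i)
{
    if (filters[i].hits < COUNTER_MAX)
        filters[i].hits++;
}

/* Mock of perf_filters_read_and_clear_non_zero_counters(): copy out
 * the tags of all filters with non-zero counters, clearing each as it
 * is reported, and return how many tags were copied. */
static int mock_read_and_clear(uint64_t *out, int max)
{
    int n = 0;

    for (int i = 0; i < NFILTERS && n < max; i++) {
        if (filters[i].hits != 0) {
            filters[i].hits = 0;
            out[n++] = filters[i].tag;
        }
    }
    return n;
}
```

Because the counter saturates instead of overflowing, an arbitrary
number of matches between two reads still shows up as exactly one
reported tag, which is all the consumer needs.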

This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:

 struct my_data *ptr = calloc(1, sizeof(*ptr));
 ptr->reg_handle = ibv_reg_mr(base, len);
 ptr->filter_handle = perf_filter_register("string matching base->len", ptr);

 [..]
 // fast path
 if (atomically(perf_map->head) != last_perf_map_head) {
     struct perf_filter_tag foo[100];
     int rc = perf_filters_read_and_clear_non_zero_counters(foo, 100);
     for (unsigned int i = 0; i != rc; i++)
         ((struct my_data *)foo[i])->invalid = 1;

     perf_empty_mmap_ring(perf_map);
 }

If the 'optionally' item is implemented then the app can trundle
through the mmap ring directly and only fall back to the syscall loop
above when the mmap ring overflows. That would be quite ideal.

It also must be guaranteed that when a trace point is hit the mmap
atomics are updated and visible to another user space thread before
the trace point returns - otherwise it is not synchronous enough and
will be racy.
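The visibility requirement can be illustrated with C11 atomics. This
is a sketch of the ordering only, not the perf mmap ABI: the names
and the ring layout are made up. The producer (playing the trace
point) must write the event data before publishing the new head with
release semantics, and the consumer must load the head with acquire
semantics, so that any thread that observes the advanced head is
guaranteed to also observe the event written before it.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 64

/* Hypothetical single-producer/single-consumer ring. */
struct mock_ring {
    uint64_t slots[RING_SLOTS];
    atomic_uint head;
};

/* "Trace point" side: write the event, then publish it.  The release
 * store orders the slot write before the head update, so the update
 * is visible to other threads before this function returns. */
static void producer_publish(struct mock_ring *r, uint64_t tag)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);

    r->slots[h % RING_SLOTS] = tag;        /* write the event... */
    atomic_store_explicit(&r->head, h + 1, /* ...then publish it */
                          memory_order_release);
}

/* Consumer side: the acquire load pairs with the release store, so
 * once a new head value is seen, the slot contents are valid too. */
static int consumer_poll(struct mock_ring *r, unsigned *last,
                         uint64_t *tag)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);

    if (h == *last)
        return 0;                          /* nothing new */
    *tag = r->slots[*last % RING_SLOTS];
    (*last)++;
    return 1;
}
```

Without the release/acquire pairing, the consumer could see the head
advance but read a stale slot, which is exactly the race the paragraph
above rules out.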

> A question: what is the typical size/scope of the rbtree of the watched 
> regions of memory in practical (test) deployments of the ummunofity 
> code?

Jeff can you comment?

IIRC it is many tens (hundreds?) of thousands of watches.

> Per tracepoint filtering is possible via the perf event patches Li Zefan 
> has posted to lkml recently, under this subject:

Performance of the filter add is probably a bit of a concern.

Regards,
Jason

