* Logging braindump
@ 2012-03-19 20:53 Tommi Virtanen
2012-03-22 11:26 ` Sage Weil
2012-03-22 16:38 ` Colin McCabe
0 siblings, 2 replies; 9+ messages in thread
From: Tommi Virtanen @ 2012-03-19 20:53 UTC (permalink / raw)
To: ceph-devel
So, we've been talking about an in-memory buffer that would contain
debug-level messages, and a separate thread/process [1] that would
write a subset of these to disk. Thus, on crash, we'd have a lot of
detail available (the ring buffer entries not yet overwritten),
without paying the IO/serialization/storage cost of dumping it all out
to disk.
[1] e.g. varnish uses an mmaped file for a ringbuffer, consumes it
from a difference process; that way, the log consumer cannot corrupt
the server process memory. Of course, this means log messages cannot
contain pointers to indirect data. And that means the buffer is not
made of constant size entries, like Disruptor tends to assume.. though
I think you could view Disruptor sequence numbers as byte offsets, if
you wanted.
RING BUFFER
For the ring buffer, we've been looking at Disruptor[2]-style
"consumer tries to catch up with a sequence number from producer"
design. As we have multiple producers (threads creating log entries),
the cost of creating a log entry would be a single cmpxchg op, and
then whatever work is needed to lay out the event in the ringbuffer.
http://martinfowler.com/articles/lmax.html
http://code.google.com/p/disruptor/
The in-memory data format could just use whatever data format is most
convenient. The ringbuffer could be an array tiny structs with the
base fields like thread id embedded there, and pointers to separately
allocated data for items that aren't always present. But this means we
need to be very careful about memory management; we want the data
pointed to to stay alive and unmodified until the producer loops
around the ringbuffer.
Alternatively, interpret Disruptor sequence numbers as byte offsets,
serialize message first, allocate that much space from ring buffer
(still just one cmpxchg). This pushes more of the work to the producer
of log messages, but avoids having an intermediate data format that
needs to be converted to another format, and simplifies memory
management tremendously.
DISK FORMAT
The process writing the events to disk should be more focused on
usefulness and longevity of the data. If the ring buffer is just
arrays with pointers, here we should take the performance hit to
convert to one of the known formats. I feel strongly in favor of
structured logging, as parsing bazillion log entries is slow, and
maintaining the parsing rules is actually harder than structured
logging in the first place.
The status quo is hoping to improve syslog, but there's so much
Enterprise in this stuff that I'm not holding my breath..
http://lwn.net/Articles/484731/ . Work that has come out includes the
structured syslog format below, CEE querying further down.
Some candidates:
- Scribe and Flume are pre-existing "log collectors" that emphasize a
DAG of log flow, lots of Java everywhere.. I'm not thrilled.
https://github.com/facebook/scribe https://cwiki.apache.org/FLUME/
- journald: I'm just going to pretend it doesn't exist, at least for 2
years: http://blog.gerhards.net/2011/11/journald-and-rsyslog.html
- syslog's structured logging extension:
http://tools.ietf.org/html/rfc5424#section-6.5
essentially, [key=value key2=val2] MESSAGE
<165>1 2003-10-11T22:14:15.003Z mymachine.example.com
evntslog - ID47 [exampleSDID@32473 iut="3" eventSource=
"Application" eventID="1011"] BOMAn application
event log entry...
- JSON: http://json.org/
{"key": "value", "key2": "val2", "message": "MESSAGE"}
the convention is lines of json, separated by newlines, each line is
a full event
biggest downside is numbers are always floats (need to stringify
large numbers), binary data must be encoded somehow into a utf-8
string (base64 is most common).
- GELF: compressed JSON with specific fields:
https://github.com/Graylog2/graylog2-docs/wiki/GELF
- Google Protocol Buffers: considered clumsy these days (code
generation from IDL etc); only Google has significant investment in
the format
- Thrift: considered clumsy these days (code generation from IDL etc);
only Facebook has significant investment in the format
- BSON: sort of close to binary encoding of JSON + extra data types,
not a huge improvement in speed/space.. http://bsonspec.org/
- Avro: Apache-sponspored data format, nicely self-describing,
apparently slow? http://avro.apache.org/
- MessagePack: binary encoding for JSON, claims to beat others in
speed.. http://msgpack.org/
And all of these can be compressed with e.g. Snappy as they flow to
disk. http://code.google.com/p/snappy/
Downside of just all but JSON: we'd need to bundle the library --
distro support just isn't there yet.
Should the disk format be binary? That makes it less friendly to the
admin. I'm not sure which way to go. JSON is simpler and friendlier,
e.g. MessagePack has identical data model but is faster and takes less
space. Some options:
a. make configurable so simple installations don't need to suffer binary logs
b. just pick one and stick with it
QUERYING / ANALYSIS
- use a format from above that is mapreduce-friedly, or can be 1:1
imported into another storage system
- software like Graylog may be of use, but I fear we'll overwhelm it
with events: http://graylog2.org/
- Cassandra's Brisk is a really easy way to run SQL-like Hive queries
over structured data, and has a design that'll ingest any amount of
data, Just Add Hardware(tm):
http://www.datastax.com/docs/0.8/brisk/index
- the standards process is churning out things like CEE, but I'm not
holding my breath: http://cee.mitre.org/
MY RECOMMENDATIONS [biased, as always ;-]
- bundle the MessagePack library
- in thread that calls log: serialize as MessagePack onto stack,
allocate needed bytes from ringbuffer, copy event to ringbuffer
- write to disk is now very simple, could even be done in a different
process (mmap header+ringbuffer)
- let disk files be named after timestamp they were started at, start
new ones based on time & size (no .0 -> .1 -> .2 renaming needed)
- make it really simple to process+delete chunks of log, feeding them
into Brisk or Graylog, then deleting from the node (perhaps after a
delay, so last 24h is locally browseable)
(and don't remove things that haven't been processed)
Hope that made sense. Let's talk more, especially if it didn't ;)
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: Logging braindump
2012-03-19 20:53 Logging braindump Tommi Virtanen
@ 2012-03-22 11:26 ` Sage Weil
2012-03-22 16:38 ` Colin McCabe
1 sibling, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-03-22 11:26 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: ceph-devel
This is well-timed.. I just started playing with this code on the flight
over here.
There are a couple pieces to what I've done so far:
- clearly segmented logging code into a new set of classes in src/log,
without dependencies on everything else
- set up config infrastructure to control 'log' and 'gather' levels for
different subsystems, where gather controls what messages are
generated, and log controls which of those reach disk (in non-crash
situation)
- made a relatively naive implementation that works (messages on heap,
separate thread to write out, dump recent history on crash).
- a lot of the weird interaction/bootstrapping stuff with contexts and
config now go away entirely if we just rely on cerr for errors during
config parsing, etc. (like that simple spinlock Greg was complaining
about).
> MY RECOMMENDATIONS [biased, as always ;-]
>
> - bundle the MessagePack library
I'm worried about requiring tools to read logs. And I think most of the
current overhead is the sync writes (which are easily addressed). But
string serialization is also a significant chunk, I think, so this may be
worth it... need to do some real tests.
> - in thread that calls log: serialize as MessagePack onto stack,
> allocate needed bytes from ringbuffer, copy event to ringbuffer
> - write to disk is now very simple, could even be done in a different
> process (mmap header+ringbuffer)
My naive implementation puts entries on heap and links them into a list:
can't easily extract from a core, but otherwise equivalent to a ring of
pointers. My limited tests showed std::string assignment had a huge
overhead, though, so I suspect writing a stringbuf implementation that
writes into cheaply preallocated buffers (maybe allocated off a large
preallocated ring buffer), optimized for common case (40-80 byte message).
But I need to read more of the Disruptor stuff to see what exactly they're
doing and why.
In any case, the new structure makes it easier to test implementations
given things are better contained.
> - let disk files be named after timestamp they were started at, start
> new ones based on time & size (no .0 -> .1 -> .2 renaming needed)
> - make it really simple to process+delete chunks of log, feeding them
> into Brisk or Graylog, then deleting from the node (perhaps after a
> delay, so last 24h is locally browseable)
> (and don't remove things that haven't been processed)
Currently that's done with rename, sighup, wait for new log to
appear, then process+delete. Are there problems there that make it worth
taking a non-standard approach to the log file naming?
Anyway, pushing what I have now to wip-log. The exposed interface is just
the macro in common/dout.h now. Alternative approaches will probably
change log/Entry.h and probably the allocation steps, but otherwise be
pretty simple to swap in/out.
The main piece that is sitll missing is an interface for logging
structured content. I suspect the way to do that is by putting stuff in
the ostream stream (ala std::setiosflags(foo) and their ilk) so that you
can easily stick key/value pairs in th code (... << log::kv("foo", "bar")
<< ... or something)? Hopefully that can reduce to something efficient,
at least, so we don't have to replace the current dout logging sites with
something completely different.
sage
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: Logging braindump
2012-03-19 20:53 Logging braindump Tommi Virtanen
2012-03-22 11:26 ` Sage Weil
@ 2012-03-22 16:38 ` Colin McCabe
2012-03-22 18:17 ` Mark Kampe
1 sibling, 1 reply; 9+ messages in thread
From: Colin McCabe @ 2012-03-22 16:38 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: ceph-devel
On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> [mmap'ed buffer discussion]
I always thought mmap'ed circular buffers were an elegant approach for
getting data that survived a process crash, but not paying the
overhead of write(2) and read(2). The main problem is that you need
special tools to read the circular buffer files off of the disk. As
Sage commented, that is probably undesirable for many users.
An mmap'ed buffer, even a lockless one, is a simple beast. Do you
really need a whole library just for that? Maybe I'm just
old-fashioned.
> DISK FORMAT
> - GELF: compressed JSON with specific fields:
> https://github.com/Graylog2/graylog2-docs/wiki/GELF
> - Google Protocol Buffers: considered clumsy these days (code
> generation from IDL etc); only Google has significant investment in
> the format
Hadoop also has a significant investment in Google Protocol Buffers.
It's almost the only thing we send over the wire these days. I don't
think this affects your decision about logging at all, though.
> - Thrift: considered clumsy these days (code generation from IDL etc);
> only Facebook has significant investment in the format
> - BSON: sort of close to binary encoding of JSON + extra data types,
> not a huge improvement in speed/space.. http://bsonspec.org/
> - Avro: Apache-sponspored data format, nicely self-describing,
> apparently slow? http://avro.apache.org/
> - MessagePack: binary encoding for JSON, claims to beat others in
> speed.. http://msgpack.org/
I'm not sure why you consider code generation from a schema file "clumsy."
To me, the whole point of structured logging is that
1. it is more efficient in terms of space/speed, partly because it
doesn't need to be self-describing, and
2. it is a stable ABI that binds the system together
The sticking point for most programmers is point #2. They don't want
to commit to a log format ahead of time because it slows down
development. (Pretty much the same reason Linus doesn't want a
stable in-kernel ABI.) In my opinion, that's the reason why
structured logging has not made much headway in the world at large.
Journald is an interesting experiment-- we'll see how many programmers
carefully design structured journald log messages, and how many just
output a single field called "text" or something similar. :)
If you "just" want better efficiency, there's a lot of ways to get
that which don't involve changing the Ceph logging APIs. Probably the
easiest one is just cutting down on hugely verbose (as in more than 80
character) log messages. There's no shortage of those.
For an unrelated project, I have been considering a system where all
logs are generated and go to a circular buffer, but are not sent to
permanent storage until/unless there is a crash, or the particular log
message type is enabled. It seems like something like this is a good
compromise between too much logging and not enough.
Colin
>
> And all of these can be compressed with e.g. Snappy as they flow to
> disk. http://code.google.com/p/snappy/
>
> Downside of just all but JSON: we'd need to bundle the library --
> distro support just isn't there yet.
>
> Should the disk format be binary? That makes it less friendly to the
> admin. I'm not sure which way to go. JSON is simpler and friendlier,
> e.g. MessagePack has identical data model but is faster and takes less
> space. Some options:
> a. make configurable so simple installations don't need to suffer binary logs
> b. just pick one and stick with it
>
>
>
> QUERYING / ANALYSIS
>
> - use a format from above that is mapreduce-friedly, or can be 1:1
> imported into another storage system
> - software like Graylog may be of use, but I fear we'll overwhelm it
> with events: http://graylog2.org/
> - Cassandra's Brisk is a really easy way to run SQL-like Hive queries
> over structured data, and has a design that'll ingest any amount of
> data, Just Add Hardware(tm):
> http://www.datastax.com/docs/0.8/brisk/index
> - the standards process is churning out things like CEE, but I'm not
> holding my breath: http://cee.mitre.org/
>
>
>
> MY RECOMMENDATIONS [biased, as always ;-]
>
> - bundle the MessagePack library
> - in thread that calls log: serialize as MessagePack onto stack,
> allocate needed bytes from ringbuffer, copy event to ringbuffer
> - write to disk is now very simple, could even be done in a different
> process (mmap header+ringbuffer)
> - let disk files be named after timestamp they were started at, start
> new ones based on time & size (no .0 -> .1 -> .2 renaming needed)
> - make it really simple to process+delete chunks of log, feeding them
> into Brisk or Graylog, then deleting from the node (perhaps after a
> delay, so last 24h is locally browseable)
> (and don't remove things that haven't been processed)
>
>
>
> Hope that made sense. Let's talk more, especially if it didn't ;)
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Logging braindump
2012-03-22 16:38 ` Colin McCabe
@ 2012-03-22 18:17 ` Mark Kampe
2012-03-22 18:29 ` Tommi Virtanen
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Mark Kampe @ 2012-03-22 18:17 UTC (permalink / raw)
To: Colin McCabe; +Cc: Tommi Virtanen, ceph-devel
On 03/22/12 09:38, Colin McCabe wrote:
> On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
> <tommi.virtanen@dreamhost.com> wrote:
>> [mmap'ed buffer discussion]
>
> I always thought mmap'ed circular buffers were an elegant approach for
> getting data that survived a process crash, but not paying the
> overhead of write(2) and read(2). The main problem is that you need
> special tools to read the circular buffer files off of the disk. As
> Sage commented, that is probably undesirable for many users.
(a) I actually favor not simply mmaping the circular buffer,
but having a program that pulls the data out of memory
and writes it to disk (ala Varnish). In addition to doing
huge writes (greatly reducing the write overhead), it can
filter what it processes, so that we have extensive logging
for the last few seconds, and more manageable logs on disk
extending farther back in time (modulo log rotation).
(b) The most interesting logs are probably the ones in coredumps
(that didn't make it out to disk) for which we want a
crawler/extractor anyway. It probably isn't very hard to
make the program that extracts logs from memory also be
able to pick the pockets of dead bodies (put a big self
identifying header on the front of each buffer).
Note also that having the ability to extract the logs from
a coredump pretty much eliminates any motivations to flush
log entries out to disk promptly/expensively. If the process
exits clealy, we'll get the logs. If the process produces
a coredump, we'll still get the logs.
(c) I have always loved text logs that I can directly view.
Their immediate and effortless accessibility encourages
their use, which encourages work in optimizing their content
(lots of the stuff you need, and little else).
But binary logs are less than half the size (cheaper to
take and keep twice as much info), and a program that
formats them can take arguments about which records/fields
you want and how you want them formatted ... and getting
the output the way you want it (whether for browsing or
subsequent reprocessing) is a huge win. You get used to
running the log processing command quickly, but the benefits
(d) If somebody really wants text logs for archival, it is completely
trivial to run the output of the log-extractor through the
formatter before writing it to disk ... so the in memory
format need not be tied to the on-disk format. The rotation
code won't care.
> An mmap'ed buffer, even a lockless one, is a simple beast. Do you
> really need a whole library just for that? Maybe I'm just
> old-fashioned.
IMHO, surprisingly few things involving large numbers of performance
critical threads turn out to be simple :-) For example:
If we are logging a lot, buffer management has the potential
to become a bottle-neck ... so we need to be able to allocate
a record of the required size from the circular buffer
with atomic instructions (at least in non-wrap situations).
But if records are allocated and then filled, we have to
consider how to handle the case where the filling is
delayed, and the reader catches up with an incomplete
log record (e.g. skip it, wait how long, ???).
And while we hope this will never happen, we have to deal
with what happens when the writer catches up with the
reader, or worse, an incomplete log block ... where we might
have to determine whether or not the owner is deceased (making
it safe to break his record lock) ... or should we simply take
down the service at that point (on the assumption that something
has gone very wrong).
If we are going to use multiple buffers, we may have to
do a transaction dance (last guy in has to close this
buffer to new writes, start a new one, and somebody has
to wait for pending additions to complete, queue this
one for delivery or perhaps even flush it to disk if we
don't have some other thread/process doing this).
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: Logging braindump
2012-03-22 18:17 ` Mark Kampe
@ 2012-03-22 18:29 ` Tommi Virtanen
2012-03-22 20:39 ` Sage Weil
2012-03-22 20:32 ` Sage Weil
2012-03-22 22:32 ` Colin McCabe
2 siblings, 1 reply; 9+ messages in thread
From: Tommi Virtanen @ 2012-03-22 18:29 UTC (permalink / raw)
To: Mark Kampe; +Cc: Colin McCabe, ceph-devel
On Thu, Mar 22, 2012 at 11:17, Mark Kampe <Mark.Kampe@dreamhost.com> wrote:
> (a) I actually favor not simply mmaping the circular buffer,
> but having a program that pulls the data out of memory
> and writes it to disk (ala Varnish). In addition to doing
> huge writes (greatly reducing the write overhead), it can
> filter what it processes, so that we have extensive logging
> for the last few seconds, and more manageable logs on disk
> extending farther back in time (modulo log rotation).
varnishlog communicates via a mmap'ed ringbuffer. These are not in conflict.
> (b) The most interesting logs are probably the ones in coredumps
> (that didn't make it out to disk) for which we want a
> crawler/extractor anyway. It probably isn't very hard to
> make the program that extracts logs from memory also be
> able to pick the pockets of dead bodies (put a big self
> identifying header on the front of each buffer).
A single memory area for this makes it easier.. that's why I like the
ringbuffer.
> If we are logging a lot, buffer management has the potential
> to become a bottle-neck ... so we need to be able to allocate
> a record of the required size from the circular buffer
> with atomic instructions (at least in non-wrap situations).
>
> But if records are allocated and then filled, we have to
> consider how to handle the case where the filling is
> delayed, and the reader catches up with an incomplete
> log record (e.g. skip it, wait how long, ???).
>
> And while we hope this will never happen, we have to deal
> with what happens when the writer catches up with the
> reader, or worse, an incomplete log block ... where we might
> have to determine whether or not the owner is deceased (making
> it safe to break his record lock) ... or should we simply take
> down the service at that point (on the assumption that something
> has gone very wrong).
The Disruptor design handles all these, is simple in the sense of
"that's what I would have built", and looks good.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Logging braindump
2012-03-22 18:29 ` Tommi Virtanen
@ 2012-03-22 20:39 ` Sage Weil
0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-03-22 20:39 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Mark Kampe, Colin McCabe, ceph-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2548 bytes --]
On Thu, 22 Mar 2012, Tommi Virtanen wrote:
> > If we are logging a lot, buffer management has the potential
> > to become a bottle-neck ... so we need to be able to allocate
> > a record of the required size from the circular buffer
> > with atomic instructions (at least in non-wrap situations).
> >
> > But if records are allocated and then filled, we have to
> > consider how to handle the case where the filling is
> > delayed, and the reader catches up with an incomplete
> > log record (e.g. skip it, wait how long, ???).
> >
> > And while we hope this will never happen, we have to deal
> > with what happens when the writer catches up with the
> > reader, or worse, an incomplete log block ... where we might
> > have to determine whether or not the owner is deceased (making
> > it safe to break his record lock) ... or should we simply take
> > down the service at that point (on the assumption that something
> > has gone very wrong).
>
> The Disruptor design handles all these, is simple in the sense of
> "that's what I would have built", and looks good.
My only problem with the disrupter stuff was that, as i was reading it, it
was very much like "yeah, given the limitations of Java, that's what you
would do," but we're in a slightly different boat. i.e., they use a
ringbuffer of pointers to preallocated objects.
My guess is that the best bet would be preallocated Entry objects (either
in a flat buffer or on the heap) with a preallocated per-entry buffer
(say, 80 chars) that will spill over into something slow/simple when
necessary. And something disrupter-like to claim slots in the ringbuffer.
But in any case, I think the key is first measuring how much time we spend
- rendering the current entries
- queueing each entry
under varying levels of concurrency. With the current code, for instance,
I think most time is spent convering crap into strings and waiting for a
blocking flush. We aren't logging millions of items, only hundreds.
--
Ignoring the nitty gitty of the log queueing, though... does the basic
framework make sense? That is,
- a set of predefined subsystems, each with their own log levels
- a level to control which entries are gathered/rendered, with a fast
contitional check (in, say, dout macro)
- a level to control which entries are logged d
- a dump on crash (or other event) of everything we have
?
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Logging braindump
2012-03-22 18:17 ` Mark Kampe
2012-03-22 18:29 ` Tommi Virtanen
@ 2012-03-22 20:32 ` Sage Weil
2012-03-27 17:48 ` Sage Weil
2012-03-22 22:32 ` Colin McCabe
2 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2012-03-22 20:32 UTC (permalink / raw)
To: Mark Kampe; +Cc: Colin McCabe, Tommi Virtanen, ceph-devel
On Thu, 22 Mar 2012, Mark Kampe wrote:
> On 03/22/12 09:38, Colin McCabe wrote:
> > On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
> > <tommi.virtanen@dreamhost.com> wrote:
> > > [mmap'ed buffer discussion]
> >
> > I always thought mmap'ed circular buffers were an elegant approach for
> > getting data that survived a process crash, but not paying the
> > overhead of write(2) and read(2). The main problem is that you need
> > special tools to read the circular buffer files off of the disk. As
> > Sage commented, that is probably undesirable for many users.
>
> (a) I actually favor not simply mmaping the circular buffer,
> but having a program that pulls the data out of memory
> and writes it to disk (ala Varnish). In addition to doing
> huge writes (greatly reducing the write overhead), it can
> filter what it processes, so that we have extensive logging
> for the last few seconds, and more manageable logs on disk
> extending farther back in time (modulo log rotation).
The code in wip-log does this (dumps high level messages unless/until
there is a reason to dump it all). Well, more importantly, it sets up the
infrastructure to control those logging levels (the actually logging code
is very short and simple). I don't see the value in putting this in a
separate process, though (as opposed to a separate thread); that just
seems like it would complicate ops...
> (b) The most interesting logs are probably the ones in coredumps
> (that didn't make it out to disk) for which we want a
> crawler/extractor anyway. It probably isn't very hard to
> make the program that extracts logs from memory also be
> able to pick the pockets of dead bodies (put a big self
> identifying header on the front of each buffer).
>
> Note also that having the ability to extract the logs from
> a coredump pretty much eliminates any motivations to flush
> log entries out to disk promptly/expensively. If the process
> exits clealy, we'll get the logs. If the process produces
> a coredump, we'll still get the logs.
This mostly seems like an argument to keep the in-memory layout simple, to
keep the post-mortem code simple. But at worst it's a
complexity/performance tradeoff... and, if we're lucky, there's no
difficult choice to make. Honestly, I don't think the overhead is where
we think it is. But need to do some real testing to see.
> (c) I have always loved text logs that I can directly view.
> Their immediate and effortless accessibility encourages
> their use, which encourages work in optimizing their content
> (lots of the stuff you need, and little else).
>
> But binary logs are less than half the size (cheaper to
> take and keep twice as much info), and a program that
> formats them can take arguments about which records/fields
> you want and how you want them formatted ... and getting
> the output the way you want it (whether for browsing or
> subsequent reprocessing) is a huge win. You get used to
> running the log processing command quickly, but the benefits
>
> (d) If somebody really wants text logs for archival, it is completely
> trivial to run the output of the log-extractor through the
> formatter before writing it to disk ... so the in memory
> format need not be tied to the on-disk format. The rotation
> code won't care.
Yeah. My hope is that we can find some way that makes structure
logging/annotation easy that is an easy transition from the current stuff.
We'll see. It may just be dout(n) << string vs dlog(n) << something
structured.. but we'll see.
> > An mmap'ed buffer, even a lockless one, is a simple beast. Do you
> > really need a whole library just for that? Maybe I'm just
> > old-fashioned.
>
> IMHO, surprisingly few things involving large numbers of performance
> critical threads turn out to be simple :-) For example:
>
> If we are logging a lot, buffer management has the potential
> to become a bottle-neck ... so we need to be able to allocate
> a record of the required size from the circular buffer
> with atomic instructions (at least in non-wrap situations).
>
> But if records are allocated and then filled, we have to
> consider how to handle the case where the filling is
> delayed, and the reader catches up with an incomplete
> log record (e.g. skip it, wait how long, ???).
>
> And while we hope this will never happen, we have to deal
> with what happens when the writer catches up with the
> reader, or worse, an incomplete log block ... where we might
> have to determine whether or not the owner is deceased (making
> it safe to break his record lock) ... or should we simply take
> down the service at that point (on the assumption that something
> has gone very wrong).
>
> If we are going to use multiple buffers, we may have to
> do a transaction dance (last guy in has to close this
> buffer to new writes, start a new one, and somebody has
> to wait for pending additions to complete, queue this
> one for delivery or perhaps even flush it to disk if we
> don't have some other thread/process doing this).
Honestly I don't think these will be too hard to solve, since we don't
care about how quickly logged data is written out... but we'll see! With
disruptor, for instance, latency is king, but in our case it's only
producer latency that matters, not consumer.
sage
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Logging braindump
2012-03-22 20:32 ` Sage Weil
@ 2012-03-27 17:48 ` Sage Weil
0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2012-03-27 17:48 UTC (permalink / raw)
To: ceph-devel
I did some rough benchmarking (N entries over M threads) on the current
code vs wip-log2 (my naive implementation), which
- allocates an Entry struct on the heap, with 80 chars preallocated
- you write into a streambuf into that, overflowing into a std::string on
the heap
- takes a mutex and adds it to a linked list
with an async thread to write it out.
For 1M lines, on a 16-core system, -O2,
threads old code new (total) new (queue) no queue
1 4.7 ~7.5 1.4 1.3
10 8.0 ~8 1.54 1.0
16 7.8 ~8 1.62 1.1
100 5.1 ~8 1.75 1.34
A few notes:
- The new (queue) is just the time to queue 1M events, and the total
includes a call to log.flush(). Normally that would happen in another
thread, of course, so it's only interesting as a rough measure of
"total work".
- Commenting out the mutex on submit_entry and one thread shaves off .1
seconds or so... about the same as the no 'queue case'.
- Preallocating entries instead of calling new only saves about .1
seconds.
- Using PreallocatedStreambuf instead of a ostringstream saves about .4
seconds in the single thread case (if we fit).
- Overflowing the buffer takes about .1 seconds longer.. still better
than the ostringstream. It depends a bit on _how_ much we overflow,
though.. it does powers of two over the original prealloc size, so we
may have to realloc multiple times. This is probably what
ostringstream is doing anyway.
My thoughts:
- I don't think we care too much about the concurrency issue (yet).
- I think we only care about queuing time, since that is what slows down
running code. Having an async thread write to disk will incur some
background load, but in reality we won't be writing 1M records in 10
seconds.
- That is especially true since we will probably collect lots of log data
that we never write.
- Most of the work so far adds infrastructure to control dual log levels
and to clean up the logging interface, making it easy to implement
different functionality. I think it's worth merging as-is.
- We'll want to tune the amount of space preallocated in each Entry.
Missing pieces:
- Need to throttle when writeout gets behind to cap total memory usage.
Pushing to wip-log branch for review.
sage
On Thu, 22 Mar 2012, Sage Weil wrote:
> On Thu, 22 Mar 2012, Mark Kampe wrote:
> > On 03/22/12 09:38, Colin McCabe wrote:
> > > On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen
> > > <tommi.virtanen@dreamhost.com> wrote:
> > > > [mmap'ed buffer discussion]
> > >
> > > I always thought mmap'ed circular buffers were an elegant approach for
> > > getting data that survived a process crash, but not paying the
> > > overhead of write(2) and read(2). The main problem is that you need
> > > special tools to read the circular buffer files off of the disk. As
> > > Sage commented, that is probably undesirable for many users.
> >
> > (a) I actually favor not simply mmaping the circular buffer,
> > but having a program that pulls the data out of memory
> > and writes it to disk (ala Varnish). In addition to doing
> > huge writes (greatly reducing the write overhead), it can
> > filter what it processes, so that we have extensive logging
> > for the last few seconds, and more manageable logs on disk
> > extending farther back in time (modulo log rotation).
>
> The code in wip-log does this (dumps high level messages unless/until
> there is a reason to dump it all). Well, more importantly, it sets up the
> infrastructure to control those logging levels (the actually logging code
> is very short and simple). I don't see the value in putting this in a
> separate process, though (as opposed to a separate thread); that just
> seems like it would complicate ops...
>
> > (b) The most interesting logs are probably the ones in coredumps
> > (that didn't make it out to disk) for which we want a
> > crawler/extractor anyway. It probably isn't very hard to
> > make the program that extracts logs from memory also be
> > able to pick the pockets of dead bodies (put a big self
> > identifying header on the front of each buffer).
> >
> > Note also that having the ability to extract the logs from
> > a coredump pretty much eliminates any motivations to flush
> > log entries out to disk promptly/expensively. If the process
> > exits clealy, we'll get the logs. If the process produces
> > a coredump, we'll still get the logs.
>
> This mostly seems like an argument to keep the in-memory layout simple, to
> keep the post-mortem code simple. But at worst it's a
> complexity/performance tradeoff... and, if we're lucky, there's no
> difficult choice to make. Honestly, I don't think the overhead is where
> we think it is. But need to do some real testing to see.
>
> > (c) I have always loved text logs that I can directly view.
> > Their immediate and effortless accessibility encourages
> > their use, which encourages work in optimizing their content
> > (lots of the stuff you need, and little else).
> >
> > But binary logs are less than half the size (cheaper to
> > take and keep twice as much info), and a program that
> > formats them can take arguments about which records/fields
> > you want and how you want them formatted ... and getting
> > the output the way you want it (whether for browsing or
> > subsequent reprocessing) is a huge win. You get used to
> > running the log processing command quickly, but the benefits
> >
> > (d) If somebody really wants text logs for archival, it is completely
> > trivial to run the output of the log-extractor through the
> > formatter before writing it to disk ... so the in memory
> > format need not be tied to the on-disk format. The rotation
> > code won't care.
>
> Yeah. My hope is that we can find some way that makes structure
> logging/annotation easy that is an easy transition from the current stuff.
> We'll see. It may just be dout(n) << string vs dlog(n) << something
> structured.. but we'll see.
>
> > > An mmap'ed buffer, even a lockless one, is a simple beast. Do you
> > > really need a whole library just for that? Maybe I'm just
> > > old-fashioned.
> >
> > IMHO, surprisingly few things involving large numbers of performance
> > critical threads turn out to be simple :-) For example:
> >
> > If we are logging a lot, buffer management has the potential
> > to become a bottle-neck ... so we need to be able to allocate
> > a record of the required size from the circular buffer
> > with atomic instructions (at least in non-wrap situations).
> >
> > But if records are allocated and then filled, we have to
> > consider how to handle the case where the filling is
> > delayed, and the reader catches up with an incomplete
> > log record (e.g. skip it, wait how long, ???).
> >
> > And while we hope this will never happen, we have to deal
> > with what happens when the writer catches up with the
> > reader, or worse, an incomplete log block ... where we might
> > have to determine whether or not the owner is deceased (making
> > it safe to break his record lock) ... or should we simply take
> > down the service at that point (on the assumption that something
> > has gone very wrong).
> >
> > If we are going to use multiple buffers, we may have to
> > do a transaction dance (last guy in has to close this
> > buffer to new writes, start a new one, and somebody has
> > to wait for pending additions to complete, queue this
> > one for delivery or perhaps even flush it to disk if we
> > don't have some other thread/process doing this).
>
> Honestly I don't think these will be too hard to solve, since we don't
> care about how quickly logged data is written out... but we'll see! With
> disruptor, for instance, latency is king, but in our case it's only
> producer latency that matters, not consumer.
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Logging braindump
2012-03-22 18:17 ` Mark Kampe
2012-03-22 18:29 ` Tommi Virtanen
2012-03-22 20:32 ` Sage Weil
@ 2012-03-22 22:32 ` Colin McCabe
2 siblings, 0 replies; 9+ messages in thread
From: Colin McCabe @ 2012-03-22 22:32 UTC (permalink / raw)
To: Mark Kampe; +Cc: Tommi Virtanen, ceph-devel
On Thu, Mar 22, 2012 at 11:17 AM, Mark Kampe <Mark.Kampe@dreamhost.com> wrote:
> On 03/22/12 09:38, Colin McCabe wrote:
>> An mmap'ed buffer, even a lockless one, is a simple beast. Do you
>> really need a whole library just for that? Maybe I'm just
>> old-fashioned.
>
>
> IMHO, surprisingly few things involving large numbers of performance
> critical threads turn out to be simple :-) For example:
>
> If we are logging a lot, buffer management has the potential
> to become a bottle-neck ... so we need to be able to allocate
> a record of the required size from the circular buffer
> with atomic instructions (at least in non-wrap situations).
>
> But if records are allocated and then filled, we have to
> consider how to handle the case where the filling is
> delayed, and the reader catches up with an incomplete
> log record (e.g. skip it, wait how long, ???).
The easiest solution is to store pointers in the circular buffer
rather than directly writing to it. Another solution would be epoch
numbers that increment each time the log buffer wraps around.
C.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-03-27 17:48 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-19 20:53 Logging braindump Tommi Virtanen
2012-03-22 11:26 ` Sage Weil
2012-03-22 16:38 ` Colin McCabe
2012-03-22 18:17 ` Mark Kampe
2012-03-22 18:29 ` Tommi Virtanen
2012-03-22 20:39 ` Sage Weil
2012-03-22 20:32 ` Sage Weil
2012-03-27 17:48 ` Sage Weil
2012-03-22 22:32 ` Colin McCabe
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.