From mboxrd@z Thu Jan 1 00:00:00 1970 From: Linda Knippers Subject: Re: Kernel audit output is inconsistent, hard to parse Date: Thu, 31 Jan 2008 16:11:38 -0500 Message-ID: <47A2398A.2050309@hp.com> References: <479FAF24.8050109@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <479FAF24.8050109@redhat.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: linux-audit-bounces@redhat.com Errors-To: linux-audit-bounces@redhat.com To: John Dennis Cc: linux-audit@redhat.com List-Id: linux-audit@redhat.com At one time we talked about converting to a binary record format. Maybe some structure would be helpful in defining and enforcing rules for what information in the records can/should be and what it all means. -- ljk John Dennis wrote: > The format of audit messages from the kernel is a mess. The > bottom line is one cannot parse the audit messages without special > case knowledge of each audit message because the data formatting does > not follow any regular rules. > > I don't know how it got this way, but it really needs to be fixed. > > The primary offense is string formatting, specifically the use or > non-use of the functions audit_log_hex() audit_log_untrustedstring(), > and audit_log_n_untrustedstring(); depending on circumstances. > > The net result is a field value might be one of the following cases: > > 1) a string without quotes (maybe a string, maybe an int, etc.) > > 2) a string enclosed in quotes (implies a string with no escaped chars) > > 3) a string which is represented as a sequence of hex values (not > enclosed in quotes, but how do you distinguish this from case 1?) > > Given the name=value formatting it is absolutely impossible to > correctly interpret the value component unless you know how > audit_log_format was invoked to generate the name=value pair. This > will be dependent on the kernel version, the field name, and the audit > record type. To be specific, during parsing only case 2 is > unambiguous. You cannot determine between case 1 and case > 3. Heuristics based on each character being in the hexadecimal > character set fail for a significant subset of data, thus you don't > know if the value is a string encoded in hexadecimal which needs to be > decoded or a string which happens to be composed of hexadecimal > characters but is not encoded. > > Thus we have the situation where to correctly parse the name=value > pair one must know the audit record type, the field name, and the > kernel version. That is just plain CRAZY and UNNECESSARY. Trying hide > this logic in auparse is just a band-aid over the problem compounded > by the fact auparse does not always get it right either. This is in > conjunction with the fact auparse has no way to know the kernel > version of the audit data it is attempting to parse (nor should it > even have tables based on kernel version). > > The answer is to make the output parsable without special case > knowledge. It would appear many of these problems were introduced with > the functions audit_log_hex() audit_log_untrustedstring(), and > audit_log_n_untrustedstring() which attempt to correct for a double > quote, white space, or non-printable character in the output > string. However these are not used uniformly nor do they follow any > common approach for string representations in user land (why not?). > > All field values without exception need to be enclosed in quotes to > delimit the value. Special characters inside the quotes need to > be escaped, following some standard convention. Please, lets not > invent a new encoding, this problem has already been solved > elsewhere many times before! > > Also note the function audit_log_n_untrustedstring() in audit.c has a > bug and ignores the len parameter (it iterates till it finds a NULL > terminator even though it's supposed to stop after n chars). > > Suggested Fix: > -------------- > > Most of these problems can easily be fixed if there is exactly one > central place to format an audit field value. The function > audit_log_vformat() could very easily ensure consistent formatting via > % format specifiers in the format string, e.g.: > > audit_log_format("n=%d path=%s", n, path); > > Building audit output piecewise would be deprecated, e.g. these types > of sequences would be eliminated. > > audit_log_format(ab, " n=%d", n); > audit_log_format(ab, " name="); > audit_log_foo(); > > and replaced with: > > audit_log_format(ab, " n=%d name=%s", foo_to_string(foo)); > > Whenever audit_log_vformat() encounters a % format specifier it > formats to a string, then it converts the string to an escaped quoted > string, and then inserts the escaped quoted string into the buffer > (e.g. n="123" name="foo bar\n" ) > > This way the formatting is consistent, easy to apply, and is never > special cased by the caller. > > There are no performance penalties of any note, calling a routine to > escape only needs to be done when the format specifier is %s. Currently > this is already done for a subset of output strings, so all we're doing > is removing the responsibility for escaping from the caller and doing > it consistently instead of in a subset of cases. > > I don't really care what the encoding is. I only care that it is an > encoding with wide support. Backslash quoting is very popular, > familiar and has many implementations. The MIME quoted-printable > transfer encoding would be another option but might pose some problems > with line endings. I think backslash quoting would be a good choice. I > suspect everyone reading this message already knows exactly how to > interpret a string with backslash escapes. > > Auparse is not the answer: > -------------------------- > > Auparse is not the answer to irregular kernel audit message > formatting. First of all it forces auparse to have special case logic > which is not 100% robust and is tied to the kernel source code > version. > > Second, in it's current implementation auparse confuses transfer > decoding and substitution, two entirely different concepts needing to > be applied in entirely different circumstances, but which have been > conflated. > > auparse_get_field_str() returns the field value in it's encoded form, > this is almost never of value to the caller. The caller wants the > field value to be unencoded so it can operate on it. If you want the > field value to be unencoded you have to call > auparse_interpret_field(). But auparse_interpret_field() performs two > distinctly different operations, it both decodes AND performs > contextual substitution. Contextual substitution only has meaning when > applied on the same host and at approximately the same time as when > the audit record was generated. Contextual substitution is mainly of > value for human readable output, it is difficult to utilize with > automated machine processing. At the moment it is not possible to get > a decoded value from auparse without it also performing undesired > substitution. > > While we're at it: > ------------------ > > If we do fix the format of audit messages we might as well fix some > other inconsistencies at the same time. > > 1) The initial part of AVC messages do not follow the standard > name=value formatting used everywhere else in audit. > > a) It includes the string "avc:" which is redundant with the audit > record type (e.g. type=AVC), the string "avc:" should be removed, > it serves no purpose and only makes parsing much harder because of > the inconsistency. > > b) denied|granted are bare words without a field name, it should be > seresult="denied", once again to avoid special case parsing. > > c) The list of operations are enclosed in curly braces {} without a > field name, this should be seperms=xxx, where xxx is a list. The > use of curly braces to encode a list in audit data is unique. We > should define how any audit message should encode a list of values > and use that consistently for all audit data. While one could > define a syntax such as "[value1, value2]" or some such, it might > be informative to look at how other transfer mechanisms such as > structured markup and ldap handle this case. They both utilize the > concept of multi-valued attributes. Thus there is no list > structure, but an attribute is allowed to repeat itself and in the > process implicitly creates a list of values for the attribute. Thus > {read write} might be represented as seperms="read" > seperms="write". This regularity makes parsing much easier, it > avoids special case syntax. > > 2) (Note, this is not a kernel issue) The host data is currently > prepended to the audit record with the format host=xxx. Is this an > encoded string or not? It should be encoded and it should be > encoded in exactly the same format as the name/value pairs in the > audit records. The same holds true for the record type, it should > follow the same syntax as every other name/value pair. > > 3) The string "audit(ssssss.mmmm:iiii):" is a critical delimiter, it > separates record properties (e.g. host, type, timestamp) from > record data, which must be a sequence of name="value" pairs. But > the time stamp should really follow the name/value pair encoding > used elsewhere. > > Desired syntax: > --------------- > > Records consist of a sequence of name="value" pairs. > > Ordering of name/value pairs is significant for multi-valued > attributes (i.e. where name appears more than once), insignificant > otherwise. > > The value MUST be enclosed in double quotes with interior characters > properly escaped. > > White space between name, '=', and "value" is insignificant and > ignored. > > The audit record is partitioned into two parts > > a) record properties (i.e. host, record type, timestamp) > > b) record data > > The partition of properties and data occurs at a colon delimiter, > i.e. properties : data > > The current formatting of the record timestamp > (e.g. audit(ssss.mmm:iii) is inconsistent with > all other name/value pairs. It should be "seconds="sss" > milliseconds="mmm" serial="iii", this allows parsing to be regular and > consistent. > > Thus an audit record with consistent syntax would look like this, > where brackets [] indicate optional components: > > [host=""] type="" seconds="" milliseconds="" serial="" : name="" [name=""] > > What has to change and what's optional: > --------------------------------------- > > The formatting of name/value pairs in the kernel must be fixed, it is > simply impossible to correctly parse in it's current state. > > The rest of the suggested changes are syntactic sugar which would make > parsing easier because of regular syntax, but they are not > critical. We could retain the existing formats if backwards > compatibility is felt to trump syntactic cleanliness and ease in > parsing. It's a judgment call over when and how to introduce change > and the anticipated impact. > >