From mboxrd@z Thu Jan 1 00:00:00 1970 From: John Dennis Subject: Kernel audit output is inconsistent, hard to parse Date: Tue, 29 Jan 2008 17:56:36 -0500 Message-ID: <479FAF24.8050109@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from [172.16.80.36] (finch.boston.redhat.com [172.16.80.36]) by mail.boston.redhat.com (8.13.1/8.13.1) with ESMTP id m0TMuaVP003434 for ; Tue, 29 Jan 2008 17:56:37 -0500 List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: linux-audit-bounces@redhat.com Errors-To: linux-audit-bounces@redhat.com To: linux-audit@redhat.com List-Id: linux-audit@redhat.com The format of audit messages from the kernel is a mess. The bottom line is one cannot parse the audit messages without special case knowledge of each audit message because the data formatting does not follow any regular rules. I don't know how it got this way, but it really needs to be fixed. The primary offense is string formatting, specifically the use or non-use of the functions audit_log_hex() audit_log_untrustedstring(), and audit_log_n_untrustedstring(); depending on circumstances. The net result is a field value might be one of the following cases: 1) a string without quotes (maybe a string, maybe an int, etc.) 2) a string enclosed in quotes (implies a string with no escaped chars) 3) a string which is represented as a sequence of hex values (not enclosed in quotes, but how do you distinguish this from case 1?) Given the name=value formatting it is absolutely impossible to correctly interpret the value component unless you know how audit_log_format was invoked to generate the name=value pair. This will be dependent on the kernel version, the field name, and the audit record type. To be specific, during parsing only case 2 is unambiguous. You cannot determine between case 1 and case 3. Heuristics based on each character being in the hexadecimal character set fail for a significant subset of data, thus you don't know if the value is a string encoded in hexadecimal which needs to be decoded or a string which happens to be composed of hexadecimal characters but is not encoded. Thus we have the situation where to correctly parse the name=value pair one must know the audit record type, the field name, and the kernel version. That is just plain CRAZY and UNNECESSARY. Trying hide this logic in auparse is just a band-aid over the problem compounded by the fact auparse does not always get it right either. This is in conjunction with the fact auparse has no way to know the kernel version of the audit data it is attempting to parse (nor should it even have tables based on kernel version). The answer is to make the output parsable without special case knowledge. It would appear many of these problems were introduced with the functions audit_log_hex() audit_log_untrustedstring(), and audit_log_n_untrustedstring() which attempt to correct for a double quote, white space, or non-printable character in the output string. However these are not used uniformly nor do they follow any common approach for string representations in user land (why not?). All field values without exception need to be enclosed in quotes to delimit the value. Special characters inside the quotes need to be escaped, following some standard convention. Please, lets not invent a new encoding, this problem has already been solved elsewhere many times before! Also note the function audit_log_n_untrustedstring() in audit.c has a bug and ignores the len parameter (it iterates till it finds a NULL terminator even though it's supposed to stop after n chars). Suggested Fix: -------------- Most of these problems can easily be fixed if there is exactly one central place to format an audit field value. The function audit_log_vformat() could very easily ensure consistent formatting via % format specifiers in the format string, e.g.: audit_log_format("n=%d path=%s", n, path); Building audit output piecewise would be deprecated, e.g. these types of sequences would be eliminated. audit_log_format(ab, " n=%d", n); audit_log_format(ab, " name="); audit_log_foo(); and replaced with: audit_log_format(ab, " n=%d name=%s", foo_to_string(foo)); Whenever audit_log_vformat() encounters a % format specifier it formats to a string, then it converts the string to an escaped quoted string, and then inserts the escaped quoted string into the buffer (e.g. n="123" name="foo bar\n" ) This way the formatting is consistent, easy to apply, and is never special cased by the caller. There are no performance penalties of any note, calling a routine to escape only needs to be done when the format specifier is %s. Currently this is already done for a subset of output strings, so all we're doing is removing the responsibility for escaping from the caller and doing it consistently instead of in a subset of cases. I don't really care what the encoding is. I only care that it is an encoding with wide support. Backslash quoting is very popular, familiar and has many implementations. The MIME quoted-printable transfer encoding would be another option but might pose some problems with line endings. I think backslash quoting would be a good choice. I suspect everyone reading this message already knows exactly how to interpret a string with backslash escapes. Auparse is not the answer: -------------------------- Auparse is not the answer to irregular kernel audit message formatting. First of all it forces auparse to have special case logic which is not 100% robust and is tied to the kernel source code version. Second, in it's current implementation auparse confuses transfer decoding and substitution, two entirely different concepts needing to be applied in entirely different circumstances, but which have been conflated. auparse_get_field_str() returns the field value in it's encoded form, this is almost never of value to the caller. The caller wants the field value to be unencoded so it can operate on it. If you want the field value to be unencoded you have to call auparse_interpret_field(). But auparse_interpret_field() performs two distinctly different operations, it both decodes AND performs contextual substitution. Contextual substitution only has meaning when applied on the same host and at approximately the same time as when the audit record was generated. Contextual substitution is mainly of value for human readable output, it is difficult to utilize with automated machine processing. At the moment it is not possible to get a decoded value from auparse without it also performing undesired substitution. While we're at it: ------------------ If we do fix the format of audit messages we might as well fix some other inconsistencies at the same time. 1) The initial part of AVC messages do not follow the standard name=value formatting used everywhere else in audit. a) It includes the string "avc:" which is redundant with the audit record type (e.g. type=AVC), the string "avc:" should be removed, it serves no purpose and only makes parsing much harder because of the inconsistency. b) denied|granted are bare words without a field name, it should be seresult="denied", once again to avoid special case parsing. c) The list of operations are enclosed in curly braces {} without a field name, this should be seperms=xxx, where xxx is a list. The use of curly braces to encode a list in audit data is unique. We should define how any audit message should encode a list of values and use that consistently for all audit data. While one could define a syntax such as "[value1, value2]" or some such, it might be informative to look at how other transfer mechanisms such as structured markup and ldap handle this case. They both utilize the concept of multi-valued attributes. Thus there is no list structure, but an attribute is allowed to repeat itself and in the process implicitly creates a list of values for the attribute. Thus {read write} might be represented as seperms="read" seperms="write". This regularity makes parsing much easier, it avoids special case syntax. 2) (Note, this is not a kernel issue) The host data is currently prepended to the audit record with the format host=xxx. Is this an encoded string or not? It should be encoded and it should be encoded in exactly the same format as the name/value pairs in the audit records. The same holds true for the record type, it should follow the same syntax as every other name/value pair. 3) The string "audit(ssssss.mmmm:iiii):" is a critical delimiter, it separates record properties (e.g. host, type, timestamp) from record data, which must be a sequence of name="value" pairs. But the time stamp should really follow the name/value pair encoding used elsewhere. Desired syntax: --------------- Records consist of a sequence of name="value" pairs. Ordering of name/value pairs is significant for multi-valued attributes (i.e. where name appears more than once), insignificant otherwise. The value MUST be enclosed in double quotes with interior characters properly escaped. White space between name, '=', and "value" is insignificant and ignored. The audit record is partitioned into two parts a) record properties (i.e. host, record type, timestamp) b) record data The partition of properties and data occurs at a colon delimiter, i.e. properties : data The current formatting of the record timestamp (e.g. audit(ssss.mmm:iii) is inconsistent with all other name/value pairs. It should be "seconds="sss" milliseconds="mmm" serial="iii", this allows parsing to be regular and consistent. Thus an audit record with consistent syntax would look like this, where brackets [] indicate optional components: [host=""] type="" seconds="" milliseconds="" serial="" : name="" [name=""] What has to change and what's optional: --------------------------------------- The formatting of name/value pairs in the kernel must be fixed, it is simply impossible to correctly parse in it's current state. The rest of the suggested changes are syntactic sugar which would make parsing easier because of regular syntax, but they are not critical. We could retain the existing formats if backwards compatibility is felt to trump syntactic cleanliness and ease in parsing. It's a judgment call over when and how to introduce change and the anticipated impact. -- John Dennis