From: Steve Grubb
To: linux-audit@redhat.com
Subject: Re: Trying to understand audisp-remote network behavior
Date: Mon, 11 Jul 2022 18:25:59 -0400
Message-ID: <13478993.uLZWGnKmhe@x2>
Organization: Red Hat
In-Reply-To: <20220707040529.DFABD138817@pb-smtp1.pobox.com>
References: <20220707040529.DFABD138817@pb-smtp1.pobox.com>

Hello,

On Thursday, July 7, 2022 12:05:28 AM EDT Ken Hornstein wrote:
> So we've been struggling with getting audisp-remote working in a
> reliable manner.  In summary, it works but the networking seems fragile.
> We are using Kerberos authentication with audisp-remote, but that
> doesn't seem to be related to the fragility (sadly the Kerberos support
> does make it trivial to completely hang the server, but that's another
> issue).

Two weeks ago I wrote a model to go looking for certain kinds of problems in
the Kerberos code. The results were that it's probably leaking memory. And on
the client side, I don't think it was fully resetting all the Kerberos
variables on failure, which may be contributing to the problems.

> This is on RHEL 7 which ships with audit-2.8.5, but as far as
> I can tell the relevant code hasn't changed much from there to what
> is on GitHub.

There are differences. I'd trust the current code on GitHub more than the old
code.

> After staring at the code a lot and doing some experiments, here's what
> I believe to be true.  I'll gladly take corrections for anything I get
> wrong.
>
> - If a connection has _never_ been made successfully by audisp-remote,
>   it will retry the connection (in theory there's a limit to retries,
>   but that seems to be per-message; it will retry on every new message).
>   Fine, that seems reasonable.

In the GitHub code, the first connection should do unlimited retries.

> - If the connection is lost for almost any reason (see below), the
>   connection is never retried using the default configuration.  There might
>   be some corner cases where a retry can happen, but in my experience that
>   is rare.  Once it's gone, it never gets retried, and audit messages build
>   up until the queue overflows.

What to do in that case became a configuration item around 3.0.

> - In theory if a graceful shutdown is received by audisp-remote (either
>   a zero-length read or an "ENDING" audit message), then retries can
>   happen; this is indicated by the "remote_ended" flag in the code.

This would happen if, for example, the aggregating server needed to reboot.

>   But in my experience that is rare; during my experiments when I rebooted
>   our audit server that message was never sent (I guess the audit server
>   stop was received after the interfaces were shut down).
>   If the audit server crashes or you have a network failure, you end up
>   getting an error on a write and then the network is marked down and you
>   get into never-retry state.
>
> - If you turn on heartbeats via heartbeat_timeout, the network connection
>   _will_ retry when a heartbeat is sent.  However, the subtle issue here
>   is that a heartbeat is only sent when there are no incoming audit
>   messages within the heartbeat timeout.

It is advisable to use the heartbeat option. This way each end can detect that
the other has "disappeared" for some reason.

> The key issue seems to be in this part of the loop in main() (this section
> is entered when audisp-remote receives an audit record):
>
>     // See if input fd is also set
>     if (FD_ISSET(ifd, &rfd)) {
>         do {
>             if (remote_fgets(event, sizeof(event), ifd)) {
>                 if (!transport_ok && remote_ended &&
>                     (config.remote_ending_action == FA_RECONNECT ||
>                      !connected_once)) {
>                     quiet = 1;
>                     if (init_transport() == ET_SUCCESS) {
>                         remote_ended = 0;
>                         connected_once = 1;
>                     }
>                     quiet = 0;
>                 }
>
> In short, when a new audit record is received, init_transport()
> (which tries to connect to the audit server) is only called _IF_ the
> connection is down (transport_ok == 0) _and_ remote_ended is true _and_
> remote_ending_action is set to FA_RECONNECT (the default) _or_ there
> hasn't been at least one successful connection (connected_once == 0).
>
> The problem with that is at least in our environment remote_ended is
> never set to 1, so when the connection drops it is never retried, and
> there aren't any other entry points in the normal event loop that would
> ever cause the connection to retry.

I want to think this has been fixed in the current code. It is one of the
subtle changes since 2.8.5.

> The heartbeat code calls relay_event() directly (code that sends audit
> events normally calls send_one() which returns if transport_ok is false)
> and relay_event() calls either relay_sock_ascii() or relay_sock_managed()
> and those two functions will call init_transport() if the network
> connection is down.  But as mentioned above, you need to make sure that
> you try to send a heartbeat every so often; if you have a server generating
> audit messages constantly then there won't be a heartbeat if you set the
> heartbeat timeout too high.
>
> You _can_ get a network connection retry if you encounter an error
> inside of relay_sock_ascii() or relay_sock_managed(); I can't say
> that didn't happen with us, but it sure seemed like it wasn't sufficient
> and having the transport marked as failed was inevitable.
>
> So, I guess my questions are:
>
> - Is this all accurate?

It's been a long time since I did anything with 2.8.5. I'll take your word for
it.

> - Is this how it's SUPPOSED to be?  At least for us, network glitches
>   happen enough that most of our hosts ended up with overflowing
>   audisp-remote queues.  Setting the heartbeat timeout seems to have
>   resolved that (but it took a little experimentation to figure out
>   the right value).  It just seems surprising that it was easy to get
>   into a situation where you'd never retry a connection.

I know there are people on this list who are using it reliably in production.
But the problems were mostly worked out in the 3.0 release.

The Kerberos code is donated code. I have not personally tested it due to the
problems in setting up the infrastructure. But from my review two weeks ago,
it looks like it would have problems in any error situation. I committed some
updates today which should make krb5 support better.

The non-Kerberos code has been heavily tested. You might try that to see if it
works better. But if you are on the old code, there were problems fixed in the
3.0 release.
I think the people using it reliably are not using the krb5 code; they create
a VPN or SSH tunnel for encryption instead.

Best Regards,
-Steve