[PATCH] audit: add backlog high water mark metric


* [PATCH] audit: add backlog high water mark metric
@ 2026-03-23 15:07 Ricardo Robaina
  2026-03-23 16:48 ` Steve Grubb
  2026-04-10 21:34 ` Paul Moore
  0 siblings, 2 replies; 10+ messages in thread
From: Ricardo Robaina @ 2026-03-23 15:07 UTC (permalink / raw)
  To: audit, linux-kernel; +Cc: paul, eparis, sgrubb, Ricardo Robaina

Currently, determining the optimal `audit_backlog_limit` relies on
instantaneous polling of the queue size. This misses transient
micro-bursts, making it difficult for system administrators to know
if their queue is adequately sized or if they are at risk of
dropping events.

This patch introduces `backlog_max_depth`, a high-water mark metric
that tracks the maximum number of buffers in the audit queue since
the system was booted or the metric was last reset. To minimize
performance overhead in the fast-path, the metric is updated using
a lockless cmpxchg loop in `__audit_log_end()`.

Userspace can read-and-clear this metric by sending an `AUDIT_SET`
message with the `AUDIT_STATUS_BACKLOG_MAX_DEPTH` mask. To support
periodic telemetry polling (e.g., statsd, Prometheus), the reset
operation atomically returns the snapshot of the high-water mark
right before zeroing it, ensuring no peaks are lost between polls.
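
As a rough illustration of the update and read-and-clear semantics
described above, here is a minimal userspace sketch using C11 atomics.
It is not the kernel code: `max_depth`, `hwm_update()` and
`hwm_read_and_clear()` are hypothetical names, and
`atomic_compare_exchange_weak()` / `atomic_exchange()` stand in for the
kernel's `atomic_try_cmpxchg()` / `atomic_xchg()`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for audit_backlog_max_depth. */
static _Atomic uint32_t max_depth;

/* Raise the high-water mark to q_len if q_len exceeds the stored value. */
static void hwm_update(uint32_t q_len)
{
	uint32_t seen = atomic_load(&max_depth);

	/*
	 * On failure the compare-exchange writes the current value back
	 * into 'seen', so the loop re-tests q_len against whatever another
	 * thread stored and stops as soon as q_len is no longer larger.
	 */
	while (q_len > seen &&
	       !atomic_compare_exchange_weak(&max_depth, &seen, q_len))
		;
}

/* Return the recorded peak and reset it to zero in one atomic step. */
static uint32_t hwm_read_and_clear(void)
{
	return atomic_exchange(&max_depth, 0);
}
```

Because the reset is a single exchange, an update that races with a
poll lands either in the value returned to that poll or in the next
interval; it is not silently discarded.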

Link: https://github.com/linux-audit/audit-kernel/issues/63
Suggested-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
---
 include/linux/audit.h      |  3 ++-
 include/uapi/linux/audit.h |  2 ++
 kernel/audit.c             | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index d79218bf075a..53132b303c20 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -22,7 +22,8 @@
 			  AUDIT_STATUS_BACKLOG_LIMIT | \
 			  AUDIT_STATUS_BACKLOG_WAIT_TIME | \
 			  AUDIT_STATUS_LOST | \
-			  AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL)
+			  AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL | \
+			  AUDIT_STATUS_BACKLOG_MAX_DEPTH)
 
 #define AUDIT_INO_UNSET ((unsigned long)-1)
 #define AUDIT_DEV_UNSET ((dev_t)-1)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index e8f5ce677df7..862ca93c0c31 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -355,6 +355,7 @@ enum {
 #define AUDIT_STATUS_BACKLOG_WAIT_TIME		0x0020
 #define AUDIT_STATUS_LOST			0x0040
 #define AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL	0x0080
+#define AUDIT_STATUS_BACKLOG_MAX_DEPTH		0x0100
 
 #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT	0x00000001
 #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME	0x00000002
@@ -486,6 +487,7 @@ struct audit_status {
 	__u32           backlog_wait_time_actual;/* time spent waiting while
 						  * message limit exceeded
 						  */
+	__u32		backlog_max_depth; /* message queue max depth */
 };
 
 struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index e1d489bc2dff..256053cb6132 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -163,6 +163,9 @@ static struct sk_buff_head audit_retry_queue;
 /* queue msgs waiting for new auditd connection */
 static struct sk_buff_head audit_hold_queue;
 
+/* audit queue high water mark since last startup or reset */
+static atomic_t audit_backlog_max_depth __read_mostly = ATOMIC_INIT(0);
+
 /* queue servicing thread */
 static struct task_struct *kauditd_task;
 static DECLARE_WAIT_QUEUE_HEAD(kauditd_wait);
@@ -1286,6 +1289,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 		s.backlog		   = skb_queue_len(&audit_queue);
 		s.feature_bitmap	   = AUDIT_FEATURE_BITMAP_ALL;
 		s.backlog_wait_time	   = audit_backlog_wait_time;
+		s.backlog_max_depth	   = atomic_read(&audit_backlog_max_depth);
 		s.backlog_wait_time_actual = atomic_read(&audit_backlog_wait_time_actual);
 		audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
 		break;
@@ -1399,6 +1403,12 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
 			audit_log_config_change("backlog_wait_time_actual", 0, actual, 1);
 			return actual;
 		}
+		if (s.mask == AUDIT_STATUS_BACKLOG_MAX_DEPTH) {
+			u32 old_depth = atomic_xchg(&audit_backlog_max_depth, 0);
+
+			audit_log_config_change("backlog_max_depth", 0, old_depth, 1);
+			return old_depth;
+		}
 		break;
 	}
 	case AUDIT_GET_FEATURE:
@@ -2761,6 +2771,25 @@ int audit_signal_info(int sig, struct task_struct *t)
 	return audit_signal_info_syscall(t);
 }
 
+/*
+ * audit_update_backlog_max_depth - update the audit queue high water mark
+ *
+ * Safely updates the audit_backlog_max_depth metric using a lockless
+ * cmpxchg loop. This ensures the high-water mark is accurately tracked
+ * even when multiple CPUs are logging audit records concurrently.
+ */
+static inline void audit_update_backlog_max_depth(void)
+{
+	u32 q_len = skb_queue_len(&audit_queue);
+	u32 q_max = atomic_read(&audit_backlog_max_depth);
+
+	while (unlikely(q_len > q_max)) {
+		if (likely(atomic_try_cmpxchg(&audit_backlog_max_depth,
+					      &q_max, q_len)))
+			break;
+	}
+}
+
 /**
  * __audit_log_end - enqueue one audit record
  * @skb: the buffer to send
@@ -2777,6 +2806,9 @@ static void __audit_log_end(struct sk_buff *skb)
 
 		/* queue the netlink packet */
 		skb_queue_tail(&audit_queue, skb);
+
+		/* update backlog high water mark */
+		audit_update_backlog_max_depth();
 	} else {
 		audit_log_lost("rate limit exceeded");
 		kfree_skb(skb);
-- 
2.53.0



* Re: [PATCH] audit: add backlog high water mark metric
  2026-03-23 15:07 [PATCH] audit: add backlog high water mark metric Ricardo Robaina
@ 2026-03-23 16:48 ` Steve Grubb
  2026-04-10 21:34 ` Paul Moore
  1 sibling, 0 replies; 10+ messages in thread
From: Steve Grubb @ 2026-03-23 16:48 UTC (permalink / raw)
  To: audit, linux-kernel, Ricardo Robaina; +Cc: paul, eparis, Ricardo Robaina

On Monday, March 23, 2026 11:07:00 AM Eastern Daylight Time Ricardo Robaina 
wrote:
> Currently, determining the optimal `audit_backlog_limit` relies on
> instantaneous polling of the queue size. This misses transient
> micro-bursts, making it difficult for system administrators to know
> if their queue is adequately sized or if they are at risk of
> dropping events.
> 
> This patch introduces `backlog_max_depth`, a high-water mark metric
> that tracks the maximum number of buffers in the audit queue since
> the system was booted or the metric was last reset. To minimize
> performance overhead in the fast-path, the metric is updated using
> a lockless cmpxchg loop in `__audit_log_end()`.
> 
> Userspace can read-and-clear this metric by sending an `AUDIT_SET`
> message with the `AUDIT_STATUS_BACKLOG_MAX_DEPTH` mask. To support
> periodic telemetry polling (e.g., statsd, Prometheus), the reset
> operation atomically returns the snapshot of the high-water mark
> right before zeroing it, ensuring no peaks are lost between polls.

From a user space point of view, this looks good. User space support was
co-developed alongside this patch to ensure it works as advertised.

Acked-by: Steve Grubb <sgrubb@redhat.com>

-Steve

> Link: https://github.com/linux-audit/audit-kernel/issues/63
> Suggested-by: Steve Grubb <sgrubb@redhat.com>
> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
> ---
>  include/linux/audit.h      |  3 ++-
>  include/uapi/linux/audit.h |  2 ++
>  kernel/audit.c             | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index d79218bf075a..53132b303c20 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -22,7 +22,8 @@
>  			  AUDIT_STATUS_BACKLOG_LIMIT | \
>  			  AUDIT_STATUS_BACKLOG_WAIT_TIME | \
>  			  AUDIT_STATUS_LOST | \
> -			  AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL)
> +			  AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL | \
> +			  AUDIT_STATUS_BACKLOG_MAX_DEPTH)
> 
>  #define AUDIT_INO_UNSET ((unsigned long)-1)
>  #define AUDIT_DEV_UNSET ((dev_t)-1)
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index e8f5ce677df7..862ca93c0c31 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -355,6 +355,7 @@ enum {
>  #define AUDIT_STATUS_BACKLOG_WAIT_TIME		0x0020
>  #define AUDIT_STATUS_LOST			0x0040
>  #define AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL	0x0080
> +#define AUDIT_STATUS_BACKLOG_MAX_DEPTH		0x0100
> 
>  #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT	0x00000001
>  #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME	0x00000002
> @@ -486,6 +487,7 @@ struct audit_status {
>  	__u32           backlog_wait_time_actual;/* time spent waiting while
>  						  * message limit exceeded
>  						  */
> +	__u32		backlog_max_depth; /* message queue max depth */
>  };
> 
>  struct audit_features {
> diff --git a/kernel/audit.c b/kernel/audit.c
> index e1d489bc2dff..256053cb6132 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -163,6 +163,9 @@ static struct sk_buff_head audit_retry_queue;
>  /* queue msgs waiting for new auditd connection */
>  static struct sk_buff_head audit_hold_queue;
> 
> +/* audit queue high water mark since last startup or reset */
> +static atomic_t audit_backlog_max_depth __read_mostly = ATOMIC_INIT(0);
> +
>  /* queue servicing thread */
>  static struct task_struct *kauditd_task;
>  static DECLARE_WAIT_QUEUE_HEAD(kauditd_wait);
> @@ -1286,6 +1289,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
>  		s.backlog		   = skb_queue_len(&audit_queue);
>  		s.feature_bitmap	   = AUDIT_FEATURE_BITMAP_ALL;
>  		s.backlog_wait_time	   = audit_backlog_wait_time;
> +		s.backlog_max_depth	   = atomic_read(&audit_backlog_max_depth);
>  		s.backlog_wait_time_actual = atomic_read(&audit_backlog_wait_time_actual);
>  		audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
>  		break;
> @@ -1399,6 +1403,12 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
>  			audit_log_config_change("backlog_wait_time_actual", 0, actual, 1);
>  			return actual;
>  		}
> +		if (s.mask == AUDIT_STATUS_BACKLOG_MAX_DEPTH) {
> +			u32 old_depth = atomic_xchg(&audit_backlog_max_depth, 0);
> +
> +			audit_log_config_change("backlog_max_depth", 0, old_depth, 1);
> +			return old_depth;
> +		}
>  		break;
>  	}
>  	case AUDIT_GET_FEATURE:
> @@ -2761,6 +2771,25 @@ int audit_signal_info(int sig, struct task_struct *t)
>  	return audit_signal_info_syscall(t);
>  }
> 
> +/*
> + * audit_update_backlog_max_depth - update the audit queue high water mark
> + *
> + * Safely updates the audit_backlog_max_depth metric using a lockless
> + * cmpxchg loop. This ensures the high-water mark is accurately tracked
> + * even when multiple CPUs are logging audit records concurrently.
> + */
> +static inline void audit_update_backlog_max_depth(void)
> +{
> +	u32 q_len = skb_queue_len(&audit_queue);
> +	u32 q_max = atomic_read(&audit_backlog_max_depth);
> +
> +	while (unlikely(q_len > q_max)) {
> +		if (likely(atomic_try_cmpxchg(&audit_backlog_max_depth,
> +					      &q_max, q_len)))
> +			break;
> +	}
> +}
> +
>  /**
>   * __audit_log_end - enqueue one audit record
>   * @skb: the buffer to send
> @@ -2777,6 +2806,9 @@ static void __audit_log_end(struct sk_buff *skb)
> 
>  		/* queue the netlink packet */
>  		skb_queue_tail(&audit_queue, skb);
> +
> +		/* update backlog high water mark */
> +		audit_update_backlog_max_depth();
>  	} else {
>  		audit_log_lost("rate limit exceeded");
>  		kfree_skb(skb);






* Re: [PATCH] audit: add backlog high water mark metric
  2026-03-23 15:07 [PATCH] audit: add backlog high water mark metric Ricardo Robaina
  2026-03-23 16:48 ` Steve Grubb
@ 2026-04-10 21:34 ` Paul Moore
  2026-04-15  3:45   ` Steve Grubb
  1 sibling, 1 reply; 10+ messages in thread
From: Paul Moore @ 2026-04-10 21:34 UTC (permalink / raw)
  To: Ricardo Robaina; +Cc: audit, linux-kernel, eparis, sgrubb

On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
>
> Currently, determining the optimal `audit_backlog_limit` relies on
> instantaneous polling of the queue size. This misses transient
> micro-bursts, making it difficult for system administrators to know
> if their queue is adequately sized or if they are at risk of
> dropping events.
>
> This patch introduces `backlog_max_depth`, a high-water mark metric
> that tracks the maximum number of buffers in the audit queue since
> the system was booted or the metric was last reset. To minimize
> performance overhead in the fast-path, the metric is updated using
> a lockless cmpxchg loop in `__audit_log_end()`.
>
> Userspace can read-and-clear this metric by sending an `AUDIT_SET`
> message with the `AUDIT_STATUS_BACKLOG_MAX_DEPTH` mask. To support
> periodic telemetry polling (e.g., statsd, Prometheus), the reset
> operation atomically returns the snapshot of the high-water mark
> right before zeroing it, ensuring no peaks are lost between polls.
>
> Link: https://github.com/linux-audit/audit-kernel/issues/63
> Suggested-by: Steve Grubb <sgrubb@redhat.com>
> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
> ---
>  include/linux/audit.h      |  3 ++-
>  include/uapi/linux/audit.h |  2 ++
>  kernel/audit.c             | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 36 insertions(+), 1 deletion(-)

I sat on this for a bit because I wanted to think on it for a while.
While I agree audit could benefit from better statistics around
queue/backlog status, I'm not sure a single "max" value alone is worth
a bit in the audit_status bitmask.  My concern is that the max queue
length only provides a single snapshot of what the queue looked like,
it doesn't give any indication of the average queue length over a span
of time.  Some audit users are willing to live with occasional drops,
and the max size doesn't help them arrive at a good balance.

As for the users who can't tolerate any audit record drops?  They
shouldn't be running with a backlog limit anyway so the maximum queue
value will be of limited use.

-- 
paul-moore.com


* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-10 21:34 ` Paul Moore
@ 2026-04-15  3:45   ` Steve Grubb
  2026-04-15 15:19     ` Paul Moore
  0 siblings, 1 reply; 10+ messages in thread
From: Steve Grubb @ 2026-04-15  3:45 UTC (permalink / raw)
  To: Ricardo Robaina, Paul Moore; +Cc: audit, linux-kernel, eparis

Hello Paul,

On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> > Currently, determining the optimal `audit_backlog_limit` relies on
> > instantaneous polling of the queue size. This misses transient
> > micro-bursts, making it difficult for system administrators to know
> > if their queue is adequately sized or if they are at risk of
> > dropping events.
> > 
> > This patch introduces `backlog_max_depth`, a high-water mark metric
> > that tracks the maximum number of buffers in the audit queue since
> > the system was booted or the metric was last reset. To minimize
> > performance overhead in the fast-path, the metric is updated using
> > a lockless cmpxchg loop in `__audit_log_end()`.
> > 
> > Userspace can read-and-clear this metric by sending an `AUDIT_SET`
> > message with the `AUDIT_STATUS_BACKLOG_MAX_DEPTH` mask. To support
> > periodic telemetry polling (e.g., statsd, Prometheus), the reset
> > operation atomically returns the snapshot of the high-water mark
> > right before zeroing it, ensuring no peaks are lost between polls.
> > 
> > Link: https://github.com/linux-audit/audit-kernel/issues/63
> > Suggested-by: Steve Grubb <sgrubb@redhat.com>
> > Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
> > ---
> > 
> >  include/linux/audit.h      |  3 ++-
> >  include/uapi/linux/audit.h |  2 ++
> >  kernel/audit.c             | 32 ++++++++++++++++++++++++++++++++
> >  3 files changed, 36 insertions(+), 1 deletion(-)
> 
> I sat on this for a bit because I wanted to think on it for a while.
> While I agree audit could benefit from better statistics around
> queue/backlog status, I'm not sure a single "max" value alone is worth
> a bit in the audit_status bitmask.  My concern is that the max queue
> length only provides a single snapshot of what the queue looked like,
> it doesn't give any indication of the average queue length over a span
> of time.  Some audit users are willing to live with occasional drops,
> and the max size doesn't help them arrive at a good balance.
> 
> As for the users who can't tolerate any audit record drops?  They
> shouldn't be running with a backlog limit anyway so the maximum queue
> value will be of limited use.

The existing audit_lost counter tells administrators they have already 
failed; the proposed backlog_max_depth tells them they are at risk of 
failing. These are different signals serving different operational needs. The 
dominant real-world deployment — compliance-driven systems that must use a 
finite backlog limit for memory safety but cannot tolerate dropped events — 
has no existing mechanism to verify their limit is correctly sized between 
polling intervals. Instantaneous backlog polling is blind to sub-second 
bursts. Only a high-water mark, atomically reset at each poll, closes this 
gap. The average queue length would not answer the question 'did I ever come 
close to the limit?' — only the maximum can.
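
To make the difference between these signals concrete, here is a small
sketch that summarizes a made-up trace of queue depths. The trace values
and the names `depth_stats` and `summarize()` are hypothetical, chosen
only for illustration; they are not measurements and not part of the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Three views of the same interval of queue-depth observations. */
struct depth_stats {
	uint32_t last_sample;	/* what a poll at the end of the interval sees */
	uint32_t average;	/* integer mean over the interval */
	uint32_t peak;		/* high-water mark over the interval */
};

/* Summarize n depth observations; all fields are zero when n is zero. */
static struct depth_stats summarize(const uint32_t *depth, size_t n)
{
	struct depth_stats s = { 0, 0, 0 };
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		sum += depth[i];
		if (depth[i] > s.peak)
			s.peak = depth[i];
	}
	if (n > 0) {
		s.last_sample = depth[n - 1];
		s.average = (uint32_t)(sum / n);
	}
	return s;
}
```

For a trace such as {4, 6, 8100, 7, 5, 3, 4, 3}, the last sample is 3
and the integer mean is 1016, while the peak is 8100. Against an assumed
limit of 8192, only the peak shows how close the queue came to it.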

On the bitmask concern: the last addition was 
AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL, six years ago.

If you don't think this closes the gap on what people need, the patch could 
be amended to include backlog_lost_since_reset (drops since last poll)
alongside the max so that you get two metrics for the price of one bit. But 
this is absolutely needed because people are flying blind without it.

-Steve




* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-15  3:45   ` Steve Grubb
@ 2026-04-15 15:19     ` Paul Moore
  2026-04-15 15:21       ` Paul Moore
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Moore @ 2026-04-15 15:19 UTC (permalink / raw)
  To: Steve Grubb; +Cc: Ricardo Robaina, audit, linux-kernel, eparis

On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
>

...

> ... compliance-driven systems that must use a finite backlog limit for memory safety but cannot tolerate dropped events ...

You must pick one of those two requirements, or at the very least
prioritize them; it is simply impossible to both limit the backlog
queue and require zero dropped events.

-- 
paul-moore.com


* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-15 15:19     ` Paul Moore
@ 2026-04-15 15:21       ` Paul Moore
  2026-04-16 20:33         ` Steve Grubb
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Moore @ 2026-04-15 15:21 UTC (permalink / raw)
  To: Steve Grubb; +Cc: Ricardo Robaina, audit, linux-kernel, eparis

On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@paul-moore.com> wrote:
> On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> >
>
> ...
>
> > ... compliance-driven systems that must use a finite backlog limit for memory safety but cannot tolerate dropped events ...
>
> You must pick one of those two requirements, or at the very least
> prioritize them; it is simply impossible to both limit the backlog
> queue and require zero dropped events.

To be perfectly honest, it's also impossible to require zero dropped
events.  Even in the most extreme configurations where the admin
decides to panic the system, that only happens once the system reaches
the point where it is dropping events.  We try *really* hard to not
drop events, but it is always going to be a possibility.


-- 
paul-moore.com


* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-15 15:21       ` Paul Moore
@ 2026-04-16 20:33         ` Steve Grubb
  2026-04-16 20:51           ` Paul Moore
  0 siblings, 1 reply; 10+ messages in thread
From: Steve Grubb @ 2026-04-16 20:33 UTC (permalink / raw)
  To: Paul Moore; +Cc: Ricardo Robaina, audit, linux-kernel, eparis

On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore 
wrote:
> On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@paul-moore.com> wrote:
> > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> > ...
> > 
> > > ... compliance-driven systems that must use a finite backlog limit for
> > > memory safety but cannot tolerate dropped events ...
> > 
> > You must pick one of those two requirements, or at the very least
> > prioritize them; it is simply impossible to both limit the backlog
> > queue and require zero dropped events.
> 
> To be perfectly honest, it's also impossible to require zero dropped
> events.  Even in the most extreme configurations where the admin
> decides to panic the system, that only happens once the system reaches
> the point where it is dropping events.  We try *really* hard to not
> drop events, but it is always going to be a possibility.

You're helping make the point.  Those administrators have decided reliable 
auditing is more important than system availability. backlog_max_depth gives 
them the one thing they currently lack: advance warning. If the high-water 
mark is consistently approaching the backlog limit, they have actionable 
information to raise the limit, reduce audit rule coverage, or address the 
underlying load - before the system goes down. These are exactly the users 
who would benefit the most from this metric.

-Steve




* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-16 20:33         ` Steve Grubb
@ 2026-04-16 20:51           ` Paul Moore
  2026-04-16 20:58             ` Paul Moore
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Moore @ 2026-04-16 20:51 UTC (permalink / raw)
  To: Steve Grubb; +Cc: Ricardo Robaina, audit, linux-kernel, eparis

On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@redhat.com> wrote:
> On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore
> wrote:
> > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@paul-moore.com> wrote:
> > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> > > ...
> > >
> > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > memory safety but cannot tolerate dropped events ...
> > >
> > > You must pick one of those two requirements, or at the very least
> > > prioritize them; it is simply impossible to both limit the backlog
> > > queue and require zero dropped events.
> >
> > To be perfectly honest, it's also impossible to require zero dropped
> > events.  Even in the most extreme configurations where the admin
> > decides to panic the system, that only happens once the system reaches
> > the point where it is dropping events.  We try *really* hard to not
> > drop events, but it is always going to be a possibility.
>
> You're helping make the point.  Those administrators have decided reliable
> auditing is more important than system availability.

Users prioritizing reliable auditing over system availability should
not run with a backlog limit.  It's that simple.

Regardless, I'm still not convinced this maximum backlog stat alone
will solve any meaningful problems.  If your audit log is predictable
enough that this metric has value, it should be possible to either
capture the backlog size during periods of high audit load or simply
run the system through that load and verify it doesn't crash and go to
hell.  If your audit log isn't predictable, capturing a maximum
backlog size doesn't really mean anything since it is still a snapshot
of one instance of the system and there is always the possibility of
the system exceeding it.

-- 
paul-moore.com


* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-16 20:51           ` Paul Moore
@ 2026-04-16 20:58             ` Paul Moore
  2026-04-17 13:02               ` Ricardo Robaina
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Moore @ 2026-04-16 20:58 UTC (permalink / raw)
  To: Steve Grubb; +Cc: Ricardo Robaina, audit, linux-kernel, eparis

On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <paul@paul-moore.com> wrote:
> On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore
> > wrote:
> > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> > > > ...
> > > >
> > > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > > memory safety but cannot tolerate dropped events ...
> > > >
> > > > You must pick one of those two requirements, or at the very least
> > > > prioritize them; it is simply impossible to both limit the backlog
> > > > queue and require zero dropped events.
> > >
> > > To be perfectly honest, it's also impossible to require zero dropped
> > > events.  Even in the most extreme configurations where the admin
> > > decides to panic the system, that only happens once the system reaches
> > > the point where it is dropping events.  We try *really* hard to not
> > > drop events, but it is always going to be a possibility.
> >
> > You're helping make the point.  Those administrators have decided reliable
> > auditing is more important than system availability.
>
> Users prioritizing reliable auditing over system availability should
> not run with a backlog limit.  It's that simple.

To clarify this further, even on systems without a backlog limit and a
panic-on-loss configuration, there is still a possibility that the
system could lose an event when it hits the edge before it panics.  A
maximum backlog stat won't help here.  Even if you had a way to
capture the backlog size before the system took itself out, the size
is flirting with the maximum resource limits of the system, it would
be silly to use that as a configured backlog limit, you would still
want to leave the limit at 0/disabled.

> Regardless, I'm still not convinced this maximum backlog stat alone
> will solve any meaningful problems.  If your audit log is predictable
> enough that this metric has value, it should be possible to either
> capture the backlog size during periods of high audit load or simply
> run the system through that load and verify it doesn't crash and go to
> hell.  If your audit log isn't predictable, capturing a maximum
> backlog size doesn't really mean anything since it is still a snapshot
> of one instance of the system and there is always the possibility of
> the system exceeding it.

-- 
paul-moore.com


* Re: [PATCH] audit: add backlog high water mark metric
  2026-04-16 20:58             ` Paul Moore
@ 2026-04-17 13:02               ` Ricardo Robaina
  0 siblings, 0 replies; 10+ messages in thread
From: Ricardo Robaina @ 2026-04-17 13:02 UTC (permalink / raw)
  To: Paul Moore; +Cc: Steve Grubb, audit, linux-kernel, eparis

On Thu, Apr 16, 2026 at 5:58 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore
> > > wrote:
> > > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@paul-moore.com> wrote:
> > > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@redhat.com> wrote:
> > > > > ...
> > > > >
> > > > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > > > memory safety but cannot tolerate dropped events ...
> > > > >
> > > > > You must pick one of those two requirements, or at the very least
> > > > > prioritize them; it is simply impossible to both limit the backlog
> > > > > queue and require zero dropped events.
> > > >
> > > > To be perfectly honest, it's also impossible to require zero dropped
> > > > events.  Even in the most extreme configurations where the admin
> > > > decides to panic the system, that only happens once the system reaches
> > > > the point where it is dropping events.  We try *really* hard to not
> > > > drop events, but it is always going to be a possibility.
> > >
> > > You're helping make the point.  Those administrators have decided reliable
> > > auditing is more important than system availability.
> >
> > Users prioritizing reliable auditing over system availability should
> > not run with a backlog limit.  It's that simple.
>
> To clarify this further, even on systems without a backlog limit and a
> panic-on-loss configuration, there is still a possibility that the
> system could lose an event when it hits the edge before it panics.  A
> maximum backlog stat won't help here.  Even if you had a way to
> capture the backlog size before the system took itself out, the size
> is flirting with the maximum resource limits of the system, it would
> be silly to use that as a configured backlog limit, you would still
> want to leave the limit at 0/disabled.
>
> > Regardless, I'm still not convinced this maximum backlog stat alone
> > will solve any meaningful problems.  If your audit log is predictable
> > enough that this metric has value, it should be possible to either
> > capture the backlog size during periods of high audit load or simply
> > run the system through that load and verify it doesn't crash and go to
> > hell.  If your audit log isn't predictable, capturing a maximum
> > backlog size doesn't really mean anything since it is still a snapshot
> > of one instance of the system and there is always the possibility of
> > the system exceeding it.
>
> --
> paul-moore.com
>

Hi Paul,

Thanks for reviewing the patch and giving your perspective on it.

I understand your point that if a system truly prioritizes auditing
over everything else, it shouldn't run with a limit. However, in
practice, there is a middle ground where compliance frameworks or
internal infrastructure policies require a finite backlog limit to
ensure memory safety, while still demanding reliable auditing.

Currently, audit users are looking for a way to tune the system based
on an optimal setting for their workload that satisfies memory
constraints while practically minimizing dropped events to near-zero.
I strongly believe such users would make good use of backlog_max_depth
because it lets them know what the worst-case scenarios look like and
how big the spikes can be. This allows them to base their tuning
decisions on real data rather than guesswork, as is usually done
nowadays. Other than that, exposing such metrics would allow users to
leverage services like tuned to dynamically adjust limits based on
workload.

That being said, I hear your concern about whether a single "max"
value alone is worth consuming a bit in the audit_status bitmask. So,
I'd like to ask what specific metric or combination of metrics you
would be willing to consider? You mentioned average queue length
earlier, and Steve suggested combining the max depth with a
backlog_lost_since_reset counter. I'm happy to work on a v2 that
addresses your concerns while still delivering the metrics audit users
currently lack.

-Ricardo
