public inbox for linux-audit@redhat.com
* reactive audit question
@ 2010-11-12 17:24 LC Bruzenak
  2010-11-19 16:20 ` Steve Grubb
  0 siblings, 1 reply; 3+ messages in thread
From: LC Bruzenak @ 2010-11-12 17:24 UTC (permalink / raw)
  To: Linux Audit

Steve, others,

I may have asked this before, but it is becoming an issue so I thought
I'd check again anyway.

In our systems there are occasionally AVC "storms" which happen as a
result of some unforeseen (and often unknown) issue tickled by any of
various triggers.

At fielded sites, we are unable to fix this easily. Since we have to
keep all the audit data, this leads to many problems on a system running
over a weekend, for example, with no administrators around.

I probably need to add in either some rate-limiting code or possibly
kill off the process generating the AVCs. Rate-limiting I'd guess could
go into the auditd. If I wanted to be more brutal and kill the process,
I'd think maybe a modification to the setroubleshoot code would be
workable.

I don't think that a reactive rule is an option -
1) We have our rules locked into the kernel on startup and I'm against
changing that, and
2) below some threshold we'd still want to see "normal" AVC counts
from that same process. Besides,
3) unless the rules have been changed, we cannot exclude AVCs from a
particular type/process anyway.

Got any thoughts/ideas/advice?

Thx,
LCB

-- 
LC (Lenny) Bruzenak
lenny@magitekltd.com


* Re: reactive audit question
  2010-11-12 17:24 reactive audit question LC Bruzenak
@ 2010-11-19 16:20 ` Steve Grubb
  2010-11-19 18:05   ` LC Bruzenak
  0 siblings, 1 reply; 3+ messages in thread
From: Steve Grubb @ 2010-11-19 16:20 UTC (permalink / raw)
  To: linux-audit

Hi Lenny,

On Friday, November 12, 2010 12:24:43 pm LC Bruzenak wrote:
> In our systems there are occasionally AVC "storms" which happen as a
> result of some unforeseen (and often unknown) issue tickled by any of
> various triggers.
> 
> At fielded sites, we are unable to fix this easily. Since we have to
> keep all the audit data, this leads to many problems on a system running
> over a weekend, for example, with no administrators around.
> 
> I probably need to add in either some rate-limiting code or possibly
> kill off the process generating the AVCs. Rate-limiting I'd guess could
> go into the auditd. If I wanted to be more brutal and kill the process,
> I'd think maybe a modification to the setroubleshoot code would be
> workable.

I didn't answer right away because I didn't have a good answer for you. If the storm 
is large enough to overrun the kernel queue, the rate limiting needs to be in the 
kernel. If auditd is able to handle the load, then perhaps you need an analysis plugin 
that performs whatever action you deem best.
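
Such an analysis plugin could be as simple as a sliding-window rate
check over the events audispd hands it on stdin. A rough Python sketch
(the RateTracker name, the threshold/window values, and the action
taken are all illustrative assumptions, not a finished plugin):

```python
import sys
import time
from collections import deque

class RateTracker:
    """Track event timestamps in a sliding window and flag bursts."""

    def __init__(self, threshold, window=1.0):
        self.threshold = threshold   # max events allowed per window
        self.window = window         # window length in seconds
        self.stamps = deque()

    def hit(self, now=None):
        """Record one event; return True once the rate exceeds the threshold."""
        now = time.monotonic() if now is None else now
        self.stamps.append(now)
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] > self.window:
            self.stamps.popleft()
        return len(self.stamps) > self.threshold

if __name__ == "__main__":
    tracker = RateTracker(threshold=100)   # illustrative limit
    for line in sys.stdin:                 # audispd feeds events on stdin
        if "type=AVC" in line and tracker.hit():
            sys.stderr.write("AVC burst detected; take action here\n")
```

The windowed count keeps memory bounded no matter how long the storm
lasts, which matters for a plugin meant to survive a weekend unattended.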

 
> I don't think that a reactive rule is an option -
> 1) We have our rules locked into the kernel on startup and I'm against
> changing that, and
> 2) below some threshold we'd still want to see "normal" AVC counts
> from that same process. Besides,
> 3) unless the rules have been changed, we cannot exclude AVCs from a
> particular type/process anyway.
> 
> Got any thoughts/ideas/advice?

What is the general source of the problem right now? Was it just that the app was 
doing something that policy didn't know it could do? Or were there attacks under way, 
with someone trying something bad? Or was it just an admin mistake where 
something didn't have the right label? Each of these has a different solution.

I think this is a complex problem and controls might be needed at several spots. I'd 
be open to hearing ideas on this too. I've also been wondering if the audit daemon 
might want to use control groups as a means of keeping itself scheduled for very busy 
systems. But I'd like to hear other people's thoughts.

-Steve


* Re: reactive audit question
  2010-11-19 16:20 ` Steve Grubb
@ 2010-11-19 18:05   ` LC Bruzenak
  0 siblings, 0 replies; 3+ messages in thread
From: LC Bruzenak @ 2010-11-19 18:05 UTC (permalink / raw)
  To: Steve Grubb; +Cc: linux-audit

On Fri, 2010-11-19 at 11:20 -0500, Steve Grubb wrote:
> 
> I didn't answer right away because I didn't have a good answer for you. If the storm 
> is large enough to overrun the kernel queue, the rate limiting needs to be in the 
> kernel. If auditd is able to handle the load, then perhaps you need an analysis plugin 
> that performs whatever action you deem best.

Steve,
I understand; it isn't a straightforward thing and I appreciate you
thinking about it. I think I have settled on a workable solution.

I am using the unix audisp builtin and sampling the AVC events.
I've got a non-blocking mechanism whereby I can count the AVCs for a
very small number of senders, then take action against the offenders
(kill). It's not perfect and has issues, but it might be satisfactory.

I'm still testing this sampling approach, making certain I don't
introduce any blockage points, which would aggravate the issue.

And while this may work on a single process sending thousands of AVCs in
a tight loop, it wouldn't work on one which gets respawned, unless I
look at the ppid or do something more clever. 
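
For illustration, the counting-and-kill step might look roughly like
this in Python (the pid= field parsing, the threshold value, and
SIGKILL as the action are all assumptions made for the sketch, and the
respawn problem noted above still applies):

```python
import os
import re
import signal
import sys
from collections import Counter

PID_RE = re.compile(r"\bpid=(\d+)")
THRESHOLD = 500          # illustrative per-interval limit

def offenders(lines, threshold=THRESHOLD):
    """Count AVC records per sending pid; return pids over the threshold."""
    counts = Counter()
    for line in lines:
        if "type=AVC" not in line:
            continue                       # only sample AVC records
        m = PID_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return [pid for pid, n in counts.items() if n > threshold]

if __name__ == "__main__":
    for pid in offenders(sys.stdin):
        try:
            os.kill(pid, signal.SIGKILL)   # brutal, as described above
        except ProcessLookupError:
            pass                           # already gone, or respawned
```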

> 
> What is the general source of the problem right now? Was it just that the app was 
> doing something that policy didn't know it could do? Or were there attacks under way, 
> with someone trying something bad? Or was it just an admin mistake where 
> something didn't have the right label? Each of these has a different solution.

Mostly the first scenario you mention - that the 3rd-party application
hit an execution path we had not seen in testing. But of course it
doesn't have to be a 3rd-party app. Even ones we create can run amok
with AVCs if all code paths are not exercised under all data conditions.
Basically untestable in finite time by humans.
:)

Some things you never know the code will do - for example, in one error
recovery case I believe some process (or a library it uses) decides to
go look at a different running process and figure out which connections
it has. It doesn't get an answer because, of course, policy doesn't
allow it to see the /proc details or some such thing; it generates AVCs
and loops forever waiting for an answer.

Or things which normally work fine on targeted-policy systems get
confused on MLS systems because they cannot connect to their server
when invoked for a process running at a higher/lower/incomparable
MLS level. Then they retry a few million times or so...

Or a process decides to see which files it can access in a big data
store. All the ones it cannot access, for MAC (MLS) level reasons,
generate AVCs. A few hundred isn't a big deal; a few million is.

Funny things happen to systems when you subject them to the real world
and real users.
:)

> 
> I think this is a complex problem and controls might be needed at several spots. I'd 
> be open to hearing ideas on this too. I've also been wondering if the audit daemon 
> might want to use control groups as a means of keeping itself scheduled for very busy 
> systems. But i'd like to hear other people's thoughts.

I agree on the complexity. At the very least, though, I'd think adding a
syslog-like function whereby it can aggregate repeated identical audit
events and then submit one summary event like "1000 similar events like
this" would be good.
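
In the spirit of syslog's "last message repeated N times", the
aggregation could be sketched like this (keying on exact event text is
an assumption; real AVC records would need timestamps stripped or a key
function supplied before runs would compare equal):

```python
from itertools import groupby

def coalesce(events, key=None):
    """Collapse runs of similar events into one representative record
    plus a summary line, like syslog's "last message repeated N times"."""
    out = []
    for _, run in groupby(events, key=key):
        run = list(run)
        out.append(run[0])          # keep one representative event
        if len(run) > 1:
            out.append("%d similar events like this" % (len(run) - 1))
    return out
```

At 1500 AVCs/second a run of duplicates collapses to two records, which
is the difference between a few million events and a few thousand over
a weekend.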

Likely 1000 isn't even enough. At one point we were getting well over
1500 AVCs/second over a period of days. On a weekend of course. :)
Actually we were able to process that amount. I have no data on the
number of drops.

It tends to add right up. And this is just one sending host (there are
others, but they are not as busy). If I had multiples, the aggregating
machine would be overrun. As processors/hardware get faster, I assume
the AVC error rates will too.

In my case, the concern is that a valuable event will be dropped off the
queue because floods like the ones I described take all the resources.
Even though I have increased the audispd queue size and the priorities,
at some point saturation will inevitably occur.

Thanks again!
LCB

-- 
LC (Lenny) Bruzenak
lenny@magitekltd.com

