new ABI

* new ABI
@ 2006-08-14 21:12 Massimiliano Hofer
  2006-08-15  0:00 ` Joakim Axelsson
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-14 21:12 UTC
  To: netfilter-devel

Hi,
I couldn't keep pace during the day with all the mail that has been written, 
so let me summarize what has been said.
Please forgive (and correct) me if I forget anything.

First of all, several people think that the current ABI has shortcomings and 
that something has to be done.

Regarding my proposal for priv_data (I'm obviously biased here, but I'll try 
to be objective):
- it offers the ability to store data out of reach of the userspace utilities 
(for whatever housekeeping any match/target needs);
- it can't offer persistent data to matches/targets.

<biased>
With this patch we can part with some really ugly tricks involving userspace 
structure fields and kernel pointers and it would let us have O(1) matches 
for quota, limit and any other match/target that needs cross match data.
</biased>

People may expect the second feature too, but it's just not possible with the 
current infrastructure. The same infrastructure, making extensive use of 
arrays whose size is determined before we call any module hook function, 
leaves us in the cold for a really flexible solution for other problems too.
I've not yet read all the code involved, but, if we really need to, we may be 
able to use a compat-like interface to maintain ABI compatibility with a new 
netfilter core.

What people need from any new infrastructure:
- cleaner interface with clearer separation between kernel and user data;
- ability to dump internal state of matches/targets (this may not be in a 
1-to-1 relation, so it may be tricky; do we need module state dumping?);
- ability to change chains/rules/matches without reinitializing everything;
- ability to change matches' state or configuration without reinitializing 
everything;
- general infrastructure for common logic that is currently reinvented every 
time (negation comes to mind, but I'm sure there are other things).

<biased>
Regarding user influence over state, especially where the number of states 
doesn't match the number of matches involved, I'm not totally opposed to a 
file-like way of exposing it. I agree that /proc is in a sorry state, but 
configfs is there precisely for this purpose. Of course not everything can be 
done this way and I wouldn't like to have complex data passed and parsed this 
way.
We may need a new set of commands in iptables (should I call it iptables-ng?) 
just for keeping this kind of data (realms, quota groups, conditions, etc.). 
If we had a general way to keep collections of configurations, I'd be glad 
to conform and use it.
</biased>

I think the current array-oriented data structures won't allow us to add these 
features. RCU lists come to mind. It sure is a step back in performance 
(sparser access to memory and more memory fragmentation), but it may not be 
that noticeable.
I think every match/target should expose:
- init;
- destroy;
- change;
- dump;
- restore.

Depending on the API change, dump and restore may merge into a single function.
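
A minimal sketch of what such a per-extension operations table could look 
like. All names here (ext_ops, quota_state, quota_*) are hypothetical and 
are not taken from any existing netfilter code; the toy "quota" extension 
only illustrates how init, destroy, change, dump and restore relate to one 
another. In this sketch restore accepts the same format as change, so it 
simply reuses that function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical operations table: these names are illustrative only and do
 * not correspond to any existing netfilter structure. */
struct ext_ops {
    int  (*init)(void **priv, const void *cfg, size_t len);
    void (*destroy)(void *priv);
    int  (*change)(void *priv, const void *cfg, size_t len);
    long (*dump)(const void *priv, void *buf, size_t len); /* bytes or -1 */
    int  (*restore)(void *priv, const void *buf, size_t len);
};

/* Toy "quota" extension: its whole state is one remaining-bytes counter. */
struct quota_state {
    unsigned long long remaining;
};

static int quota_change(void *priv, const void *cfg, size_t len)
{
    struct quota_state *st = priv;

    assert(st != NULL);
    if (len != sizeof(st->remaining))
        return -1;              /* reject malformed user data */
    memcpy(&st->remaining, cfg, len);
    return 0;
}

static int quota_init(void **priv, const void *cfg, size_t len)
{
    struct quota_state *st = malloc(sizeof(*st));

    if (st == NULL)
        return -1;
    if (quota_change(st, cfg, len) != 0) {
        free(st);
        return -1;
    }
    *priv = st;
    return 0;
}

static void quota_destroy(void *priv)
{
    free(priv);
}

static long quota_dump(const void *priv, void *buf, size_t len)
{
    const struct quota_state *st = priv;

    assert(st != NULL);
    if (len < sizeof(st->remaining))
        return -1;
    memcpy(buf, &st->remaining, sizeof(st->remaining));
    return (long)sizeof(st->remaining);
}

/* In this sketch restore takes the same format as change. */
static const struct ext_ops quota_ops = {
    .init    = quota_init,
    .destroy = quota_destroy,
    .change  = quota_change,
    .dump    = quota_dump,
    .restore = quota_change,
};
```

The private state never leaves the extension except through dump, which is 
the separation between kernel and user data asked for above.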

The kernel should let any match:
- receive user-supplied initialization data (mostly a rule definition) and 
state dumps (this calls for very careful planning and checking);
- send state dumps to userspace;
- keep private data for every match/target;
- keep collections of configurations common to matches in a module (every 
module may keep it without netfilter core help, but if it becomes part of the 
infrastructure it may be handled through the userspace ABI).

Am I forgetting anything?
Do you think any of these features are bugs?
Am I overlooking fatal difficulties related to what I wrote?
Please reply with your opinions. I'll wear my asbestos suit for the next 
couple of days. :)

-- 
Saluti,
   Massimiliano Hofer

* Re: new ABI
  2006-08-14 21:12 new ABI Massimiliano Hofer
@ 2006-08-15  0:00 ` Joakim Axelsson
  2006-08-15  8:39   ` Amin Azez
  2006-08-15 22:08   ` Massimiliano Hofer
  2006-08-15 12:14 ` Simon Lodal
  2006-08-16 12:16 ` Joakim Axelsson
  2 siblings, 2 replies; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-15  0:00 UTC
  To: Massimiliano Hofer; +Cc: netfilter-devel

2006-08-14 23:12:41+0200, Massimiliano Hofer <max@nucleus.it> ->
> Hi,
> I couldn't keep pace during the day with all the mail that has been written, 
> so let me summarize what has been said.
> Please forgive (and correct) me if I forget anything.
> 
> First of all several people think that the current ABI has shortcomings and 
> something has to be done.
> 
> Regarding my proposal for priv_data (I'm obviously biased here, but I'll try 
> to be objective):
> - it offers the ability to store data out of reach to the userspace utilities 
> (for whatever housekeeping any match/targets needs);
> - it can't offer persistent data to matches/targets.
> 
> <biased>
> With this patch we can part with some really ugly tricks involving userspace 
> structure fields and kernel pointers and it would let us have O(1) matches 
> for quota, limit and any other match/target that needs cross match data.
> </biased>
> 

O(1) can be done today as well. Just figure out the pointer in checkentry()
and keep it. However, do this every time. Don't trust userspace not to alter
it. So it's O(1) in match()/target() but not in checkentry(). Shouldn't be
too bad.
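
A rough userspace sketch of that pattern, under stated assumptions: the 
struct and function names (counter, match_info, counter_lookup) are made up 
for illustration, and checkentry()/match() are reduced to the bare idea 
rather than given their real kernel signatures. The lookup by name is O(n) 
and happens only in checkentry(); the per-packet path just follows the 
cached pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN     16
#define MAX_COUNTERS 8

/* Shared state that several rules may refer to by name. */
struct counter {
    char          name[NAME_LEN];
    unsigned long packets;
};

static struct counter counters[MAX_COUNTERS];
static size_t n_counters;

/* Per-rule data. "name" comes from userspace; "cached" is kernel-only and
 * is overwritten in checkentry() no matter what userspace put there. */
struct match_info {
    char            name[NAME_LEN];
    struct counter *cached;
};

/* O(n) lookup by name, creating the counter on first use. */
static struct counter *counter_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < n_counters; i++)
        if (strcmp(counters[i].name, name) == 0)
            return &counters[i];
    if (n_counters == MAX_COUNTERS)
        return NULL;
    strncpy(counters[n_counters].name, name, NAME_LEN - 1);
    counters[n_counters].name[NAME_LEN - 1] = '\0';
    return &counters[n_counters++];
}

/* Runs once per rule load: resolve the pointer, ignoring any old value. */
static int checkentry(struct match_info *info)
{
    info->name[NAME_LEN - 1] = '\0';    /* never trust user strings */
    info->cached = counter_lookup(info->name);
    return info->cached != NULL;
}

/* Runs per packet: O(1), no lookup. */
static int match(struct match_info *info)
{
    assert(info->cached != NULL);
    info->cached->packets++;
    return 1;
}
```

Two rules naming the same counter end up sharing one struct counter, even 
if userspace passed in a stale or bogus pointer value.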

> People may expect the second feature too, but it's just not possible with the 
> current infrastructure. The same infrastructure, making extensive use of 
> arrays whose size is determined before we call any module hook function, 
> leaves us in the cold for a really flexible solution for other problems too.
> I've not yet read all the code involved, but, if we really need to, we may be 
> able to use a compat-like interface to maintain ABI compatibility with a new 
> netfilter core.
> 
> What people need from any new infrastructure:
> - cleaner interface with clearer separation between kernel and user data;
> - ability to dump internal state of matches/targets (this may not be in a 
> 1-to-1 relation, so it may be tricky, do we need module state dumping?);
> - ability to change chains/rules/matches without reinitializing everything;
> - ability to change matches' state or configuration without reinitializing 
> everything;
> - general infrastructure for common logic that is currently reinvented every 
> time (negation comes to mind, but I'm sure there are other things).
> 
> <biased>
> Regarding user influence over state, especially where the number of states 
> doesn't match the number of matches involved, I'm not totally opposed to a 
> file-like way of exposing it. I agree that /proc is in a sorry state, but 
> configfs is there precisely for this purpose. Of course not everything can be 
> done this way and I wouldn't like to have complex data passed and parsed this 
> way.

This isn't a bad idea. Can we make an iptablesfs that can be used in a smart
way? It will offer an obvious API for any userspace-program/script to use.
We wouldn't then need any explicit userspace-program at all to maintain.
Only kernel.

Perhaps we will push parsing functions into the kernel then. Not good. But if
we could come up with an API for an iptablesfs, the parsing would be minimal
and confined to common fast functions.

> We may need a new set of commands in iptables (should I call it iptables-ng?) 
> just for keeping this kind of data (realms, quota groups, conditions, etc.). 
> If we had a general way to keep collections of configurations, I'll be glad 
> to conform and use it.
> </biased>
> 
> I think the current array oriented data structures won't allow us to add these 
> features. RCU lists come to mind. It sure is a step back in performance 
> (sparser access to memory and more memory fragmentation), but it may not be 
> that noticeable.
> I think every match/target should expode:
> - init;
> - destroy;
> - change;
> - dump;
> - restore.
> 

Don't forget the worker: match()/target().

> Depending on the API change, dump and restore may melt in a single function.
> 
> The kernel should let any match:
> - receive user supplied initialization data (mostly a rule definition) state 
> dumps (this calls for very careful planning and checking);
> - send state dumps to userspace;
> - keep private data for every match/target;
> - keep collections of configurations common to matches in a module (every 
> module may keep it without netfilter core help, but if it becomes part of the 
> infrastructure it may be handled through the userspace ABI).
> 
> 
> Am I forgetting anything?
> Do you think any of these features are bugs?
> Am I overseeing fatal difficulties related to what I wrote?
> Please reply with your opinions. I'll wear my asbestos suite for the next 
> couple of days. :)
> 

It would probably be nice to introduce more advanced pseudo data types. Like
a rate type ( X / time ). IP-type. Netmask-type and so on. Common parser
libraries for the userspace tool.

Also, don't forget that people tend to think that iptables is way too
complicated. I think people like the BSD style of writing the rules to a file
that is then "executed". The file syntax is more like writing sentences
describing what you want. I myself hate that way of configuring with "words"
rather than parameters. But still, it is one thing to consider.

A file-only config somewhat solves the iptables vs iptables-save/restore
syndrome. They are not always in sync. Compare this with switches like
Cisco's, where configuration alterations are always saved: you work in a
shell that changes the only config file directly. In our case this would
mean that iptables and iptables-save are the same; iptables only alters the
"only file" that iptables-save has. Introducing a single file will not
necessarily force a full reload of all rules just to alter one rule. Rules
that have state can get a state id which they can hook back onto when
reloading.

Also, try to move away from the small way of thinking about rules we have
today. Try to see a bigger picture. What is a rule? What can it do? What is a
match or target? What is a module? I could basically write myself a module
today that does my entire firewall using only one iptables rule:
"iptables -A INPUT -m myhugefirewall".
Try to focus on the bigger modules like recent, ipset, accounting, conntrack.
Not the small and simple ones such as length, ttl and mport.

Another way of doing a firewall is to write your rules in some syntax in a
file. Have a userspace program parse it into C code. Have gcc compile it
into a kernel module. Load it. This will optimize the firewall a lot. Again,
state can be saved between reloads just by using some ids and hooks for the
rules that need state.

Look at the work I began with ippool, which was later finished in a much
smaller version as ipset. The original idea has no limits at all (well, a lot
fewer at least :-). The idea was to make three categories of elements: data
structures, data interpreters and algorithms. As data structures we can
have: array, bitmap, hash, rcu-list, priority queue. Data interpreters:
IP addresses, IPv6 addresses, port numbers, times/dates, ranges and so on.
Algorithms: timeout, sorting, logging and more. You can then combine a data
structure with a data interpreter, and possibly with an algorithm or two. So
to build recent you combine the data structure hash with IP addresses, and
might also add the algorithm timeout. To just match a single IP source
address, combine the data structure 'single' with the data interpreter 'IPv4
address' and the algorithm 'source', or something like that. This would be
impossible with the current API of iptables. But if you just add alter/change
and dump, it would be possible. Even better, it could perhaps be integrated
with a new iptables-ng. A crazy idea, and perhaps impossible to make easy to
use. But it would be a really, really nice professional tool.
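
To make the combination idea concrete, here is a small sketch under my own
assumptions; it is not the ippool or ipset code, and every name in it
(store_ops, interp_fn, struct set, ...) is invented for illustration. A
"set" pairs one data structure with one data interpreter; algorithms such as
timeout are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

struct packet {
    unsigned int   saddr;
    unsigned short dport;
};

/* Data interpreter: turns a packet into a key. */
typedef unsigned int (*interp_fn)(const struct packet *);

static unsigned int interp_saddr(const struct packet *p) { return p->saddr; }
static unsigned int interp_dport(const struct packet *p) { return p->dport; }

/* Data structure: stores keys and answers membership queries. */
struct store_ops {
    void (*add)(void *store, unsigned int key);
    int  (*contains)(const void *store, unsigned int key);
};

/* 'single': holds at most one key. */
struct single_store { unsigned int key; int used; };

static void single_add(void *store, unsigned int key)
{
    struct single_store *s = store;

    s->key = key;
    s->used = 1;
}

static int single_contains(const void *store, unsigned int key)
{
    const struct single_store *s = store;

    return s->used && s->key == key;
}

static const struct store_ops single_ops = { single_add, single_contains };

/* 'bitmap': one bit per 16-bit key, e.g. port numbers. */
struct bitmap_store { unsigned char bits[65536 / 8]; };

static void bitmap_add(void *store, unsigned int key)
{
    struct bitmap_store *b = store;

    if (key < 65536)
        b->bits[key >> 3] |= (unsigned char)(1u << (key & 7));
}

static int bitmap_contains(const void *store, unsigned int key)
{
    const struct bitmap_store *b = store;

    return key < 65536 && ((b->bits[key >> 3] >> (key & 7)) & 1);
}

static const struct store_ops bitmap_ops = { bitmap_add, bitmap_contains };

/* A set is one data structure combined with one interpreter. */
struct set {
    const struct store_ops *ops;
    void                   *store;
    interp_fn               interp;
};

static void set_add(struct set *s, unsigned int key)
{
    s->ops->add(s->store, key);
}

static int set_match(const struct set *s, const struct packet *p)
{
    assert(s->ops != NULL && s->interp != NULL);
    return s->ops->contains(s->store, s->interp(p));
}
```

Matching a single source address is then 'single' plus interp_saddr, and a
port list is 'bitmap' plus interp_dport; neither data structure knows what
its keys mean.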

Also remember that a lot of firewall setups for routers handle several
different destinations. Today there is no good way of grouping rules, or of
grouping destinations by the customer they belong to.

Many crazy ideas, so keep your asbestos suit on :-P

--
Joakim Axelsson

* Re: new ABI
  2006-08-15  0:00 ` Joakim Axelsson
@ 2006-08-15  8:39   ` Amin Azez
  2006-08-15 22:08   ` Massimiliano Hofer
  1 sibling, 0 replies; 31+ messages in thread
From: Amin Azez @ 2006-08-15  8:39 UTC
  To: Massimiliano Hofer, netfilter-devel

* Joakim Axelsson wrote, On 15/08/06 01:00:
> 2006-08-14 23:12:41+0200, Massimiliano Hofer <max@nucleus.it> ->
>> Hi,
>> I couldn't keep pace during the day with all the mail that has been written, 
>> so let me summarize what has been said.
>> Please forgive (and correct) me if I forget anything.
>>
>> First of all several people think that the current ABI has shortcomings and 
>> something has to be done.
>>
>> Regarding my proposal for priv_data (I'm obviously biased here, but I'll try 
>> to be objective):
>> - it offers the ability to store data out of reach to the userspace utilities 
>> (for whatever housekeeping any match/targets needs);
>> - it can't offer persistent data to matches/targets.
>>
>> <biased>
>> With this patch we can part with some really ugly tricks involving userspace 
>> structure fields and kernel pointers and it would let us have O(1) matches 
>> for quota, limit and any other match/target that needs cross match data.
>> </biased>
>>
> 
> O(1) can be done today as well. Just figure out the pointer in checkentry()
> and keep it. However, do this everytime. Don't trust userspace to not alter
> it. So its O(1) in match()/target() but not in checkentry(). Shouldn't be
> too bad.

I did this for layer7 matching to cache the compiled regex, however it
stopped deletion of rules by specification (not by index) because the
matchinfo struct no longer matched (the kernel based one had the pointer
but the userland based one that was being compared did not). I didn't
pin down the code doing the match which would need "teaching" not to
match the private bit, as time constraints were too tight.

> It would probably be nice to introduce more advanced pseudo data types. Like
> a rate type ( X / time ). IP-type. Netmask-type and so on. Common parser
> libraries for the userspace tool.
>
> Also, don't forget that people tend to think that iptables are way too
> complicated. I think people like BSD style of writing the rules to a file
> that is then "executed". The file syntax is more of writing sentences of
> what you want. I my self hate that way of configuring with "words" rather
> than parameters. But still, one thing to consider. 
> 
> A file only config somewhat solves the iptables vs iptables-save/restore
> syndrome. They are not always in sync. Also in comparation with switches
> like Cisco, configuration alterations are always saved. You work in a shell
> what changes the only config-file directly. This means that in our case,
> iptables and iptables-save are the same. iptables only alter the "only file"
> that iptables-save has. By introducing a file only will not nessesary make a
> full realod of all rules only to alter one rule. The rules, if having a
> state can get a state id which they can hook back on when reloading.
> 
> Also, try to move away from the small thinking or rules we have today. Try
> to see a bigger picture. What is a rule. What can it do? What is match or
> target? What is a module. I could basicly write myself a module today that
> does my entire firewall using only one iptables rule. 
> "iptables -A INPUT -m myhugefirewall".
> Try focus on the bigger modules like recent, ipset, accounting, conntrack.
> Not the small and simple ones as length, ttl and mport.

I have modified iptables-restore (as it is good at parsing iptables-save
format) to output an xml specification. In support of your comment here:
if iptables modules preserved semantics by making use of macros or
function calls instead of printf when saving their rules, then it would
be easy to support various more readable representations of rules, which
would answer the suggestions you mention here.

> Another way of doing firewall is to write your rules in some syntax in a
> file. Have a userspace program parse it into C-code. Have your gcc compiler
> compile it into a kernel module. Load it. This will optimize the firewall
> ALOT. Still again, states can be saved between reloads just using some ids
> and hooks for the rules that needs a state.
> 
> If you look at the work i began with ippool which was later finished in a
> much smaller version as ipset has no limits at all (well alot less atleast
> :-). The idea was to make three category of elements. Data structes, data
> interpreter and algorithms. Meaning we can have as data strcutes: array,
> bitmap, hash, rcu-list, priority queue. Data interpreter: ip-addresses,
> ipv6-addresses, port numbers, times/dates, ranges and so on. Algoritms:
> timeout, sorting, logging and more. You can then combine a data structure
> with a data interpreter, and possible with an algoritm or two. So to build
> recent you combine data structure hash with ip-addresses, might also add the
> algorithm timeout. To just match a single IP source address, combine the
> data structure 'single' with data interpreter 'IPv4 address' and algoritm
> 'source', or something like that. This would be impossible with the current
> API of iptables. But if you only add alter/change and dump this would be.
> Even better i can be integrated with new iptables-ng perhaps. Crazy idea and
> perhaps impossible to make it easy to use. But would be a really really nice
> professional tool.
> 
> Also remember that alot of firewall setups for routers handles several
> different destinations. Today no good way of grouping rules and/or trying to
> group which destinations belongs to which custumer exists.
> 
> Many crazy ideas, so keep your asbestos suit on :-P

These ideas are not crazy and are very much relevant to me.
I'm moving to an xml representation of rules because the abstraction allows
me to implement user requirements in various ways depending on the
current capability of iptables. iptables now supports a module appearing
more than once in a match, but not multiple targets. With a meaningful
xml representation (or any easily manipulated representation) I can
render the user's requirements as multiple iptables matches now, possibly
with extra chains, and have these reduced to a single rule in the future.

Sam

* Re: new ABI
  2006-08-14 21:12 new ABI Massimiliano Hofer
  2006-08-15  0:00 ` Joakim Axelsson
@ 2006-08-15 12:14 ` Simon Lodal
  2006-08-15 22:57   ` Massimiliano Hofer
  2006-08-16 12:16 ` Joakim Axelsson
  2 siblings, 1 reply; 31+ messages in thread
From: Simon Lodal @ 2006-08-15 12:14 UTC
  To: netfilter-devel; +Cc: Massimiliano Hofer

On Monday 14 August 2006 23:12, Massimiliano Hofer wrote:
> Hi,
> I couldn't keep pace during the day with all the mail that has been
> written, so let me summarize what has been said.
> Please forgive (and correct) me if I forget anything.
>
> First of all several people think that the current ABI has shortcomings and
> something has to be done.

Everybody has a long wishlist and seems to agree that something fundamental 
needs to be done.

The question seems to be when backwards compatibility can be given up.

> Regarding my proposal for priv_data (I'm obviously biased here, but I'll
> try to be objective):
> - it offers the ability to store data out of reach to the userspace
> utilities (for whatever housekeeping any match/targets needs);
> - it can't offer persistent data to matches/targets.
>
> <biased>
> With this patch we can part with some really ugly tricks involving
> userspace structure fields and kernel pointers and it would let us have
> O(1) matches for quota, limit and any other match/target that needs cross
> match data. </biased>
>
> People may expect the second feature too, but it's just not possible with
> the current infrastructure. The same infrastructure, making extensive use
> of arrays whose size is determined before we call any module hook function,
> leaves us in the cold for a really flexible solution for other problems
> too. I've not yet read all the code involved, but, if we really need to, we
> may be able to use a compat-like interface to maintain ABI compatibility
> with a new netfilter core.
>
> What people need from any new infrastructure:
> - cleaner interface with clearer separation between kernel and user data;
> - ability to dump internal state of matches/targets (this may not be in a
> 1-to-1 relation, so it may be tricky, do we need module state dumping?);

Yes, but why should that be hard? Netfilter should already have a list of 
registered modules.

> - ability to change chains/rules/matches without reinitializing everything;
> - ability to change matches' state or configuration without reinitializing
> everything;
> - general infrastructure for common logic that is currently reinvented
> every time (negation comes to mind, but I'm sure there are other things).

Agreed.

> <biased>
> Regarding user influence over state, especially where the number of states
> doesn't match the number of matches involved, I'm not totally opposed to a
> file-like way of exposing it. I agree that /proc is in a sorry state, but
> configfs is there precisely for this purpose. Of course not everything can
> be done this way and I wouldn't like to have complex data passed and parsed
> this way.

We are going to have "interesting" data that are not 1:1 with rules. But then 
they will be 1:1 with modules, or some other "scope" that netfilter knows how 
to traverse. Each "scope" can have their own section in the iptables-save 
output. Hence the parsing complexity lies in iptables-restore.

Whether it is all going to be exposed in some filesystem or not is a different 
matter.

> We may need a new set of commands in iptables (should I call it
> iptables-ng?)

What is the version after it going to be then? No, I never liked the -ng 
suffix :)

What is wrong with iptables2? 

> just for keeping this kind of data (realms, quota groups, 
> conditions, etc.). If we had a general way to keep collections of
> configurations, I'll be glad to conform and use it.
> </biased>
>
> I think the current array oriented data structures won't allow us to add
> these features. RCU lists come to mind. It sure is a step back in
> performance (sparser access to memory and more memory fragmentation), but
> it may not be that noticeable.

Flexibility is not free, but perhaps it can be cheap, performance wise.

Let's say we make iptables more shell-like, with the ability to handle 
multiple commands in one invocation (with a final COMMIT command required)? 
Would be lovely in itself.

Then iptables would get a better chance to optimize memory allocation, since 
it is not only looking at one rule at a time.

The case where you load the entire firewall ruleset in one go could be 
optimized to a point where it is no different from today.

Another proposal: Abstract out the storage for each rule/match/target. Let the 
kernel netfilter part handle allocation. Allow it to do transparent 
reallocation to optimize for locality:

 * The current scheme where the ipt_entry* structs are extended through 
the .data/.elems fields is abandoned. Hence, all ipt_entry* structs have 
static sizes, and can be stored in plain arrays regardless of what type they 
really are.

 * ipt_entry* structs might contain data (like basic src/dst/port/iface 
matches), but they may not keep pointers to anything, not even their own 
fields. They are independent of their own memory location. The memory 
management code can therefore rearrange the tables at will (proper locking 
assumed), without having to reinitialize rules.

 * All other memory is accessed through a struct that is passed to each 
rule/match/target's API functions. It contains at least .instance_data, but 
also .module_data (.priv_data), and perhaps other scopes data, 
like .rule_data, .chain_data and .global_data (all cross-module). Note that 
each of these are bound to a specific entity.

 * Each module and instance must call special netfilter APIs to allocate 
memory of the required types. The netfilter part handles freeing through 
refcounts (why not).

 * The actual .*_data pointers may change between invocations (packets fed to) 
of the same rule/match/target. This means the netfilter part is allowed to 
rearrange dynamic memory too.

 * The catch: Rules can be changed without affecting the others, we get 
priv_data, and dynamic-sized memory allocation for individual instances as 
well. Memory layout and optimization are isolated from the modules and 
userspace. It can become complex, but it does not have to be complex just to 
get it working.

 * Bonus: Sync of memory regions with other hosts can be handled 
transparently, or at least easily, so that e.g. limit rules can work across 
redundant hosts.
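
A toy sketch of the points above, with heavy assumptions: every name 
(struct entry, struct ext_ctx, table_relocate, ...) is invented here, 
locking and refcounting are left out, and the instance data is a plain 
counter. It only shows that if entries hold an index instead of a pointer 
and extensions reach their data through a context built per invocation, the 
core can move both the entries and the data without reinitializing anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fixed-size entry: inline match data only, never a pointer. */
struct entry {
    unsigned int   src, dst;
    unsigned short dport;
    unsigned short slot;        /* index into the core-owned data table */
};

/* Context handed to an extension on every invocation. */
struct ext_ctx {
    void *instance_data;
    void *module_data;
};

struct table {
    struct entry   *entries;
    unsigned long **instance;   /* instance[slot], owned by the core */
    size_t          n;
    void           *module_data;
};

static void table_free(struct table *t)
{
    size_t i;

    for (i = 0; t->instance != NULL && i < t->n; i++)
        free(t->instance[i]);
    free(t->instance);
    free(t->entries);
    *t = (struct table){ 0 };
}

static int table_init(struct table *t, size_t n)
{
    size_t i;

    *t = (struct table){ 0 };
    t->entries = calloc(n, sizeof(*t->entries));
    t->instance = calloc(n, sizeof(*t->instance));
    t->n = n;
    for (i = 0; t->entries != NULL && t->instance != NULL && i < n; i++) {
        t->entries[i].slot = (unsigned short)i;
        t->instance[i] = calloc(1, sizeof(**t->instance));
        if (t->instance[i] == NULL)
            break;
    }
    if (i == n && n > 0)
        return 0;
    table_free(t);
    return -1;
}

/* Toy extension: counts hits in its instance data. */
static int count_match(const struct entry *e, struct ext_ctx *ctx)
{
    unsigned long *hits = ctx->instance_data;

    (void)e;
    assert(hits != NULL);
    ++*hits;
    return 1;
}

/* The core builds the context fresh for each packet. */
static int table_run(struct table *t, size_t i)
{
    struct ext_ctx ctx;

    assert(i < t->n);
    ctx.instance_data = t->instance[t->entries[i].slot];
    ctx.module_data = t->module_data;
    return count_match(&t->entries[i], &ctx);
}

/* Move every entry and every instance block to new memory. */
static int table_relocate(struct table *t)
{
    struct entry *ne = malloc(t->n * sizeof(*ne));
    size_t i;

    if (ne == NULL)
        return -1;
    memcpy(ne, t->entries, t->n * sizeof(*ne));
    free(t->entries);
    t->entries = ne;
    for (i = 0; i < t->n; i++) {
        unsigned long *nd = malloc(sizeof(*nd));

        if (nd == NULL)
            return -1;          /* table stays valid, just partly moved */
        *nd = *t->instance[i];
        free(t->instance[i]);
        t->instance[i] = nd;
    }
    return 0;
}
```

The extension never sees a stable address, so a hit counted before 
table_relocate() is still there afterwards.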

As you can probably guess I am blissfully unaware of a few intricate details. 
Please enlighten me.

I have no clear idea how all these individual blobs would be communicated 
between kernel and userspace. Except there are two general options:

1) The current "pass a large blob" scheme. Since it will contain many smaller 
blobs, some in-kernel parsing is required. Worse yet, the kernel must also be 
able to assemble a large blob in order to dump to userspace.

2) Build the structure piece by piece, with a simple kernel call for each 
piece. Not entirely unlike populating a directory structure...

> I think every match/target should expode:
> - init;
> - destroy;
> - change;
> - dump;
> - restore.

change() would be nice, like in qdisc.

> Depending on the API change, dump and restore may melt in a single
> function.
>
> The kernel should let any match:
> - receive user supplied initialization data (mostly a rule definition)
> state dumps (this calls for very careful planning and checking);
> - send state dumps to userspace;
> - keep private data for every match/target;
> - keep collections of configurations common to matches in a module (every
> module may keep it without netfilter core help, but if it becomes part of
> the infrastructure it may be handled through the userspace ABI).
>
>
> Am I forgetting anything?

We all are.

> Do you think any of these features are bugs?
> Am I overseeing fatal difficulties related to what I wrote?
> Please reply with your opinions. I'll wear my asbestos suite for the next
> couple of days. :)

Applause for possibly opening the can of worms :)

Regards,
Simon

* Re: new ABI
  2006-08-15  0:00 ` Joakim Axelsson
  2006-08-15  8:39   ` Amin Azez
@ 2006-08-15 22:08   ` Massimiliano Hofer
  1 sibling, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-15 22:08 UTC
  To: netfilter-devel

On Tuesday 15 August 2006 2:00 am, Joakim Axelsson wrote:

> > With this patch we can part with some really ugly tricks involving
> > userspace structure fields and kernel pointers and it would let us have
> > O(1) matches for quota, limit and any other match/target that needs cross
> > match data. </biased>
>
> O(1) can be done today as well. Just figure out the pointer in checkentry()
> and keep it. However, do this everytime. Don't trust userspace to not alter
> it. So its O(1) in match()/target() but not in checkentry(). Shouldn't be
> too bad.

I know. I already figure it out at checkentry (it wouldn't work at all if I 
didn't), the real point was: where do I store it?
The answer can be:
- I reserve some space in the userspace provided structure in order to do it;
- I apply priv_data.

While I certainly can do the former, I think the latter is far cleaner.
Of course changing the ABI for all other matches isn't nice either.

> This isn't a bad idea. Can we make an iptablesfs that can be used in a
> smart way? It will offer an obvious API for any userspace-program/script to
> use. We wouldn't then need any explicit userspace-program at all to
> maintain. Only kernel.
>
> Perhaps we will push parsing functions into kernel then. Not good. But if
> we could come up with an API for a iptablesfs the parsing would be minimal
> and just in common fast functions.

As much as I'd like to "ls" and "tar" my rules, I think that pushing the whole 
userspace parsing done by iptables into the kernel is a Bad Thing (TM).

I was referring to modules that needed extra configurations (something more 
than simple match parameters) and that needed to communicate with userspace.

Basically anything that has a "name" or "realm" parameter in it. I currently 
maintain condition, which uses procfs, and I'm going to convert it to 
configfs. I was asking the people on the mailing list whether this feature is 
useful/widespread enough to justify general infrastructure.
This could be in the form of a file system hook 
(/config/netfilter/modulename/entry) or a specific set of commands in the 
next version of iptables.
How many people would benefit from this?

> > I think every match/target should expode:
> > - init;
> > - destroy;
> > - change;
> > - dump;
> > - restore.
>
> Don't forget the worker: match()/target().

Yeah, that too. :)

> It would probably be nice to introduce more advanced pseudo data types.
> Like a rate type ( X / time ). IP-type. Netmask-type and so on. Common
> parser libraries for the userspace tool.

I agree. Although, if matches were really modular and the combination logic 
powerful enough, you would hardly need to parse the same type of data twice.
I realize this is a bit idealistic. A library of parser functions would 
certainly be nice.

> Also, don't forget that people tend to think that iptables are way too
> complicated. I think people like BSD style of writing the rules to a file
> that is then "executed". The file syntax is more of writing sentences of
> what you want. I my self hate that way of configuring with "words" rather
> than parameters. But still, one thing to consider.

I like my rules generated by my scripts. Of course I could generate a file, 
but I don't think there is anything obviously better in this approach.
Either way some utility needs to parse something; parameters on the command 
line or lines in a file make really little difference.
I don't know much about BSD. Is there something that makes it really easier?

> A file only config somewhat solves the iptables vs iptables-save/restore
> syndrome. They are not always in sync. Also in comparation with switches

A good library approach to both would solve this.

> like Cisco, configuration alterations are always saved. You work in a shell

I often work with complex rulesets that are far less complex to generate 
from some parameters and a few criteria.
In some cases I will always need program-generated rules. In many other cases 
I could do without them if I had:
- expression "subchains": it would be really powerful if I could create a 
chain with a resulting target that just says true or false and then use it as 
a match in other rules (we could even cache the result for multiple 
invocations);
- effective ways to handle sets (ports, IPs, anything): I know what I need is 
in pom, but it's not yet in the stable kernel and we could take the concept 
further.
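
A minimal sketch of the "subchain as a match" idea with per-packet caching. 
Everything here is an assumption made for illustration: the names are 
invented, a subchain is simplified to an AND over predicates instead of a 
real chain of rules, and packets are assumed to carry a unique id that the 
cache can key on.

```c
#include <assert.h>
#include <stddef.h>

struct packet {
    unsigned long  id;      /* assumed unique per packet */
    unsigned int   saddr;
    unsigned short dport;
};

typedef int (*pred_fn)(const struct packet *);

/* A subchain yields true or false and remembers its last verdict. */
struct subchain {
    const pred_fn *preds;
    size_t         n;
    unsigned long  cached_id;
    int            cached_valid;
    int            cached_result;
    unsigned long  evaluations;  /* how often the predicates really ran */
};

static int from_lan(const struct packet *p) { return (p->saddr >> 24) == 10; }
static int is_web(const struct packet *p)   { return p->dport == 80; }

static const pred_fn lan_web[] = { from_lan, is_web };

/* Usable as a match from any number of rules; evaluated once per packet. */
static int subchain_match(struct subchain *sc, const struct packet *p)
{
    size_t i;
    int result = 1;

    assert(sc->preds != NULL || sc->n == 0);
    if (sc->cached_valid && sc->cached_id == p->id)
        return sc->cached_result;

    for (i = 0; i < sc->n; i++) {
        if (!sc->preds[i](p)) {
            result = 0;
            break;
        }
    }
    sc->evaluations++;
    sc->cached_id = p->id;
    sc->cached_valid = 1;
    sc->cached_result = result;
    return result;
}
```

Several rules can then refer to the same subchain and only the first one to 
see a given packet pays for the evaluation.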

> what changes the only config-file directly. This means that in our case,
> iptables and iptables-save are the same. iptables only alter the "only
> file" that iptables-save has. By introducing a file only will not nessesary
> make a full realod of all rules only to alter one rule. The rules, if
> having a state can get a state id which they can hook back on when
> reloading.

A system of rule/match ids will be needed. If we use a command interface,
these could be assigned by the kernel. With a file we will need to rely on
the user.

> Also, try to move away from the small thinking or rules we have today. Try

That's why I was asking for opinions.
Whatever you propose, keep in mind that someone has to implement it.
I will gladly help, but I only have enough time to work on a fraction of
the final result.
So, it has to be better, but reachable. Of course we will need to design with
room for improvement.

> Another way of doing firewall is to write your rules in some syntax in a
> file. Have a userspace program parse it into C-code. Have your gcc compiler
> compile it into a kernel module. Load it. This will optimize the firewall
> ALOT. Still again, states can be saved between reloads just using some ids
> and hooks for the rules that needs a state.

This would optimize the simple rules, but the larger ones eat most of their 
time in code that is already optimized this way (eg: tracking code for 
state). More than a compiler we would need a rule optimizer.
How many people have performance issues with the current system? How much
better are other production systems?

If you want to think big you could design the new rules like a logic language 
and let loose any sort of optimization on it.

> If you look at the work i began with ippool which was later finished in a

I'll try and read it in the next few days.

> the algorithm timeout. To just match a single IP source address, combine
> the data structure 'single' with data interpreter 'IPv4 address' and

This looks like generic programming and good design patterns. The difficult 
part is exposing this much flexibility to the user in a meaningful way.

> Many crazy ideas, so keep your asbestos suit on :-P

You too. :-P

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-15 12:14 ` Simon Lodal
@ 2006-08-15 22:57   ` Massimiliano Hofer
  2006-08-18 14:14     ` Simon Lodal
                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-15 22:57 UTC (permalink / raw)
  To: netfilter-devel

On Tuesday 15 August 2006 2:14 pm, Simon Lodal wrote:

> Everybody has a long wishlist and seem to agree that something fundamental
> needs to be done.
>
> The question seems to be when backwards compatibility can be given up.

Everyone agrees that we have reached the maximum expressiveness with the
current system.
Nobody says that we couldn't keep a way to convert old rules to the new
system.
The real question thus becomes: is it worth restarting from (almost) scratch?

> > What people need from any new infrastructure:
> > - cleaner interface with clearer separation between kernel and user data;
> > - ability to dump internal state of matches/targets (this may not be in a
> > 1-to-1 relation, so it may be tricky, do we need module state dumping?);
>
> Yes, but why should that be hard? Netfilter should already have a list of
> registered modules.

Yes, but iptables has no way to manipulate per-module data (e.g. the
collection of names and flags for condition, but there are plenty of other
examples).
I don't think it would be difficult, even without a total redesign. I was 
testing the ground for ideas and real needs.

> We are going to have "interesting" data that are not 1:1 with rules. But
> then they will be 1:1 with modules, or some other "scope" that netfilter

Make it n:1. I don't think n:n is desirable.

> knows how to traverse. Each "scope" can have their own section in the
> iptables-save output. Hence the parsing complexity lies in
> iptables-restore.
>
> Whether it is all going to be exposed in some filesystem or not is a
> different matter.

I like file interfaces, but not everything readily becomes a file. It all 
depends on what people really want to do with this class of data.

> What is the version after it going to be then? No, I never liked the -ng
> suffix :)
>
> What is wrong with iptables2?

OK. We had ipfwadm and ipchains. So we're really more like iptables4. :)

> Flexibility is not free, but perhaps it can be cheap, performance wise.
>
> Let's say we make iptables more shell-like, with the ability to handle
> multiple commands in one invocation (with a final COMMIT command required)?
> Would be lovely in itself.
>
> Then iptables would get a better chance to optimize memory allocation,
> since it is not only looking at one rule at a time.
>
> The case where you load the entire firewall ruleset in one go could be
> optimized to a point where it is no different from today.

That holds if we assume we know the sizes of everything. I think
matches/targets need a chance to influence their own data (currently they
can't).
We'll have:
- general data structures (fixed);
- match/target descriptor (passed by userspace and of known size);
- match/target runtime data (potentially anything from a single byte to a 
dynamic structure).

Currently matches/targets are fed the descriptor. I'd like them to be fed a
descriptor and their runtime data. We can suppose the latter won't be needed
by every match, so it won't impact performance.
We still have a fixed-size data structure that we can move/compact/rewrite
and a variable-sized descriptor that we could potentially move (we could
move it if people weren't abusing it for lack of runtime data).
The first one can become a simple allocation in a list node array (with some
mechanism for growing and shrinking). The descriptors are a little more
tricky and we would need stricter specifications in order to do proper
repacking.
Before we continue work on a non-problem: do we have data about kernel memory 
fragmentation and performance issues?
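
A minimal sketch of the descriptor / runtime data split, assuming a module is
a class with init() and match() hooks. MatchInstance, EveryNthMatch and
PortMatch are hypothetical names for illustration, not existing netfilter
code. The descriptor is what userspace passed and is never written to; the
runtime data is created by the module itself and may be absent.

```python
class EveryNthMatch:
    """Stateful: needs a per-instance packet counter."""

    @staticmethod
    def init(descriptor):
        return {"seen": 0}

    @staticmethod
    def match(packet, descriptor, runtime):
        runtime["seen"] += 1
        return runtime["seen"] % descriptor["every"] == 0


class PortMatch:
    """Stateless: the descriptor alone is enough."""

    @staticmethod
    def init(descriptor):
        return None  # no runtime data, so nothing extra to allocate

    @staticmethod
    def match(packet, descriptor, runtime):
        return packet["dport"] == descriptor["dport"]


class MatchInstance:
    """Binds one module to its descriptor and its runtime data."""

    def __init__(self, module, descriptor):
        self.module = module
        self.descriptor = dict(descriptor)            # as sent by userspace
        self.runtime = module.init(self.descriptor)   # kernel-side state

    def match(self, packet):
        return self.module.match(packet, self.descriptor, self.runtime)
```

Because the counter lives in the runtime data, the descriptor stays exactly
as userspace wrote it and could be dumped back unchanged.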

>  * ipt_entry* structs might contain data (like basic src/dst/port/iface
> matches), but they may not keep pointers to anything, not even their own
> fields. They are independent of their own memory location. The memory
> management code can therefore rearrange the tables at will (proper locking
> assumed), without having to reinitialize rules.

Good. I just don't know if this is overdesigned.

>  * All other memory is accessed through a struct that is passed to each
> rule/match/target's API functions. It contains at least .instance_data, but
> also .module_data (.priv_data), and perhaps other scopes data,
> like .rule_data, .chain_data and .global_data (all cross-module). Note that
> each of these are bound to a specific entity.

I agree.

>  * Each module and instance must call special netfilter API's to allocate
> memory of the required types. The netfilter part handles free'ing through
> refcount (why not).

If we don't have cross-module data (does anyone need it?) each module could do
its own housekeeping. It's difficult to know how to optimize other people's
data.

>  * The actual .*_data pointers may change between invocations (packets fed
> to) of the same rule/match/target. This means the netfilter part is allowed
> to rearrange dynamic memory too.

What if people want to keep pointers and other complex data structures? The 
instance data should be opaque to the core code. The risk is that people, not 
trusting this structure, will use it just to keep a pointer to the real data.

>  * Bonus: Sync of memory regions with other hosts can be handled
> transparently, or at least easily. So that fx. limit rules can work across
> redundant hosts.

Malus: a whole memory management system just for a subsystem of the kernel.
Too much semantics risks limiting what people want to do. Of course anarchy
has drawbacks too. I'd seek a middle ground where we handle the common case
and leave people free to implement exotic new things.

> I have no clear idea how all these individual blobs would be communicated
> between kernel and userspace. Except there are two general options:
>
> 1) The current "pass a large blob" scheme. Since it will contain many
> smaller blobs, some in-kernel parsing is required. Worse yet, the kernel
> must also be able to assemble a large blob in order to dump to userspace.

Either way we'll need some form of rule and match id.
I don't know what level of transactionality is desired. Currently 
iptables-restore is atomic and so are single changes with iptables. How much 
is needed with the new system? At least rule level atomicity is certainly 
desired, so we'll need to create duplicate data (just the core structure with 
pointers to the real descriptors) during modifications.
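
A minimal sketch of rule-level atomicity by duplicating only the core
structure, assuming rules are opaque objects and the table is a tuple of
references to them. Table and its methods are hypothetical names. This is a
single-threaded model of the idea only; real kernel code would need RCU or
locking for the swap.

```python
class Table:
    """Holds references to rule descriptors; readers take snapshots."""

    def __init__(self, rules=()):
        self._rules = tuple(rules)

    def snapshot(self):
        # A reader keeps working on whatever tuple it got here.
        return self._rules

    def replace_rule(self, index, new_rule):
        # Duplicate only the core structure (the references),
        # never the descriptors themselves.
        staged = list(self._rules)
        staged[index] = new_rule
        # The commit is one reference assignment: a reader sees either
        # the complete old table or the complete new one.
        self._rules = tuple(staged)
```

Untouched descriptors are shared between the old and the new table, so the
cost of a change is one copy of the reference array.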

> > I think every match/target should expode:
> > - init;
> > - destroy;
> > - change;
> > - dump;
> > - restore.
>
> change() would be nice, like in qdisc.

Does it really make sense? How many matches would have a different behaviour 
while changing instead of a full create-activate_new-destroy?

> Applause for possibly opening the can of worms :)

:)

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-14 21:12 new ABI Massimiliano Hofer
  2006-08-15  0:00 ` Joakim Axelsson
  2006-08-15 12:14 ` Simon Lodal
@ 2006-08-16 12:16 ` Joakim Axelsson
  2006-08-16 12:29   ` Joakim Axelsson
                     ` (4 more replies)
  2 siblings, 5 replies; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-16 12:16 UTC (permalink / raw)
  To: Massimiliano Hofer; +Cc: netfilter-devel

2006-08-14 23:12:41+0200, Massimiliano Hofer <max@nucleus.it> ->
> I think the current array oriented data structures won't allow us to add these 
> features. RCU lists come to mind. It sure is a step back in performance 
> (sparser access to memory and more memory fragmentation), but it may not be 
> that noticeable.
> I think every match/target should expode:
> - init;
> - destroy;
> - change;
> - dump;
> - restore.
> 
   match()/target()

I had some more realistic ideas that I'd like to share.

We keep the idea of rules (rather than essay, compiled or other form). To
make the new implementation as easy and as fast as possible, I think using
XML to express the firewall is a good way to go. Now before you turn on all
your negatives, here's why:

- We want the old (today's) iptables to be compliant.
- We want an easy library that implements the ABI to the kernel.
- We want to make the smallest possible effort writing a userspace tool,
  leaving the more advanced ones to others / other projects.

XML already has good parsers. We can very easily rewrite today's iptables to
output XML. XML has no real limits on how to express things. Of course,
other future userspace programs can use the new ABI library directly, without
the XML parser. However, for those who don't want to learn the new library,
XML is very easy, for humans, scripts and programs alike.

So we need one userspace library that talks the ABI with the kernel. We also
need a library using this ABI that parses XML files and passes them to the
ABI library. Finally, other userspace tools (that we do not need to write)
can dump from the kernel via kernel->ABI->XML and push XML rules back.

Later you can easily write a pre-parser that takes the firewall config in
"programming" form. Example:

if (packet.ipv4.source = 1.2.3.4 AND limit(2/s)) then 
	LOG(log-prefix, ...)
	jump(other-chain)
	DROP
end if;

The above might be much easier for a newbie to use. The best thing is that
we are pushing the "need" to write these kinds of tools to others as separate
projects. The common format is XML. No need for complex parsers and tricks
using getopt().

iptables is now easily ported like:
iptables <-> XML <-> ABI <-> kernel

Also keep in mind that we can allow several targets. In another mail I
previously talked about actions. They are not needed, I think, but might make
it easier to distinguish between jumps, ending targets, and targets that just
change or log packets. It should be perfectly legal already today to say:

iptables -m match -j other_chain -j LOG -j other_chain2 -j DROP -j TTL

Now, the last TTL will never be executed, but that's a user config choice.
The semantics of the above can't be misunderstood.
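
A minimal sketch of how a rule with several targets could be evaluated,
assuming a chain is a list of (match, targets) pairs and that only ACCEPT
and DROP end processing. run_targets and run_chain are hypothetical names;
non-ending targets such as LOG or TTL are only recorded here, not really
executed.

```python
TERMINATING = {"ACCEPT", "DROP"}


def run_targets(targets, packet, chains, done):
    """Run targets left to right; return a final verdict or None."""
    for target in targets:
        if target in TERMINATING:
            return target              # nothing after this ever runs
        if target in chains:           # a jump to a user-defined chain
            verdict = run_chain(chains[target], packet, chains, done)
            if verdict is not None:
                return verdict
        else:
            done.append(target)        # non-ending target: LOG, TTL, ...
    return None


def run_chain(rules, packet, chains, done):
    """Walk a chain; the first rule that yields a verdict wins."""
    for match, targets in rules:
        if match(packet):
            verdict = run_targets(targets, packet, chains, done)
            if verdict is not None:
                return verdict
    return None
```

With the target list from the command above, a packet that gets no verdict
in other_chain or other_chain2 is logged and dropped, and TTL is never
reached.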

For kernel-space:
-------------------
I think the above is good. Perhaps we don't need restore as it can be done
with dump, provided that dump emits both the initial state/config and the
current state.
- init()
- destroy()
- dump() / restore()
- change()
- match() / target() 

Very important is change(), which for example the recent match can use to
remove or add IPs in any recent list. There are several other matches that
can use this. Quota, for example: add or remove bytes in the pot. Far more
complex matches like ipset can use this as well.
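
A minimal sketch of the hooks listed above, applied to a quota-like match.
QuotaModule and its state layout are assumptions made for illustration; this
is not the existing quota match. dump() returns both the configured limit
and the current state, and change() adds or removes bytes in the pot without
recreating the rule.

```python
class QuotaModule:
    """Hypothetical match module exposing the hooks listed above."""

    def init(self, config):
        return {"limit": config["bytes"], "remaining": config["bytes"]}

    def destroy(self, state):
        state.clear()

    def dump(self, state):
        # Initial configuration and current state together.
        return {"limit": state["limit"], "remaining": state["remaining"]}

    def restore(self, dumped):
        return dict(dumped)

    def change(self, state, delta):
        # Positive delta adds bytes to the pot, negative removes them.
        state["remaining"] = max(0, state["remaining"] + delta)

    def match(self, state, packet_len):
        if packet_len > state["remaining"]:
            return False
        state["remaining"] -= packet_len
        return True
```

In this model restore() is just the inverse of dump(), which is the case
where a separate restore hook adds nothing beyond init plus dumped state.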

Also important is that each instance of a match/target in the kernel needs
its own memory space, either allocated by the iptables code or just a pointer
hook where the module can allocate and hook in its own. Just like priv_data,
but actually saved.

I also think RCU lists will help instead of tables/arrays. It's much more
common to add, change or remove a single rule than to add them all. I have a
router with some 1000 rules. It's a pain to change one of them. To save cache
misses we can preallocate memory, meaning use slabs for each list element.
We can't control the code of the modules, but again, we can't today either.

Another thing worth adding is a way of grouping rules. "This set of rules
belongs to customer 1 and this one to customer 2, and I'd like to list only
the rules related to customer 1". There are two approaches to implementing
this. The first is to tag each rule with the group it belongs to. The other
is to create sub-tables.

Also a good way of "finding your way" to each group of rules is needed.
Today I have a router with 4096 IPs (students' computers) behind it. Each IP
needs its own chain of rules. I won't go into why, but trust me, it's needed
and it's the only flexible way that I have found. I have created a sort of
binary tree of rules to make the access to each customer's chain as painless
as possible. This area needs work as well, I think. I've seen many people
asking if this exists. A simple solution might be to allow custom(?) modules
that implement different forms of jumping. Which group or chain do you want
to jump to? Well: iptables -m match -J ipmapjump (notice the big -J)
The recent "goto" jump also fits here.

Next, try to design this new iptables2/-ng so we don't need iptables3 in the
future. Rather add one too many unused hooks or void * parameters than one
too few.

Design this so we can have pkttables, meaning no need for separate tables
for iptables, ip6tables, arptables, ebtables. All in one. It's not that hard
really. Just perhaps a few more tables (nat, mangle, raw, filter, bridge
etc.) and move even the basic matching like source and dest address into
modules.

Summary:
+ Use XML to express firewall rules, because it's easy and backward
compatibility will be easy to port. It fits both human-written and scripted
rules. The tools are already there in tons of places.

+ init, destroy, dump/restore, change, match/target are needed as implemented
functions (or NULLs) for the matches / targets.

+ Use RCU lists in the kernel, because they are easier to edit.

+ Have smart ways of allocating memory in the kernel (slabs).

+ Allow several targets for one rule.

+ Perhaps separate ending targets from non-ending ones and jumps.

+ Allow custom jump modules, besides match and target modules.

+ Allow grouping of rules in some way. Really large firewalls need this.

+ We'd rather have one unused hook/void * too many than one too few for
future use. It won't waste that much memory.

+ Design all this into pkttables rather than focusing on IP/IPv6.

Thanks for your time reading this far :-)

--
Joakim Axelsson


* Re: new ABI
  2006-08-16 12:16 ` Joakim Axelsson
@ 2006-08-16 12:29   ` Joakim Axelsson
  2006-08-16 14:40   ` Joakim Axelsson
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-16 12:29 UTC (permalink / raw)
  To: Massimiliano Hofer, netfilter-devel

2006-08-16 14:16:53+0200, Joakim Axelsson <gozem@gozem.se> ->
> We keep the idea of rules (rather than essay, compiled or other form). To
> make the new implementation as easy and as fast as possible i think using
> XML to express the firewall is a good way to go. Now before you turn on all
> your negatives, here's why:
> 
> - We want the old (todays) iptables to be compliant.
> - We want an easy library that implements the ABI to kernel.
> - We want to make the smallest possible effort writing a userspace tool.
>   Leaving the more advacned for others / other projects.
> 
> XML already has good parsers. We can very easily rewrite todays iptables to
> output XML. XML has no real limits on how to express things. Ofcourse can
> other future userspace program use the new ABI-library directly, not using
> the XML-parser. However, for those who doesn't want to learn the new library
> XML is very easy. Both for humans, scripts and programs.
> 
> So we need one userspace library that talks the ABI with kernel. We also
> need a library using this ABI that parses XML-files and passes them to the
> ABI-library. Finally other userspace tools (that we do not need to write)
> that dumps from kernel passed by the kernel->ABI->XML and pushes XML rules.
> 
> iptables is now easily ported like:
> iptables <-> XML <-> ABI <-> kernel
> 

To demonstrate the greatness of XML already: I forgot to add in my previous
mail the need/wish for being able to list all state data of one module. For
example, I have a module that just makes counters, represented in /proc. I
can easily list them all with a simple "grep *" (gives both filename and
content). With XML this extra isn't needed. As long as I can get the full
kernel config and state in XML, I can apply my favorite XML parser and figure
out the data I need. :-)

--
Joakim Axelsson


* Re: new ABI
  2006-08-16 12:16 ` Joakim Axelsson
  2006-08-16 12:29   ` Joakim Axelsson
@ 2006-08-16 14:40   ` Joakim Axelsson
  2006-08-18 13:06   ` Simon Lodal
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-16 14:40 UTC (permalink / raw)
  To: Massimiliano Hofer, netfilter-devel

And some more. About logging, debug and hit counters.

I think we can remove the hit counters and allow a separate module to count
if needed. I'm sure most firewalls do not need counters on every rule. It's
just an expensive waste of locking in the kernel.

Also, people will want to log here and there, both for pure logging and for
debugging ("does my firewall work?"). I think it would be easier to allow
each rule to have three debugging flags: log on entering the rule, log on
matching the rule, and log on leaving the rule (after targets). This makes it
very easy to trace your firewall config. These flags should of course be
changeable later without having to remove and re-add the rule.
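
A minimal sketch of the three per-rule debugging flags, assuming a rule is a
(name, match, target) triple and trace events are appended to a list. Trace
and Rule are hypothetical names. One assumption made here: "leave" is only
reported when the rule matched and its target ran.

```python
from enum import Flag


class Trace(Flag):
    NONE = 0
    ENTER = 1   # log on entering the rule
    MATCH = 2   # log on matching the rule
    LEAVE = 4   # log on leaving the rule, after targets


class Rule:
    def __init__(self, name, match, target):
        self.name, self.match, self.target = name, match, target
        self.trace = Trace.NONE   # can be flipped while the rule stays put

    def evaluate(self, packet, events):
        if Trace.ENTER in self.trace:
            events.append((self.name, "enter"))
        if not self.match(packet):
            return None
        if Trace.MATCH in self.trace:
            events.append((self.name, "match"))
        verdict = self.target     # the target would run here
        if Trace.LEAVE in self.trace:
            events.append((self.name, "leave"))
        return verdict
```

Changing rule.trace on a live rule is all it takes to start or stop tracing;
the rule itself is never removed and re-added.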

For general logging (and debugging) we should remove -j LOG. Parsing the
packet layout is something for userspace. Debugging only has a "reserved"
netlink channel. There is one setback in doing this: if the machine gets
DoS-attacked, all the logging will be more or less disabled as the kernel
uses all of the available CPU, and you get nothing to help figure out the
attack vector (for counter firewall rules).

--
Joakim Axelsson


* Re: new ABI
  2006-08-16 12:16 ` Joakim Axelsson
  2006-08-16 12:29   ` Joakim Axelsson
  2006-08-16 14:40   ` Joakim Axelsson
@ 2006-08-18 13:06   ` Simon Lodal
  2006-08-18 21:40     ` Massimiliano Hofer
  2006-08-18 22:24   ` Massimiliano Hofer
  2006-08-22  8:46   ` Jozsef Kadlecsik
  4 siblings, 1 reply; 31+ messages in thread
From: Simon Lodal @ 2006-08-18 13:06 UTC (permalink / raw)
  To: gozem; +Cc: max, netfilter-devel


> Also keep in mind that we can allow several targets. I previously in
> another mail talked about actions. Its not needed i think, but might
> make it easier to distringuisch between jumps, ending targets and just
> changing and logging targets. It should be perfect legal already today
> to say:
>
> iptables -m match -j other_chain -j LOG -j other_chain2 -j DROP -j TTL

LOG + DROP in one rule would be a huge improvement. Even though it would
just reintroduce an ipchains feature.

> Now, the last TTL will never be executed, but thats a user config
> choise. The sematics of the above can't be missunderstood.
>
>
> For kernel-space:
> -------------------
> I think the above is good. Perhaps we don't need restore as it can be
> done with dump. Only that dump must dump both initial state/config and
> current state.
> - init()
> - destroy()
> - dump() / restore()
> - change()
> - match() / target()
>
> Much importat is the change() that for exampel recent match can use to
> remove or add IPs in any recent list. There are several other matches
> that can use this. Quota for example. Add or remove bytes in the pot.
> Far more complex matches like ipset can use this as well.

That could provide the basis for some of the dynamic userspace features
that people are often pointed to on this list, even though they do not yet
exist.

> Also a good way of "finding your way" to either group of rules is
> needed. Today i have a router with 4096 IPs (students computers)
> behind. The IPs all need its own chain of rules. I won't go into why,
> but trust me, it's needed and the only flexible way that i have found.
> I have created a sort of binary tree of rules trying to make the access
> of each customer as painless as possible. This area needs work as well
> i think. I've seen many people asking if this exists. Simple solution
> might be to allow custom(?) modules that implemet different forms of
> jumping. To which group or chain do you want to jump to? Well: iptables
> -m match -J ipmapjump (notice the big -J)  The recent "goto" jump also
> fits here.

I would like to do it in a generic way: introduce a "match index" variable
that can be set by matches and used by targets. A "--dports 1000:1023"
match has 24 possible matches, so it would set the index to between 0 and
23. The same can be done for IPs, sets, and all other matches that have a
finite set of possible matches and can enumerate them.
Now, it should be relatively simple to create a generic jump target that
uses the match index to jump to a specific subchain (I don't know exactly
how the list of subchains would be defined, but it should not be that
difficult). Other interesting targets might be a NAT that could NAT from/to
a base address plus match index.
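
A minimal sketch of the match index idea, assuming a match returns either
None (no match) or its index within the set of values it can match, and that
a target receives that index. All function names are hypothetical; the list
of subchains is simply a Python list here, since how it would be defined is
left open above.

```python
import ipaddress


def dports_match(low, high):
    """Match a destination port range and report which slot was hit."""
    def match(packet):
        port = packet["dport"]
        if low <= port <= high:
            return port - low      # the match index, 0 .. high - low
        return None
    return match


def jump_by_index(subchains):
    """Generic jump target: pick the subchain at the match index."""
    return lambda index: subchains[index]


def nat_by_index(base):
    """NAT-like target: base address plus match index."""
    return lambda index: str(ipaddress.ip_address(base) + index)


def evaluate(rule, packet):
    index = rule["match"](packet)
    if index is None:
        return None
    return rule["target"](index)
```

One rule with an indexed jump replaces a whole tree of per-port or
per-address rules, which is the large-router case described in the quoted
text.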

> Summary:
> + Use XML to express firewall rules. Because its easy and backward
> compability will be easily ported. It fits both human written and
> scripted rules. The tool is already there in tons of places.
>
> + init, destory, dump/restore, change, match/target is needed as
> implemented functions (or Nulls) for the match / targets.
>
> + Use RCU-list in kernel. Because its more editable.
>
> + Have smart ways of allocate memory in kernel (slabs).
>
> + Allow sveral targets for one rule.
>
> + Perhaps seperate ending targets form non ending and jumps.
>
> + Allow customs jump modules, beside match and target modules.
>
> + Allow grouping of rules in some way. Really large firewall needs
> this.
>
> + We rather have one too many hooks/void *, unused rather than one too
> few for futhure use. It won't waste that much memory.
>
> + Design all this into pkttables rather than focus on IP/IPv6.


I agree with all your points, perhaps except the XML part ... I am one of
those non-converts. But you may be right anyway. It would be nice to have
a standard way to define a ruleset, as descriptive data rather than
commands.

Regards,
Simon


* Re: new ABI
  2006-08-15 22:57   ` Massimiliano Hofer
@ 2006-08-18 14:14     ` Simon Lodal
  2006-08-18 21:40       ` Massimiliano Hofer
  2006-08-18 14:50     ` Amin Azez
  2006-08-23 18:06     ` Sven Anders
  2 siblings, 1 reply; 31+ messages in thread
From: Simon Lodal @ 2006-08-18 14:14 UTC (permalink / raw)
  To: max; +Cc: netfilter-devel

>> Everybody has a long wishlist and seem to agree that something
>> fundamental needs to be done.
>>
>> The question seems to be when backwards compatibility can be given up.
>
> Everyone agrees that we have reached the maximum expressiveness with
> the  current system.

You mean we have created the ideal system?!

Or that we have created a mess that is no longer extendable?

> Nobody says that we couldn't keep a way to convert old rules in the new
>  system.
> The real question thus becomes: is it worh to restart from (almost)
> scratch?

Sometimes you can have something entirely different in mind and still make
incremental changes.
The iptables syntax/interface as seen by the user is far from stellar but
perhaps good enough for its purposes. I do not see any urgent need to
change its syntax.
But the APIs suck. Good ideas get nowhere because the APIs cannot
support them. Is that really controversial? My point is they need to change,
and it will be incompatible, too bad, but it has to be done some day.

>> > What people need from any new infrastructure:
>> > - cleaner interface with clearer separation between kernel and user
>> >   data;
>> > - ability to dump internal state of matches/targets (this may not be
>> >   in a 1-to-1 relation, so it may be tricky, do we need module state
>> >   dumping?);
>>
>> Yes, but why should that be hard? Netfilter should already have a list
>> of registered modules.
>
> Yes, but iptables has no way to manipulate per-module data (eg:
> collection of  names and flags for condition, but there are plenty
> other examples). I don't think it would be difficult, even without a
> total redesign. I was  testing the ground for ideas and real needs.

I agree.

>> We are going to have "interesting" data that are not 1:1 with rules.
>> But then they will be 1:1 with modules, or some other "scope" that
>> netfilter
>
> Make it n:1. I don't think n:n is desirable.

It is not n:n. It is just more 1:1's.


>> knows how to traverse. Each "scope" can have their own section in the
>> iptables-save output. Hence the parsing complexity lies in
>> iptables-restore.
>>
>> Whether it is all going to be exposed in some filesystem or not is a
>> different matter.
>
> I like file interfaces, but not everything readily becomes a file. It
> all  depends on what people really want to do with this class of data.
>
>> What is the version after it going to be then? No, I never liked the
>> -ng suffix :)
>>
>> What is wrong with iptables2?
>
> OK. We had ipfwadm and ipchains. So we're really more like iptables4.
> :)

That might do.


>> Flexibility is not free, but perhaps it can be cheap, performance
>> wise.
>>
>> Let's say we make iptables more shell-like, with the ability to handle
>> multiple commands in one invocation (with a final COMMIT command
>> required)? Would be lovely in itself.
>>
>> Then iptables would get a better chance to optimize memory allocation,
>> since it is not only looking at one rule at a time.
>>
>> The case where you load the entire firewall ruleset in one go could be
>> optimized to a point where it is no different from today.
>
> This if we assume we know the sizes of everything. I think
> matches/targets  need to have a chance to influence their own data (now
> they can't).

Correct, and it is a major annoyance.


> We'll have:
> - general data structures (fixed);
> - match/target descriptor (passed by userspace and of known size);
> - match/target runtime data (potentially anything from a single byte to a
>   dynamic structure).
>
> Currently matches/targets are fed the descriptor. I'd like them to be
> fed a  descriptor and their runtime data. We can suppose the latter
> won't be needed  by every match, so it won't impact performance.
> We still got a fixed size data structure that we can
> move/compact/rewrite and  a descriptor that we can potentially move (we
> could move it if people weren't  abusing it for lack of runtime data)
> but with variable sized.
> The first one can become a simple allocation in list node array (with
> some  mechanism for growing and shrinking). The descriptors are a
> little more  tricky and we would need stricter specifications in order
> to do proper  repacking.

Sounds reasonable.

> Before we continue work on a non-problem: do we have data about kernel
> memory  fragmentation and performance issues?

I would love to know that too!


>>  * ipt_entry* structs might contain data (like basic
>>  src/dst/port/iface
>> matches), but they may not keep pointers to anything, not even their
>> own fields. They are independent of their own memory location. The
>> memory management code can therefore rearrange the tables at will
>> (proper locking assumed), without having to reinitialize rules.
>
> Good. I just don't know if this is overdesigned.

Perhaps yes. It is irrelevant unless there really is a fragmentation issue.


>>  * All other memory is accessed through a struct that is passed to
>>  each
>> rule/match/target's API functions. It contains at least
>> .instance_data, but also .module_data (.priv_data), and perhaps other
>> scopes data,
>> like .rule_data, .chain_data and .global_data (all cross-module). Note
>> that each of these are bound to a specific entity.
>
> I agree.
>
>>  * Each module and instance must call special netfilter API's to
>>  allocate
>> memory of the required types. The netfilter part handles free'ing
>> through refcount (why not).
>
> If we don't have cross-module data (does anyone need it?) each module
> could do  it's housekeeping. It's difficult to know how to optimize
> other people's  data.

The idea is just to make it less error-prone to write match/target
modules; the fewer free() calls you need, the fewer memory leaks we get.
Since you can only have one .instance_data pointer, the old one should be
deallocated if you allocate another. Why not let netfilter do that? You
would just tell netfilter how much memory you need, and it would just
deliver that, and guarantee against accidental memory leaks in individual
modules.

>>  * The actual .*_data pointers may change between invocations (packets
>>  fed
>> to) of the same rule/match/target. This means the netfilter part is
>> allowed to rearrange dynamic memory too.
>
> What if people want to keep pointers and other complex data structures?
> The  instance data should be opaque to the core code. The risk is that
> people, not  trusting this structure, will use it just to keep a
> pointer to the real data.

If someone really wants to break the rules they can.

Here, the only rule is: No pointers. Use local offsets instead if you
really need to "point".
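
A minimal sketch of the "no pointers, use local offsets" rule, assuming
instance data is a byte blob inside a larger arena. The header layout and
the names pack_instance, read_name and relocate are made up for
illustration. The blob stores where its payload is relative to its own
start, so the core can move it without telling the module.

```python
import struct

# Header: (offset of the name, length of the name), both relative to the
# start of the blob, never absolute addresses.
HEADER = struct.Struct("<HH")


def pack_instance(name):
    return HEADER.pack(HEADER.size, len(name)) + name


def read_name(memory, base):
    offset, length = HEADER.unpack_from(memory, base)
    start = base + offset
    return bytes(memory[start:start + length])


def relocate(memory, old_base, size, new_base):
    """Move a blob inside the arena, as the core might when compacting."""
    blob = bytes(memory[old_base:old_base + size])
    memory[old_base:old_base + size] = bytes(size)   # scrub the old spot
    memory[new_base:new_base + size] = blob
```

After relocate() the module only needs the new base address; everything
inside the blob still resolves, because nothing in it refers to the old
location.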

>>  * Bonus: Sync of memory regions with other hosts can be handled
>> transparently, or at least easily. So that fx. limit rules can work
>> across redundant hosts.
>
> Malus: a whole memory management system just for a subsystem of the
> kernel.  Too much semantics risks to limit what people want to do. Of
> course anarchy  has drawbacks too. I'd seek a middle ground where we
> handle the common case  and leave people free to implement exotic new
> things.

The most important goal is to pass a number of pointers (descriptors) to
the modules, each being shared in different ways (and having dynamic size,
preferably). It might work if any module could (re)allocate those shared
memory areas independently, but it would be much simpler if netfilter
allocation wrappers handled synchronization between them.
We already have a memory management subsystem; we try to manage how things
are located. Unfortunately it is very dumb, which is the reason for many
of the problems we have.
I am no fan of a new complex memory management subsystem in-kernel. It is
only a suggestion for a solution to the fragmentation issue, if it really
exists.

>> I have no clear idea how all these individual blobs would be
>> communicated between kernel and userspace. Except there are two
>> general options:
>>
>> 1) The current "pass a large blob" scheme. Since it will contain many
>> smaller blobs, some in-kernel parsing is required. Worse yet, the
>> kernel must also be able to assemble a large blob in order to dump to
>> userspace.
>
> Either way we'll need some form of rule and match id.
> I don't know what level of transactionality is desired. Currently
> iptables-restore is atomic and so are single changes with iptables. How
> much is needed with the new system? At least rule level atomicity is
> certainly desired, so we'll need to create duplicate data (just the
> core structure with pointers to the real descriptors) during
> modifications.

I agree with that desire. But it totally rules out a filesystem
representation, I guess. Not that I really want an iptablesfs.
I guess some hierarchical locking would be necessary.



Regards,
Simon


* Re: new ABI
  2006-08-15 22:57   ` Massimiliano Hofer
  2006-08-18 14:14     ` Simon Lodal
@ 2006-08-18 14:50     ` Amin Azez
  2006-08-23 18:06     ` Sven Anders
  2 siblings, 0 replies; 31+ messages in thread
From: Amin Azez @ 2006-08-18 14:50 UTC (permalink / raw)
  To: netfilter-devel

* Massimiliano Hofer wrote, On 15/08/06 23:57:
> If we don't have cross-module data (does anyone need it?) 

All my cross-module data is per connection and so can be kept in the
conntrack.

Sam


* Re: new ABI
  2006-08-18 14:14     ` Simon Lodal
@ 2006-08-18 21:40       ` Massimiliano Hofer
  0 siblings, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-18 21:40 UTC (permalink / raw)
  To: Simon Lodal; +Cc: netfilter-devel

On Friday 18 August 2006 4:14 pm, Simon Lodal wrote:

> > Everyone agrees that we have reached the maximum expressiveness with
> > the current system.
>
> You mean we have created the ideal system?!
>
> Or that we have created a mess that is no longer extendable?

Something in between. :)
You can do incremental improvements, but some people are asking for structural
changes in order to achieve other goals.

I'm not dissatisfied with the current code. The whole purpose of this thread 
is to understand if there is something worth an extensive change of the API 
(between core netfilter and its modules), possibly maintaining the ABI.

I was just seeking new ideas for the sake of it. Then we'll see what's
worthwhile, what can be done with what we have today, what is just too
troublesome and what is just too idealistic to achieve in the near future
(but potentially interesting).

The hard part will come when someone has to do the real work. ;)

> > The real question thus becomes: is it worth restarting from (almost)
> > scratch?
>
> Sometimes you can have something entirely different in mind and still make
> incremental changes.
> The iptables syntax/interface as seen by the user is far from stellar but
> perhaps good enough for its purposes. I do not see any urgent need to
> change its syntax.

Neither do I, but it's mostly a matter of taste.

> But the APIs suck. Good ideas get nowhere because the APIs cannot
> support them. Is that really controversial? My point is they need to change,
> and it will be incompatible, too bad, but it has to be done some day.

The question is: how much can we change the API without affecting the ABI?
Most things can be added incrementally, but I mostly started this thread 
because Patrick complained about the lack of a way to change individual rules 
or matches.
I think this will be the hardest feature yet proposed.

> The idea is just to make it less error prone to write match/target
> modules; the fewer free() calls you need, the fewer memory leaks we get.
> Since you can only have one .instance_data pointer, the old one should be
> deallocated if you allocate another. Why not let netfilter do that? You
> would just tell netfilter how much memory you need, and it would just
> deliver that, guarding against accidental memory leaks in individual
> modules.

We'd need a two-stage initialization. Something like:
- a simple init for matches that don't need .(priv|instance)_data or that
declare a fixed size in the match registration;
- a size-determining call followed by the real init for dynamic ones.

This could be tricky for complex data, and such cases could be rare enough to
justify a fixed base structure plus more complex data completely managed by
the match module. People might end up doing that anyway if what we do isn't
enough (after all, we could supply arbitrarily sized structures, but only at
init time). Maybe the simple solution would be enough.
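
The two variants above can be sketched as follows. This is only a rough
model in Python: MatchOps, get_priv_size and core_setup are hypothetical
names, not an existing netfilter interface. The core asks for a size (fixed
at registration, or computed from the user configuration), allocates the
private area itself, and only then calls the module's init:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MatchOps:
    """Hypothetical registration record for a match module."""
    name: str
    init: Callable[[bytes, bytearray], None]   # (user config, priv_data)
    priv_size: int = 0                         # fixed size, known at registration
    get_priv_size: Optional[Callable[[bytes], int]] = None  # stage 1, dynamic only


def core_setup(ops: MatchOps, config: bytes) -> bytearray:
    """Core side: determine the size, allocate, then run the real init."""
    size = ops.get_priv_size(config) if ops.get_priv_size else ops.priv_size
    priv = bytearray(size)      # owned by the core, so the core can free it
    ops.init(config, priv)
    return priv


def quota_init(config, priv):
    priv[:8] = config[:8]       # copy the initial quota into private data


def table_size(config):
    return 4 * config[0]        # one 4-byte slot per configured entry


def table_init(config, priv):
    pass                        # slots simply start out zeroed


quota = MatchOps("quota", quota_init, priv_size=8)
table = MatchOps("table", table_init, get_priv_size=table_size)

print(len(core_setup(quota, (1000).to_bytes(8, "little"))))
print(len(core_setup(table, bytes([16]))))
```

Because the module never allocates, there is nothing for it to leak; the
limitation is the one noted above, that the size is fixed at init time.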

> Here, the only rule is: No pointers. Use local offsets instead if you
> really need to "point".

No "local" pointers. Pointers to external data will work.

With these requirements we could keep the current copy and discard mechanism.
We could have a match array (mostly like the current one) and supplement it 
with a (priv|instance)_data array with size and offsets computed with a quick 
pass through the first one. With proper locking we could copy the necessary 
data with no memory fragmentation, and no lists.
Of course we have a major disadvantage: the current code can afford to build 
the new array without locking. The list approach can lock single nodes or the 
whole list for the time needed to change a single node. This last proposal 
would need to lock everything while it copies what could be thousands of 
rules.
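
To make the "quick pass" concrete, here is a small sketch in Python. The
8-byte alignment and the function names are assumptions made for the example.
One pass over the per-match sizes yields every offset and the total size,
and a copy-and-discard rebuild carries surviving state into a fresh area:

```python
ALIGN = 8  # assumed alignment of every private area


def align_up(n, align=ALIGN):
    return (n + align - 1) // align * align


def layout(priv_sizes):
    """One pass over the match array: return (offsets, total size)."""
    offsets, pos = [], 0
    for size in priv_sizes:
        offsets.append(pos)
        pos += align_up(size)
    return offsets, pos


def rebuild(old_blob, old_sizes, new_sizes, kept):
    """Copy-and-discard: build a fresh blob, carrying over surviving state.

    kept maps an index in the new match array to its index in the old one.
    """
    old_off, _ = layout(old_sizes)
    new_off, total = layout(new_sizes)
    new_blob = bytearray(total)
    for new_i, old_i in kept.items():
        n = old_sizes[old_i]
        src = old_blob[old_off[old_i]:old_off[old_i] + n]
        new_blob[new_off[new_i]:new_off[new_i] + n] = src
    return new_blob, new_off


print(layout([4, 12, 8]))

# Insert a new 12-byte match between two existing ones.
old = bytearray(16)
old[0:4] = b"\x01\x02\x03\x04"
old[8:16] = b"ABCDEFGH"
new_blob, new_off = rebuild(old, [4, 8], [4, 12, 8], {0: 0, 2: 1})
print(new_off, bytes(new_blob[24:32]))
```

The rebuild loop is exactly the part that would have to run under the lock,
which is where the cost for thousands of rules comes from.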

The main question remains this one: are we really scared of fragmentation?
I'll do some investigation, but I don't know if I'll have an answer.

> The most important goal is to pass a number of pointers (descriptors) to
> the modules, each being shared in different ways (and having dynamic size,
> preferably). It might work if any module could (re)allocate those shared
> memory areas independently, but it would be much simpler if netfilter
> allocation wrappers handled synchronization between them.

You're describing priv_data. :)

> > Either way we'll need some form of rule and match id.
> > I don't know what level of transactionality is desired. Currently
> > iptables-restore is atomic and so are single changes with iptables. How
> > much is needed with the new system? At least rule level atomicity is
> > certainly desired, so we'll need to create duplicate data (just the
> > core structure with pointers to the real descriptors) during
> > modifications.
>
> I agree with that desire. But it totally rules out a filesystem
> representation, I guess. Not that I really want an iptablesfs.
> I guess some hierarchical locking would be necessary.

I mentioned the file system approach just for the sake of it. I like the 
everything-is-a-file approach, but it certainly has its limits.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-18 13:06   ` Simon Lodal
@ 2006-08-18 21:40     ` Massimiliano Hofer
  0 siblings, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-18 21:40 UTC (permalink / raw)
  To: netfilter-devel

On Friday 18 August 2006 3:06 pm, Simon Lodal wrote:

> > Also keep in mind that we can allow several targets. I previously in
> > another mail talked about actions. It's not needed, I think, but might
> > make it easier to distinguish between jumps, ending targets and just
> > changing and logging targets. It should be perfectly legal already today
> > to say:
> >
> > iptables -m match -j other_chain -j LOG -j other_chain2 -j DROP -j TTL
>
> LOG + DROP in one rule would be a huge improvement. Even though it would
> just reintroduce an ipchains feature.

I like the idea of actions. I could perform separate types of mangling and
other non-"terminal" things with separate rules without worrying about
precedence and specific combinations. The current use of "--continue" with
some, but not all, targets really should be handled in a more general way,
and actions look like a good solution to me.

> I would like to do it in a generic way: Introduce a "match index" variable
> that can be set by matches and used by targets. A "--dports 1000:1023"
> match has 24 possible matches, so it would set the index to between 0 and
> 23. Same can be done for IP, sets; all other matches that have a finite
> set of possible matches and can enumerate them.

What if we just assign a numeric index to every rule (plus an additional index
for individual matches)? This would let us identify rules for future changes,
but we could go a step further and let people choose a specific label if they
want to.
This way we could jump to a separate chain or just to a label two rules away.
If we combine this with my proposal for "functional chains" we could represent
a whole lot of complex rulesets with far fewer rules than today.

> I agree with all your points, perhaps except the XML part ... I am one of
> those non-converts. But you may be right anyway. It would be nice to have
> a standard way to define a ruleset, as descriptive data rather than
> commands.

I'm a non-convert too, but perhaps it doesn't matter. The final userspace 
representation is irrelevant to the kernel and might be a matter of a few 
additional scripts.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-16 12:16 ` Joakim Axelsson
                     ` (2 preceding siblings ...)
  2006-08-18 13:06   ` Simon Lodal
@ 2006-08-18 22:24   ` Massimiliano Hofer
  2006-08-22  8:46   ` Jozsef Kadlecsik
  4 siblings, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-18 22:24 UTC (permalink / raw)
  To: netfilter-devel

On Wednesday 16 August 2006 2:16 pm, Joakim Axelsson wrote:

> We keep the idea of rules (rather than essay, compiled or other form). To
> make the new implementation as easy and as fast as possible I think using
> XML to express the firewall is a good way to go. Now before you turn on all
> your negatives, here's why:
>
> - We want the old (today's) iptables to be compliant.
> - We want an easy library that implements the ABI to kernel.
> - We want to make the smallest possible effort writing a userspace tool,
>   leaving the more advanced ones for others / other projects.

We want the existing ABI to continue working. Having people recompile iptables
with a kernel change is not a short-term proposition. I'd say it's a kernel 3.0
matter.
Using XML between user utilities/scripts and a new version of iptables is
certainly feasible, but it already is today.

> XML already has good parsers. We can very easily rewrite today's iptables to
> output XML. XML has no real limits on how to express things. Of course other
> future userspace programs can use the new ABI library directly, not using
> the XML parser. However, for those who don't want to learn the new
> library XML is very easy, for humans, scripts and programs alike.

This is good for automatic manipulation. I'm not convinced it will be easier 
to digest for humans.

> So we need one userspace library that talks the ABI with kernel. We also
> need a library using this ABI that parses XML-files and passes them to the
> ABI-library. Finally other userspace tools (that we do not need to write)
> that dumps from kernel passed by the kernel->ABI->XML and pushes XML rules.

I'm all for a good library. XML might just be one of its representation 
plugins.

> if (packet.ipv4.source = 1.2.3.4 AND limit(2/s)) then
> 	LOG(log-prefix, ...)
> 	jump(other-chain)
> 	DROP
> end if;
>
> The above might be much easier for a newbie to use. The best thing is that
> we are pushing the "need" to write these kinds of tools to others as
> separate projects. The common ground is XML. No need for complex parsers
> and tricks using getopt().

I'm not convinced that this will be much easier to parse, but I was expecting
more of a tag hell. Maybe you're right. Count me as still undecided.

> iptables is now easily ported like:
> iptables <-> XML <-> ABI <-> kernel

The old iptables needs to continue working. We can't introduce layers between 
it and the current kernel ABI.

> Also keep in mind that we can allow several targets. I previously in
> another mail talked about actions. It's not needed, I think, but might make
> it easier to distinguish between jumps, ending targets and just changing
> and logging targets. It should be perfectly legal already today to say:
>
> iptables -m match -j other_chain -j LOG -j other_chain2 -j DROP -j TTL
>
> Now, the last TTL will never be executed, but that's a user config choice.
> The semantics of the above can't be misunderstood.

I like this idea. Maybe I would use a different parameter for actions ("-a"?), 
but it's interesting.

> I think the above is good. Perhaps we don't need restore as it can be done
> with dump. Only that dump must dump both initial state/config and current
> state.
> - init()
> - destroy()
> - dump() / restore()
> - change()
> - match() / target()
>
> Very important is change(), which for example the recent match can use to
> remove or add IPs in any recent list. There are several other matches that
> can use this. Quota for example: add or remove bytes in the pot. Far more
> complex matches like ipset can use this as well.

How would you represent single match changes with your XML implementation?
It sure would be expressive enough for any kind of module data.

> I also think RCU-list will help instead of tables/arrays. It's much more
> common to add/change or remove a rule than add them all. I have a router
> with some 1000 rules. It's a pain to change one of them. To save cache misses

I usually wipe everything and write from scratch. All my firewalls are 
generated (at least partially) and I don't want to risk making mismatched 
changes in the current firewall and the generating script.
I'd need several much more powerful primitives to abandon the approach.

Anyway, I agree on the usefulness of RCU lists.

> Another thing that could use an add is the way of grouping rules. "These
> sets of rules belong to customer 1 and these to customer 2. And I'd like to
> only list all rules related to customer 1". Now there are two approaches to
> implement this. The first one is to tag the rule with what rule or which
> group it belongs to. Another way is to create sub-tables.
>
> Also a good way of "finding your way" to either group of rules is needed.
> Today I have a router with 4096 IPs (students' computers) behind it. The IPs
> all need their own chain of rules. I won't go into why, but trust me, it's
> needed and the only flexible way that I have found. I have created a sort
> of binary tree of rules trying to make the access of each customer as
> painless as possible. This area needs work as well, I think. I've seen many
> people asking if this exists. A simple solution might be to allow custom(?)
> modules that implement different forms of jumping. To which group or chain
> do you want to jump? Well: iptables -m match -J ipmapjump (notice the
> big -J)
> The recent "goto" jump also fits here.

You want some sort of multijump?
I had this kind of problem too. ipmapjump seems too specific. We'd really need 
a rule compiler/optimizer to handle this in an efficient way.

In your language it would be something like:

switch(packet.ipv4.source & 0xFF)
   case 1: ...
   ...
end switch;

This would be several orders of magnitude more complex than the current system
(although magnificent).
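
As a rough illustration of what such a multijump does at run time, here is a
Python sketch. The chain names and the jump table are hypothetical, and the
low byte is taken with a bitwise AND. A single table lookup replaces a linear
(or hand-built binary tree) walk over per-host rules:

```python
import ipaddress


def low_byte(addr):
    """Last octet of an IPv4 address, i.e. source & 0xFF."""
    return int(ipaddress.IPv4Address(addr)) & 0xFF


# Hypothetical jump table: low byte of the source -> per-host chain name.
jump_table = {i: f"host_{i}" for i in range(1, 255)}


def multijump(src, default="DROP"):
    """One lookup selects the chain; unknown indexes fall to the default."""
    return jump_table.get(low_byte(src), default)


print(multijump("10.0.0.7"))
print(multijump("10.0.0.255"))
```

The hard part is not the lookup itself but having a compiler that decides
when a set of rules can be turned into such a table.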

> Next, try to design this new iptables2/-ng so we don't need iptables3 in
> the future. Rather add one too many unused hooks, void * passed parameter
> than one too few.

Of course this is an illusion. Nothing short of Turing-completeness will 
prevent iptables3. Even so someone will ask for an OO-iptables, a 
generic-iptables, etc.
We can try to keep people satisfied for the next few years and plan for the 
unplanned, but there is always something that is really unplannable and 
unforeseeable.

> Design this so we can have pkttables. Meaning no need for separate tables
> for iptables, ip6tables, arptables, ebtables. All in one. It's not that hard
> really. Just perhaps a few more tables (nat, mangle, raw, filter, bridge
> etc.) and move even the basic matching like source and dest address into
> modules.

OK.

> Summary:
> + Use XML to express firewall rules. Because it's easy and backward
> compatibility will be easily ported. It fits both human written and scripted
> rules. The tool is already there in tons of places.

Not convinced yet, but keep insisting. :)
Of course we'd need someone to do it. :)

> + init, destroy, dump/restore, change, match/target are needed as
> implemented functions (or NULLs) for the matches / targets.

OK.

> + Use RCU-list in kernel. Because its more editable.

Mostly OK, for lack of better alternatives.

> + Have smart ways of allocate memory in kernel (slabs).

OK.

> + Allow several targets for one rule.
> + Perhaps separate ending targets from non-ending ones and jumps.

OK.
I'd separate flow-changing targets from packet-altering actions (or actions
with some other side effect).
Actions would be like current targets with a "--continue" parameter, and we
could combine them with a target to make processing stop.

> + Allow custom jump modules, besides match and target modules.

What exactly are you proposing?

> + Allow grouping of rules in some way. Really large firewall needs this.

Proposal?
What about function-chains that we could use as a match?

> + We'd rather have one too many hooks/void *, unused, than one too few
> for future use. It won't waste that much memory.

Of course.

> + Design all this into pkttables rather than focus on IP/IPv6.

OK.

> Thanks for your time reading this far :-)

:)

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-16 12:16 ` Joakim Axelsson
                     ` (3 preceding siblings ...)
  2006-08-18 22:24   ` Massimiliano Hofer
@ 2006-08-22  8:46   ` Jozsef Kadlecsik
  2006-08-23  5:01     ` Patrick McHardy
  2006-08-23 21:13     ` Massimiliano Hofer
  4 siblings, 2 replies; 31+ messages in thread
From: Jozsef Kadlecsik @ 2006-08-22  8:46 UTC (permalink / raw)
  To: Joakim Axelsson; +Cc: Massimiliano Hofer, netfilter-devel

Hi,

[The whole thread is really inspiring, but as a late-comer I'm just jumping
into the middle of it.]

On Wed, 16 Aug 2006, Joakim Axelsson wrote:

> Summary:
> + Use XML to express firewall rules. Because it's easy and backward
> compatibility will be easily ported. It fits both human written and scripted
> rules. The tool is already there in tons of places.

I think this is the surface for the users, i.e. how to express the rules in
userspace. I strongly believe Harald has got it absolutely right that
what we need is a plugin architecture and library which makes it possible
to write interfaces using a commandline style or XML, or to write a
graphical interface or whatever. (Or do you propose to implement an XML
parser in kernel space? ;-)

What is at least equally important is the kernel-userspace communication.
In what form should the data be passed? Using what kind of kernel-userspace
communication method?

For the former I believe the (extended) attribute subsystem stolen from
nfnetlink is probably the most flexible and still the simplest solution.
Of course it must be morphed and extended a little bit, for example by
adding "bunch/array of foo type" types to pass huge number of same type of
data more efficiently, or to gracefully handle unknown attribute types
(intended for another match/target or new unknown attribute for a given
match/target). Let's forget about passing binary blobs with magical
pointers/offsets back and forth!
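
To sketch what attribute-based passing buys us, here is a minimal Python
model loosely following the length/type/payload layout with 4-byte padding
that netlink attributes use. The attribute numbers, the little-endian byte
order and the helper names are assumptions for the example only. A parser
can step over attributes it does not know, and an "array of u32" is just one
attribute with a longer payload:

```python
import struct

HDR = struct.Struct("<HH")   # u16 length (header + payload), u16 type
ALIGNTO = 4


def pad(n):
    return (n + ALIGNTO - 1) & ~(ALIGNTO - 1)


def put_attr(atype, payload):
    length = HDR.size + len(payload)
    return HDR.pack(length, atype) + payload + b"\0" * (pad(length) - length)


def put_u32_array(atype, values):
    """Many values of the same type travel in a single attribute."""
    return put_attr(atype, struct.pack(f"<{len(values)}I", *values))


def parse_attrs(buf, known):
    """Return ({type: payload}, skipped types); unknown types are not errors."""
    attrs, skipped, off = {}, [], 0
    while off + HDR.size <= len(buf):
        length, atype = HDR.unpack_from(buf, off)
        if length < HDR.size or off + length > len(buf):
            raise ValueError("truncated attribute")
        if atype in known:
            attrs[atype] = buf[off + HDR.size:off + length]
        else:
            skipped.append(atype)
        off += pad(length)
    return attrs, skipped


# Hypothetical attribute numbers for a "limit"-style match.
ATTR_RATE, ATTR_PORTS, ATTR_FUTURE = 1, 2, 99

msg = (put_attr(ATTR_RATE, struct.pack("<I", 2))
       + put_attr(ATTR_FUTURE, b"xyz")           # something we do not understand
       + put_u32_array(ATTR_PORTS, [21, 23, 25]))

attrs, skipped = parse_attrs(msg, known={ATTR_RATE, ATTR_PORTS})
print(struct.unpack("<I", attrs[ATTR_RATE]), skipped)
print(struct.unpack("<3I", attrs[ATTR_PORTS]))
```

Nothing in the message depends on where it sits in memory, so there are no
pointers or offsets to fix up on either side.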

My views on how to communicate are a little bit old-fashioned, as I'm
perfectly satisfied with sockopt. Yes, dusty. And old. Maybe not even cool,
but a hack. But reliable, which absolutely cannot be said about netlink.
I shudder to think about a situation where a DDoS attack could prevent me
from reconfiguring a firewall, or just delay it. And please note that even
if we accept the delay, it assumes that we are willing to face reimplementing
a TCP-like protocol over netlink to guarantee reliable message passing.

I cannot deny, a filesystem-like channel is tempting, but I think it's
simply not flexible enough.

> + init, destroy, dump/restore, change, match/target are needed as implemented
> functions (or NULLs) for the matches / targets.

For the kernel space, yes. In userspace we need a little bit more: option
(argument) list with allowed types and hooks for the parsers
(cmdline/XML/whatever), even probably embedded help.

For fun I propose to name the new commandline utility simply 'nf' :-).

> + Use RCU-list in kernel. Because its more editable.

I won't give up the hope of integrating nf-hipac one day. In other words, the
RCU-list must be one possible way to list the rules and other methods
must be supported as well. Let us not fall in the trap to hardcode this
part of the system.

Think about expert systems built on top of iptables/nf by which one can
write policies instead of the exact rules and which translates and expands
the policies into the appropriate rules. Such systems could even choose
the most appropriate in-kernel rule listing for the given created chain or
sub-table.

> + Have smart ways of allocate memory in kernel (slabs).
>
> + Allow several targets for one rule.

Yes. Your example of multiple non-terminal, jump and terminal targets in
one rule proves that the semantics are still clear.

> + Perhaps separate ending targets from non-ending ones and jumps.

I think that it is required because thus the userspace tool can warn about
targets mistakenly listed after a terminal target.
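
A minimal sketch of such a check, in Python. The set of terminal targets is
an assumption here; a real tool would learn it from each target module, and
jumps are treated as non-terminal for simplicity. The rule used is the
multi-target example from earlier in the thread:

```python
# Assumed classification; a real tool would query the target modules.
TERMINAL = {"ACCEPT", "DROP", "REJECT"}


def unreachable_targets(targets):
    """Return the targets listed after the first terminal one."""
    for i, target in enumerate(targets):
        if target in TERMINAL:
            return targets[i + 1:]
    return []


# -j other_chain -j LOG -j other_chain2 -j DROP -j TTL
rule = ["other_chain", "LOG", "other_chain2", "DROP", "TTL"]
for target in unreachable_targets(rule):
    print(f"warning: target {target} follows a terminal target and never runs")
```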

> + Allow custom jump modules, besides match and target modules.

That's cool but sanity (and rule-integrity) must of course be preserved.

> + Allow grouping of rules in some way. Really large firewall needs this.

Sub-tables can probably help. It seems non-trivial to me how to avoid losing
efficiency, though. Multiple toplevel tables (i.e. active/passive,
active/backup) should also be considered.

> + We'd rather have one too many hooks/void *, unused, than one too few
> for future use. It won't waste that much memory.
>
> + Design all this into pkttables rather than focus on IP/IPv6.

Yes. And I'd add one more principle:

+ All parts must be designed to take into account rule- and
  (match/target) state-replication between firewalls in active-active
  setups.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: new ABI
  2006-08-22  8:46   ` Jozsef Kadlecsik
@ 2006-08-23  5:01     ` Patrick McHardy
  2006-08-23 13:48       ` Joakim Axelsson
  2006-08-24  8:50       ` Jozsef Kadlecsik
  2006-08-23 21:13     ` Massimiliano Hofer
  1 sibling, 2 replies; 31+ messages in thread
From: Patrick McHardy @ 2006-08-23  5:01 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Massimiliano Hofer, netfilter-devel

Jozsef Kadlecsik wrote:
> My views on how to communicate are a little bit old-fashioned, as I'm
> perfectly satisfied with sockopt. Yes, dusty. And old. Maybe not even cool,
> but a hack. But reliable, which absolutely cannot be said about netlink.
> I shudder to think about a situation where a DDoS attack could prevent me
> from reconfiguring a firewall, or just delay it. And please note that even
> if we accept the delay, it assumes that we are willing to face reimplementing
> a TCP-like protocol over netlink to guarantee reliable message passing.

I don't think a DoS attack could prevent you from using netlink any
more than it could prevent you from using setsockopt (netlink is even
more efficient due to less data being copied around). Reliability is not
hard to achieve on the userspace->kernel path if you don't mind eating
one RTT for each rule update (which is not very large): just send the
update and wait for an ACK or an error, then handle it. Having more than
one message in flight introduces another problem besides congestion
control and reliable transmission: the resulting ruleset in the kernel
might depend on the order in which messages are received, so if messages
are dropped in the middle and simply retransmitted, it will be wrong.
This should probably not be done.
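
A small simulation of the one-update-at-a-time scheme, in Python. LossyKernel
is a stand-in invented for the example, and the per-update sequence number
used to ignore retransmitted duplicates is an assumption of the sketch. With
at most one message in flight, lost requests or lost ACKs cost retries but
never reorder the ruleset:

```python
import random


class LossyKernel:
    """Stand-in for the kernel end: may lose a request or its ACK."""

    def __init__(self, loss, seed=0):
        self.rng = random.Random(seed)
        self.loss = loss
        self.ruleset = []
        self.last_seq = 0

    def deliver(self, seq, rule):
        if self.rng.random() < self.loss:
            return None                  # request lost, nothing applied
        if seq == self.last_seq + 1:     # apply each update exactly once
            self.ruleset.append(rule)
            self.last_seq = seq
        if self.rng.random() < self.loss:
            return None                  # ACK lost, sender will retransmit
        return seq                       # ACK (also for a duplicate)


def send_updates(kernel, rules, max_tries=50):
    """Stop-and-wait: the next update is sent only after the previous ACK."""
    for seq, rule in enumerate(rules, start=1):
        for _ in range(max_tries):
            if kernel.deliver(seq, rule) == seq:
                break
        else:
            raise TimeoutError(f"no ACK for update {seq}")


kernel = LossyKernel(loss=0.3, seed=1)
rules = [f"rule-{i}" for i in range(20)]
send_updates(kernel, rules)
print(kernel.ruleset == rules)
```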

> Yes. And I'd add one more principle:
> 
> + All parts must be designed to take into account rule- and
>   (match/target) state-replication between firewalls in active-active
>   setups.

Agreed, that would be nice to have (and not very hard).


* Re: new ABI
  2006-08-23  5:01     ` Patrick McHardy
@ 2006-08-23 13:48       ` Joakim Axelsson
  2006-08-24  9:20         ` Jozsef Kadlecsik
  2006-08-24  8:50       ` Jozsef Kadlecsik
  1 sibling, 1 reply; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-23 13:48 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Massimiliano Hofer, netfilter-devel, Jozsef Kadlecsik

2006-08-23 07:01:50+0200, Patrick McHardy <kaber@trash.net> ->
> Jozsef Kadlecsik wrote:
> > My views on how to communicate are a little bit old-fashioned, as I'm
> > perfectly satisfied with sockopt. Yes, dusty. And old. Maybe not even cool,
> > but a hack. But reliable, which absolutely cannot be said about netlink.
> > I shudder to think about a situation where a DDoS attack could prevent me
> > from reconfiguring a firewall, or just delay it. And please note that even
> > if we accept the delay, it assumes that we are willing to face reimplementing
> > a TCP-like protocol over netlink to guarantee reliable message passing.
> 
> I don't think a DoS attack could prevent you from using netlink any
> more than it could from setsockopt (netlink is even more efficient due
> to less data copied around). Reliability is not hard to achieve on the
> userspace->kernel path if you don't mind eating one RTT for each rule
> update (which is not very large), just send the update and wait for
> an ACK or an error, then handle it. Having more than one message in
> flight introduces another problem besides congestion control and
> reliable transmission, the resulting ruleset in the kernel might
> depend on the order in which messages are received, if messages are
> dropped in the middle and are simply retransmitted it will be wrong.
> This should probably not be done.
> 

I have good experience with this. The solution I have found is:
1. Cut the cable (or disable the port in a switch) to free up enough CPU if
you can't get anything at all done in userspace.
2. Use tcpdump to figure out the attack vector.
3. Then insert an iptables rule dropping the attack.

In any case there are several other paths that are more critical than netlink
and need userspace CPU.

Also, another problem with set/getsockopt() is that when you want to retrieve
something and don't know its size you have a very hard time allocating
the memory to pass with getsockopt() for the kernel to write into. Think of
modules like recent, where the state data can be of very different length
from time to time. Even asking first how much memory is needed and then
providing it does not help, as the state can change between the two calls.
It's easier with netlink to just let it send packets until a packet
arrives that says it's the last one.
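
The "read until the last one" pattern, sketched in Python. kernel_dump is a
stand-in generator and the DATA/DONE tags are made up for the example;
netlink itself ends a multipart reply with an NLMSG_DONE message. The reader
never needs to know the total size in advance:

```python
def kernel_dump(state, chunk=3):
    """Stand-in for a multipart reply: data parts, then a final DONE."""
    for i in range(0, len(state), chunk):
        yield ("DATA", state[i:i + chunk])
    yield ("DONE", None)


def read_dump(messages):
    """Collect entries until the terminating message arrives."""
    entries = []
    for kind, payload in messages:
        if kind == "DONE":
            return entries
        entries.extend(payload)
    raise ConnectionError("dump ended without a DONE message")


recent_list = [f"10.0.0.{i}" for i in range(1, 8)]   # size unknown to the reader
print(read_dump(kernel_dump(recent_list)))
```

With getsockopt() the caller would have to guess a buffer size first, and the
guess can already be stale by the time the second call is made.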

> > Yes. And I'd add one more principle:
> > 
> > + All parts must be designed to take into account rule- and
> >   (match/target) state-replication between firewalls in active-active
> >   setups.
> 
> Agreed, that would be nice to have (and not very hard).

+ Agreed here as well :-)

--
Joakim Axelsson


* Re: new ABI
  2006-08-15 22:57   ` Massimiliano Hofer
  2006-08-18 14:14     ` Simon Lodal
  2006-08-18 14:50     ` Amin Azez
@ 2006-08-23 18:06     ` Sven Anders
  2006-08-23 21:19       ` Massimiliano Hofer
  2 siblings, 1 reply; 31+ messages in thread
From: Sven Anders @ 2006-08-23 18:06 UTC (permalink / raw)
  To: Massimiliano Hofer, netfilter-devel


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Massimiliano Hofer wrote:
> On Tuesday 15 August 2006 2:14 pm, Simon Lodal wrote:
> 
>> Everybody has a long wishlist and seem to agree that something fundamental
>> needs to be done.
>>
>> The question seems to be when backwards compatibility can be given up.
> 
> Everyone agrees that we have reached the maximum expressiveness with the 
> current system.
> Nobody says that we couldn't keep a way to convert old rules in the new 
> system.
> The real question thus becomes: is it worth restarting from (almost) scratch?

In my personal opinion it's time for a new API.
During the implementation of my program, I ran into many problems which could
only be solved cleanly by a new API. It would make the implementation of other
user-space programs (besides iptables) much easier.

> Either way we'll need some form of rule and match id.
> I don't know what level of transactionality is desired. Currently 
> iptables-restore is atomic and so are single changes with iptables. How much 
> is needed with the new system? At least rule level atomicity is certainly 
> desired, so we'll need to create duplicate data (just the core structure with 
> pointers to the real descriptors) during modifications.

I would love to have unique rule ids! 8-)

If you implement a new API, you could support the following too:

 - boolean logic between matches
   Example:
     rule 2 { src-ip 1.2.3.4/24 and protocol TCP and
              ( port 21 or port 23 or port 25 ) accept }

 - multiple targets
   Example:
     rule 3 { protocol TCP and port 22 ulog { prefix "SSH Access" } accept }

   I think this could be done with small changes to the current netfilter core
   too, but it would be better to do it in a new framework. You only have to
   distinguish between VERDICT and NON-VERDICT targets.

  - Get the counters of [single] rules (and reset them) without completely
    setting the whole firewall once again.

  - A NOT for all matches

This would also make some matches obsolete (multiport for instance).

Regards
 Sven

- --
 Sven Anders <anders@anduras.de>                 () Ascii Ribbon Campaign
                                                 /\ Support plain text e-mail
 ANDURAS service solutions AG
 Innstraße 71 - 94036 Passau - Germany
 Web: www.anduras.de - Tel: +49 (0)851-4 90 50-0 - Fax: +49 (0)851-4 90 50-55

Rechtsform: Aktiengesellschaft - Sitz: Passau - Amtsgericht Passau HRB 6032
Mitglieder des Vorstands: Sven Anders, Marcus Junker, Michael Schön
Vorsitzender des Aufsichtsrats: Dipl. Kfm. Thomas Träger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE7Jkf5lKZ7Feg4EcRAiAbAKCQZe9QqcOPsDqA5QUWXaag15DGawCfbk72
rLC2Ayk9H9w66juw3HQrf2A=
=5A4W
-----END PGP SIGNATURE-----


* Re: new ABI
  2006-08-22  8:46   ` Jozsef Kadlecsik
  2006-08-23  5:01     ` Patrick McHardy
@ 2006-08-23 21:13     ` Massimiliano Hofer
  2006-08-24 10:15       ` Jozsef Kadlecsik
  1 sibling, 1 reply; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-23 21:13 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Tuesday 22 August 2006 10:46 am, Jozsef Kadlecsik wrote:

> I think this is the surface for the users, i.e. how to express the rules in
> userspace. I strongly believe Harald has got it absolutely right that
> what we need is a plugin architecture and library which makes it possible
> to write interfaces using a commandline style or XML, or to write a

XML could already be used. It would be easier to parse for other programs (a 
GUI viewer/generator), but little else.
He was proposing a richer language, too. I'm not sure a complete boolean 
expression would be easier to manage for a GUI, although it would be easier 
to write for sysadmins.

> What at least equally important is the kernel-userspace communication. In
> what form should the data be passed? Using what kind of kernel-userspace
> communication method?
>
> For the former I believe the (extended) attribute subsytem stolen from
> nfnetlink is probably the most flexible and still the simplest solution.
> Of course it must be morphed and extended a little bit, for example by
> adding "bunch/array of foo type" types to pass huge number of same type of
> data more efficiently, or to gracefully handle unknown attribute types
> (intended for another match/target or new unknown attribute for a given
> match/target). Let's forget about passing binary blobs with magical
> pointers/offsets back and forth!

> > + init, destory, dump/restore, change, match/target is needed as
> > implemented functions (or Nulls) for the match / targets.
>
> For the kernel space, yes. In userspace we need a little bit more: option
> (argument) list with allowed types and hooks for the parsers
> (cmdline/XML/whatever), even probably embedded help.

I agree.

> > + Use RCU-list in kernel. Because its more editable.
>
> I won't give up the hope to integrate nf-hipac once. In other words, the
> RCU-list must be one possible way to list the rules and other methods
> must be supported as well. Let us not fall in the trap to hardcode this
> part of the system.

The least radical proposals call for a separation between kernel metadata 
(potentially fixed size) and match structures (variable, but known, size 
parameters and custom data).
The more radical (with respect to the necessary data structures) call for 
complete expression evaluation. This would require trees instead of lists or 
arrays.
Whatever we do, we certainly can keep the metadata distinct from the match 
data and make it inaccessible to modules (not to mention userspace).

For example we could have:
struct match_instance {
   struct match_id id;
   struct match *match_type;
   void *match_parameters;
   void *instance_data;
   ... /* counters, spinlocks for editing single matches, etc. */
};

The real structure could be kept in a separate tree. During modifications we 
could create a complete version of the new tree with pointers to 
match_instances and then swap it with the old one atomically.
We could use some simple management system in order to know when a 
match_instance is no longer needed.
For example we could embed it in a structure that keeps a reference counter or 
a list and adjust them as needed. Match modules won't need to know what 
and how we do it and we may start with a linear data structure (similar to 
the one we use today) and switch to a tree without people noticing.
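
A minimal userspace sketch of the reference counting and the atomic swap 
described above. All names (ruleset, instance_new, ruleset_build, 
ruleset_release, ruleset_swap) are hypothetical and the structures are 
simplified; real kernel code would also have to wait for readers (for 
example with RCU) before releasing the old ruleset, which is ignored here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified structures; not the real netfilter ones. */
struct match_instance {
    atomic_int refcnt;        /* number of ruleset slots pointing here */
    void *match_parameters;   /* user supplied parameters */
    void *instance_data;      /* kernel-only housekeeping data */
};

struct ruleset {
    size_t n;
    struct match_instance **inst;  /* linear for now, could become a tree */
};

/* The ruleset currently used for matching (NULL until the first swap). */
static _Atomic(struct ruleset *) active_ruleset;

static struct match_instance *instance_new(void)
{
    struct match_instance *mi = calloc(1, sizeof(*mi));
    assert(mi != NULL);
    atomic_init(&mi->refcnt, 0);
    return mi;
}

/* Build a complete new ruleset that references existing instances. */
static struct ruleset *ruleset_build(struct match_instance **inst, size_t n)
{
    struct ruleset *rs = malloc(sizeof(*rs));
    assert(rs != NULL);
    rs->n = n;
    rs->inst = malloc(n * sizeof(*inst));
    assert(rs->inst != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        rs->inst[i] = inst[i];
        atomic_fetch_add(&inst[i]->refcnt, 1);
    }
    return rs;
}

/* Drop a ruleset and free every instance that is no longer referenced.
 * Returns the number of instances freed. */
static size_t ruleset_release(struct ruleset *rs)
{
    size_t freed = 0;
    for (size_t i = 0; i < rs->n; i++) {
        if (atomic_fetch_sub(&rs->inst[i]->refcnt, 1) == 1) {
            free(rs->inst[i]);
            freed++;
        }
    }
    free(rs->inst);
    free(rs);
    return freed;
}

/* Publish the new ruleset in one atomic step and return the old one. */
static struct ruleset *ruleset_swap(struct ruleset *new_rs)
{
    return atomic_exchange(&active_ruleset, new_rs);
}
```

An instance shared by the old and the new ruleset survives the swap, while 
one referenced only by the old ruleset is freed when that ruleset is released.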

The current way to express rules is similar to Horn clauses (logic programmers 
please forgive me, I know, we have negations, targets that don't terminate, 
side effects and we lack recursive rules).
Other people propose boolean expressions.

I think most people are more comfortable with boolean expressions, although 
they are more complex to parse and manage. How much do people want them?
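
To give an idea of what evaluating a boolean expression per packet could 
look like, here is a small sketch. The node layout, the toy packet and the 
two match functions are assumptions made up for illustration, not a proposed 
kernel structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy packet, hypothetical: only the fields the example matches on. */
struct toy_packet { int proto; int dport; };

enum expr_op { EXPR_MATCH, EXPR_NOT, EXPR_AND, EXPR_OR };

struct expr {
    enum expr_op op;
    bool (*match)(const struct toy_packet *); /* used by EXPR_MATCH only */
    const struct expr *left;                  /* EXPR_NOT uses left only */
    const struct expr *right;
};

/* Recursive evaluation; terminates because the tree is finite and acyclic. */
static bool expr_eval(const struct expr *e, const struct toy_packet *p)
{
    assert(e != NULL);
    switch (e->op) {
    case EXPR_MATCH: return e->match(p);
    case EXPR_NOT:   return !expr_eval(e->left, p);
    case EXPR_AND:   return expr_eval(e->left, p) && expr_eval(e->right, p);
    case EXPR_OR:    return expr_eval(e->left, p) || expr_eval(e->right, p);
    }
    return false;
}

/* Two illustrative leaf matches. */
static bool is_tcp(const struct toy_packet *p) { return p->proto == 6; }
static bool is_ssh(const struct toy_packet *p) { return p->dport == 22; }
```

With such a tree a NOT is just one more node type, so no match has to 
implement its own inversion.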

> > + Allow customs jump modules, beside match and target modules.
>
> That's cool but sanity (and rule-integrity) must of course be preserved.

We want to ensure termination (well I certainly want to avoid a loop in the 
rules). If we had boolean expressions and non-recursive expression chains 
(that we could use as pre-built sub-expressions in other rules) we could be 
extremely expressive without gotos.

> > + Allow grouping of rules in some way. Really large firewall needs this.
>
> Sub-tables can probably help. It seems to me non-trivial how not to loose
> efficiency, though. Multiple toplevel tables (i.e active/passive,
> active/backup) should also be considered.

Is this the same thing I'm calling expression chains? We could have an 
expression table with user defined chains that system chains can use.
I don't think it would be inefficient. Just jump to a chain and see what it 
returns. If we want to optimize it we could cache results so that people can 
use sub-chains several times (in different combinations) without worrying 
about it.

> + All parts must be designed to take into account rule- and
>   (match/target) state-replication between firewalls in active-active
>   setups.

What level of runtime synchronization do you need? We have to allow some delay 
between 2 firewalls.
If we implement generic module data tables (with event notification to 
userspace), this wouldn't be a kernel problem anymore.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-23 18:06     ` Sven Anders
@ 2006-08-23 21:19       ` Massimiliano Hofer
  2006-08-24  7:57         ` Sven Anders
  0 siblings, 1 reply; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-23 21:19 UTC (permalink / raw)
  To: netfilter-devel

On Wednesday 23 August 2006 8:06 pm, Sven Anders wrote:

> > The real question thus becomes: is it worh to restart from (almost)
> > scratch?
>
> In my personal opinion it's time for a new API.
> During the implementation of my program, I run into many problems which
> could only be solved clearly by a new API. It would make the implementation
> of other user-space programs (beside iptables) much easier.

Do you mean ABI?

> I would love to have unique rule ids! 8-)

Would a number be sufficient, or do you think a user supplied string would be 
much more useful? Of course the kernel will assign default ids to id-less 
rules.

>    I think this could be done with little changes on the current netfilter
> core too, but it would be better to do it in a new framework. You only have
> to distinguish between VERIDICT and NON-VERDICT targets.

The current data structures will be completely wiped away. This isn't a small 
change and will need a lot of testing.

>   - A NOT for all matches

If we implement boolean expressions a NOT won't be the least bit difficult.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-23 21:19       ` Massimiliano Hofer
@ 2006-08-24  7:57         ` Sven Anders
  0 siblings, 0 replies; 31+ messages in thread
From: Sven Anders @ 2006-08-24  7:57 UTC (permalink / raw)
  To: Massimiliano Hofer, netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 2400 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Massimiliano Hofer schrieb:
> On Wednesday 23 August 2006 8:06 pm, Sven Anders wrote:
> 
>>> The real question thus becomes: is it worh to restart from (almost)
>>> scratch?
>> In my personal opinion it's time for a new API.
>> During the implementation of my program, I run into many problems which
>> could only be solved clearly by a new API. It would make the implementation
>> of other user-space programs (beside iptables) much easier.
> 
> Do you mean ABI?

Oops, yes :-)

>> I would love to have unique rule ids! 8-)
> 
> Would a number be sufficient, or do you think a user supplied string would be 
> much more useful? Of course the kernel will assign default ids to id-less 
> rules.

In my current application I use the 'comment' "match" to assign a unique ID to
each of my rules. Each ID is a plain hex number.
I think a plain number would be sufficient, but a string may be more
useful for an end-user. If it will still be possible to attach a comment to a
rule, I recommend an integer (easier for the kernel to handle and compare, and
it uses less memory).

>>    I think this could be done with little changes on the current netfilter
>> core too, but it would be better to do it in a new framework. You only have
>> to distinguish between VERIDICT and NON-VERDICT targets.
>
> The current data structures will be completely wiped away. This isn't a little 
> change and will need a lot of testing.

Yes, but I only wanted to make clear that this change could also be made
in the current structure. Nevertheless I vote for a new ABI.

Regards
 Sven

- --
 Sven Anders <anders@anduras.de>                 () Ascii Ribbon Campaign
                                                 /\ Support plain text e-mail
 ANDURAS service solutions AG
 Innstraße 71 - 94036 Passau - Germany
 Web: www.anduras.de - Tel: +49 (0)851-4 90 50-0 - Fax: +49 (0)851-4 90 50-55

Rechtsform: Aktiengesellschaft - Sitz: Passau - Amtsgericht Passau HRB 6032
Mitglieder des Vorstands: Sven Anders, Marcus Junker, Michael Schön
Vorsitzender des Aufsichtsrats: Dipl. Kfm. Thomas Träger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE7Vvb5lKZ7Feg4EcRAsdIAKChgn1cuOsd+5I8o3gUkHQc7IxBNQCeIvA7
3pWkDFUA68MAhhzqK8SwC/U=
=xqQQ
-----END PGP SIGNATURE-----


* Re: new ABI
  2006-08-23  5:01     ` Patrick McHardy
  2006-08-23 13:48       ` Joakim Axelsson
@ 2006-08-24  8:50       ` Jozsef Kadlecsik
  2006-08-24 10:58         ` Massimiliano Hofer
  2006-08-24 16:47         ` Patrick McHardy
  1 sibling, 2 replies; 31+ messages in thread
From: Jozsef Kadlecsik @ 2006-08-24  8:50 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Massimiliano Hofer, netfilter-devel

Hi,

On Wed, 23 Aug 2006, Patrick McHardy wrote:

> Jozsef Kadlecsik wrote:
> > My views on how to communicate is a little bit old-fashioned as I'm
> > perfectly satisfied with sockopt. Yes, dusty. And old. Maybe even not cool
> > but a hack. But reliable, which cannot absolutely be said about netlink.
> > I shudder to think about the situation when a DDoS attack could prevent or
> > just delay me to reconfigure a firewall. And please note even if we accept
> > delaying it assumes that we are willing to face reimplementing a TCP-alike
> > protocol over netlink to guarantee reliable message passing.
>
> I don't think a DoS attack could prevent you from using netlink any
> more than it could from setsockopt (netlink is even more efficient due
> to less data copied around).

What makes me uneasy is that we want to control the network traffic (do
firewalling) over a protocol which is managed by the network core itself.
Somehow it hurts compartmentalization.

> Reliability is not hard to achieve on the userspace->kernel path if you
> don't mind eating one RTT for each rule update (which is not very
> large), just send the update and wait for an ACK or an error, then
> handle it.

And how should the error be handled? Re-send the same request immediately
and thus hammering the system? Or should we use a linear/exponential
backoff? :-) What if it's a RESTORE operation and we are in the middle of
adding thousands of rules to the kernel. Should we then implement
transactions and all successfully added rules be removed in the case of a
fatal failure? (Allowing multiple alternate tables can solve it of
course.)

> Having more than one message in flight introduces another problem
> besides congestion control and reliable transmission, the resulting
> ruleset in the kernel might depend on the order in which messages are
> received, if messages are dropped in the middle and are simply
> retransmitted it will be wrong. This should probably not be done.

Yes, to handle that would be halfway toward TCP :-)

> > Yes. And I'd add one more priciple:
> >
> > + All parts must be designed to take into account rule- and
> >   (match/target) state-replication between firewalls in active-active
> >   setups.
>
> Agreed, that would be nice to have (and not very hard).

+ Match/target versioning support, i.e. be able to specify
  version-dependent features. That might require OR operation
  support at some level in one rule:

# Handle two versions of the same module:
-m foo --foo-version 2 <flags 2> --OR --foo-version 1 <flags 1>
# Handle new or possibly missing module
-m bar --bar-version 1 <flags 1> --OR --false|true|fatal-error|ignore-rule

An interesting question is how to handle such rules in the kernel, taking
into account that we want the 'SAVE' operation to work correctly.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: new ABI
  2006-08-23 13:48       ` Joakim Axelsson
@ 2006-08-24  9:20         ` Jozsef Kadlecsik
  2006-08-24 13:48           ` Joakim Axelsson
  0 siblings, 1 reply; 31+ messages in thread
From: Jozsef Kadlecsik @ 2006-08-24  9:20 UTC (permalink / raw)
  To: Joakim Axelsson; +Cc: Massimiliano Hofer, netfilter-devel, Patrick McHardy

Hi,

On Wed, 23 Aug 2006, Joakim Axelsson wrote:

> 2006-08-23 07:01:50+0200, Patrick McHardy <kaber@trash.net> ->
> > Jozsef Kadlecsik wrote:
> > > My views on how to communicate is a little bit old-fashioned as I'm
> > > perfectly satisfied with sockopt. Yes, dusty. And old. Maybe even not cool
> > > but a hack. But reliable, which cannot absolutely be said about netlink.
> > > I shudder to think about the situation when a DDoS attack could prevent or
> > > just delay me to reconfigure a firewall. And please note even if we accept
> > > delaying it assumes that we are willing to face reimplementing a TCP-alike
> > > protocol over netlink to guarantee reliable message passing.
>
> I have good experience with this. The solution i have found is:
> 1. Cut the cable (or disable the port in a switch) to get enough CPU over if
> you can't get anything at all done in userspace.
> 2. You need tcpdump to figure out the attack-vector.
> 3. Then insert an iptables rule droping the attack.

That's good practice: cut off the source of the big trouble, find a way
to handle it and then enable and watch the show. But usually one needs to
let the attack go on for some time to analyze the patterns. Or tune some
rules online, get the counters etc. So it's not always doable.

> In any case there are several more paths which are more critical and needs
> userspace cpu than netlink.
>
> Also, another problem with set/getsockopt() is that when you want to retrive
> something and don't know the size of it you have a very hard time allocating
> the memory to pass with getsockopt() for the kernel to write in. Think
> moduels like recent where the state data can be of very different length
> from time to time. Even having things what ask first how much memory is
> needed and then provides it as the state can change between the two calls.

Yes, I faced the same problems when working on ipset ;-). But it was
easy for me, as some good guys writing ippool found workarounds ;-)).

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: new ABI
  2006-08-23 21:13     ` Massimiliano Hofer
@ 2006-08-24 10:15       ` Jozsef Kadlecsik
  2006-09-04 22:26         ` Massimiliano Hofer
  0 siblings, 1 reply; 31+ messages in thread
From: Jozsef Kadlecsik @ 2006-08-24 10:15 UTC (permalink / raw)
  To: Massimiliano Hofer; +Cc: netfilter-devel

Hi,

On Wed, 23 Aug 2006, Massimiliano Hofer wrote:

> On Tuesday 22 August 2006 10:46 am, Jozsef Kadlecsik wrote:
>
> > I think this is the surface for the users i.e how to express the rules in
> > the userspace. I strongly believe Harald has got absolutely right that
> > what we need is a plugin architecture and library which makes possible to
> > write interfaces using a commandline style or XML, or to write a
>
> XML could already be used. It would be easier to parse for other programs (a
> GUI viewer/generator), but little else.
> He was proposing a richer language, too. I'm not sure a complete boolean
> expression would be easier to manage for a GUI, althought it would be easier
> to write for sysadmins.

That can be handled by the data structure passed to/from the kernel: when
one plugin (say the GUI) constructed the data from its input, another
plugin (cmdline/XML) can translate it back to its own format.

> > > + Use RCU-list in kernel. Because its more editable.
> >
> > I won't give up the hope to integrate nf-hipac once. In other words, the
> > RCU-list must be one possible way to list the rules and other methods
> > must be supported as well. Let us not fall in the trap to hardcode this
> > part of the system.
>
> The least radical proposals call for a separation between kernel metadata
> (potentially fixed size) and match structures (variable, but known, size
> parameters and custom data).
> The more radical (with respect to the necessary data structures) call for
> complete expression evaluation. This would require trees instead of lists or
> arrays.

I think the current system is pushed to its limits and it is time for
radical changes. Non-radical change proposals can of course explore the
current problems and point to directions toward the proper solutions, so
such efforts are not wasted.

> > > + Allow grouping of rules in some way. Really large firewall needs this.
> >
> > Sub-tables can probably help. It seems to me non-trivial how not to loose
> > efficiency, though. Multiple toplevel tables (i.e active/passive,
> > active/backup) should also be considered.
>
> Is this the same thing I'm calling expression chains? We could have an
> expression table with user defined chains that system chains can use.
> I don't think it would be inefficient. Just jump to a chain and see what it
> returns. If we want to optimize it we could cache results so that people can
> use sub-chains several times (in different combinations) withou worrying
> about it.

Could you show an example?

Sub-tables, as I call them, are complete tables of the given type (filter,
mangle, etc):

# Define new, empty filter type of table foo:
-t filter --new-table foo
# Fill it up with rules, define default policies, etc
-t foo -A ...
...

# Then you could chain filter type of tables one after another
# using the default as an always existing one:
-t filter --prepend-table foo
-t filter --append-table bar
# Now we'd have foo, default, bar as filter tables to be evaluated

Or we could swap the default filter table with another one
(internally at restore or by user input when a table is pre-constructed
in case of emergency, etc.)

But what I propose as sub-table is definitely not scalable.

> > + All parts must be designed to take into account rule- and
> >   (match/target) state-replication between firewalls in active-active
> >   setups.
>
> What level of runtime synchronization do you need? We have to allow some delay
> between 2 firewalls.

It depends on the active-active setup and the modules in question. If the
TCP/UDP/etc. streams can pass any of the firewalls then strong
synchronization is required. If the streams always pass the same firewall
in both directions (except after failover), then loose synchronization
can be enough. recent/set modules probably always require strong sync.

> If we implement generic module data tables (with event notification to
> userspace), this wouldn't be a kernel problem anymore.

Yes, we just need such an infrastructure to back the module writers.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: new ABI
  2006-08-24  8:50       ` Jozsef Kadlecsik
@ 2006-08-24 10:58         ` Massimiliano Hofer
  2006-08-24 11:22           ` Jozsef Kadlecsik
  2006-08-24 16:47         ` Patrick McHardy
  1 sibling, 1 reply; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-24 10:58 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Patrick McHardy, Jozsef Kadlecsik

On Thursday 24 August 2006 10:50 am, Jozsef Kadlecsik wrote:

> + Match/target versioning support, i.e. be able to specify
>   version-dependent features. That might require OR operation
>   support at some level in one rule:
>
> # Handle two versions of the same module:
> -m foo --foo-version 2 <flags 2> --OR --foo-version 1 <flags 1>
> # Handle new or possibly missing module
> -m bar --bar-version 1 <flags 1> --OR --false|true|fatal-error|ignore-rule
>
> Interesting question is how to handle such rules in the kernel, taking
> into account that we want the 'SAVE' operation to work correctly.

This could be a userspace problem. The kernel could disclose what matches are 
registered (given that the corresponding module is loaded) and what versions 
are supported.
Choosing different paths based on some fixed parameter like the current 
version isn't something we should do for every packet in transit.
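
A rough sketch of how userspace could make that choice once, at rule 
insertion time. It assumes the kernel can hand back a list of supported 
revisions for a match; pick_revision and both lists are hypothetical names, 
not an existing interface.

```c
#include <assert.h>
#include <stddef.h>

/* Return the highest revision present in both lists, or -1 if the kernel
 * and the userspace utility have no revision in common. */
static int pick_revision(const int *kernel_revs, size_t nk,
                         const int *user_revs, size_t nu)
{
    int best = -1;

    assert(kernel_revs != NULL || nk == 0);
    assert(user_revs != NULL || nu == 0);
    for (size_t i = 0; i < nk; i++)
        for (size_t j = 0; j < nu; j++)
            if (kernel_revs[i] == user_revs[j] && kernel_revs[i] > best)
                best = kernel_revs[i];
    return best;
}
```

The decision is taken before the rule reaches the kernel, so nothing 
version related has to be evaluated per packet.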

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-24 10:58         ` Massimiliano Hofer
@ 2006-08-24 11:22           ` Jozsef Kadlecsik
  2006-08-24 13:13             ` Massimiliano Hofer
  0 siblings, 1 reply; 31+ messages in thread
From: Jozsef Kadlecsik @ 2006-08-24 11:22 UTC (permalink / raw)
  To: Massimiliano Hofer; +Cc: netfilter-devel, Patrick McHardy

On Thu, 24 Aug 2006, Massimiliano Hofer wrote:

> > # Handle two versions of the same module:
> > -m foo --foo-version 2 <flags 2> --OR --foo-version 1 <flags 1>
> > # Handle new or possibly missing module
> > -m bar --bar-version 1 <flags 1> --OR --false|true|fatal-error|ignore-rule
> >
> > Interesting question is how to handle such rules in the kernel, taking
> > into account that we want the 'SAVE' operation to work correctly.
>
> This could be a userspace problem. The kernel could disclose what matches are
> registered (given that the corresponding module is loaded) and what versions
> are supported.
> Choosing different paths based on some fixed parameter like the current
> version isn't something we should do for every packet in transit.

I completely agree with your last sentence. But what I wanted to say is
that when one issues such commands and then enters 'pkttables/nf --save'
to get the actual ruleset from the kernel, one expects exactly the same
rule returned, without missing parts. Even if the kernel cannot interpret
and thus ignores some parts of the command at packet matching.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: new ABI
  2006-08-24 11:22           ` Jozsef Kadlecsik
@ 2006-08-24 13:13             ` Massimiliano Hofer
  0 siblings, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-08-24 13:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Patrick McHardy, Jozsef Kadlecsik

On Thursday 24 August 2006 1:22 pm, Jozsef Kadlecsik wrote:

> I completely agree with your last sentence. But what I wanted to say is
> that when one issues such commands and then enters 'pkttables/nf --save'
> to get the actual ruleset from the kernel, one expects the exactly same
> rule returned, without missing parts. Even if the kernel cannot interpret
> and thus ignores some parts of the command at packet matching.

This is a broader consistency problem. With the current system iptables asks 
specifically for the match version that is supported by userspace. If a new 
(additional) version is implemented in the kernel, it won't be used.
If you change the userspace utility, it's up to the utility itself to 
support/convert the old save format.

A good userspace library framework will make this easier, but I think the 
current kernel version system is fine.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


* Re: new ABI
  2006-08-24  9:20         ` Jozsef Kadlecsik
@ 2006-08-24 13:48           ` Joakim Axelsson
  0 siblings, 0 replies; 31+ messages in thread
From: Joakim Axelsson @ 2006-08-24 13:48 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Massimiliano Hofer, netfilter-devel, Patrick McHardy

2006-08-24 11:20:55+0200, Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> ->
> Hi,
> 
> On Wed, 23 Aug 2006, Joakim Axelsson wrote:
> 
> > 2006-08-23 07:01:50+0200, Patrick McHardy <kaber@trash.net> ->
> > > Jozsef Kadlecsik wrote:
> > > > My views on how to communicate is a little bit old-fashioned as I'm
> > > > perfectly satisfied with sockopt. Yes, dusty. And old. Maybe even not cool
> > > > but a hack. But reliable, which cannot absolutely be said about netlink.
> > > > I shudder to think about the situation when a DDoS attack could prevent or
> > > > just delay me to reconfigure a firewall. And please note even if we accept
> > > > delaying it assumes that we are willing to face reimplementing a TCP-alike
> > > > protocol over netlink to guarantee reliable message passing.
> >
> > I have good experience with this. The solution i have found is:
> > 1. Cut the cable (or disable the port in a switch) to get enough CPU over if
> > you can't get anything at all done in userspace.
> > 2. You need tcpdump to figure out the attack-vector.
> > 3. Then insert an iptables rule droping the attack.
> 
> That's a good pratcice: cut off the source of the big trouble, find a way
> to handle it and then enable and watch the show. But usually one need to
> let the attack go on some time to analyze the patterns. Or tune some
> rules online, get the counters etc. So it's not always doable.
> 

Or use heavy limiters in your firewall so it can't overload :-) That's what
I do. But sometimes even those fail as I have missed something.

> > In any case there are several more paths which are more critical and needs
> > userspace cpu than netlink.
> >
> > Also, another problem with set/getsockopt() is that when you want to retrive
> > something and don't know the size of it you have a very hard time allocating
> > the memory to pass with getsockopt() for the kernel to write in. Think
> > moduels like recent where the state data can be of very different length
> > from time to time. Even having things what ask first how much memory is
> > needed and then provides it as the state can change between the two calls.
> 
> Yes, the same problems I faced with at working on ipset ;-). But it was
> easy for me, as some good guys writing ippool found workarounds ;-)).
> 

Just to point out to the rest of the readers: it was I who wrote that piece of
horrible code. It first tried to figure out the amount of memory
needed by asking the kernel, and the kernel asked the kernel module to guess
the needed memory. Then userspace allocated this and sent a second getsockopt()
with some extra memory allocated in order to cover any state changes that
need more memory between the two sockopt() calls. Even so this could fail,
so I added a loop of 5 tries, each time increasing the additional memory by
a power of 2, finally failing with a memory error to the user if we still
couldn't get all the info for a state from the kernel. This was a terrible
solution :-) Netlink solves that as it is packet based and doesn't need memory
from userspace to write in.
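
For readers who have not seen it, the pattern looks roughly like the sketch 
below. It is not the original ippool code: the kernel side is replaced by 
two made-up stand-ins (fake_query_size, fake_fetch) whose state grows 
between calls, and the slack values are arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for kernel state whose size changes over time (hypothetical). */
static size_t fake_state_size = 100;

/* First call: ask how much memory the state needs right now. */
static size_t fake_query_size(void)
{
    return fake_state_size;
}

/* Second call: copy the state into buf. The state grows before the copy to
 * mimic a change between the two calls; fails if buf is too small. */
static int fake_fetch(void *buf, size_t len)
{
    fake_state_size += 150;
    if (len < fake_state_size)
        return -ENOSPC;
    memset(buf, 0xab, fake_state_size);
    return (int)fake_state_size;
}

/* Query the size, allocate it plus some slack, and retry up to 5 times,
 * doubling the slack on every attempt. On success *out owns the buffer and
 * the number of bytes written is returned. */
static int fetch_state(void **out)
{
    size_t slack = 64;

    assert(out != NULL);
    for (int attempt = 0; attempt < 5; attempt++, slack *= 2) {
        size_t len = fake_query_size() + slack;
        void *buf = malloc(len);
        if (buf == NULL)
            return -ENOMEM;
        int got = fake_fetch(buf, len);
        if (got >= 0) {
            *out = buf;
            return got;
        }
        free(buf);   /* too small: the state grew, try again */
    }
    return -ENOMEM;  /* give up and report a memory error to the user */
}
```

Even with the retries there is no guarantee of success, which is the whole 
problem with sizing a userspace buffer for state that keeps changing.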

Either we use netlink with or without some sort of ACKing, possibly with
some sort of ordering as well, or we write ourselves a new ABI for this. Using
sockopt() is terribly wrong for this.

--
Joakim Axelsson


* Re: new ABI
  2006-08-24  8:50       ` Jozsef Kadlecsik
  2006-08-24 10:58         ` Massimiliano Hofer
@ 2006-08-24 16:47         ` Patrick McHardy
  1 sibling, 0 replies; 31+ messages in thread
From: Patrick McHardy @ 2006-08-24 16:47 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Massimiliano Hofer, netfilter-devel

Jozsef Kadlecsik wrote:
> What makes me uneasy is that we want to control the network traffic (do
> firewalling) over a protocol which is managed by the network core itself.
> Somehow it hurts compartmentalization.

I don't really worry about that.

>>Reliability is not hard to achieve on the userspace->kernel path if you
>>don't mind eating one RTT for each rule update (which is not very
>>large), just send the update and wait for an ACK or an error, then
>>handle it.
> 
> 
> And how should the error be handled? Re-send the same request immediately
> and thus hammering the system? Or should we use a linear/exponential
> backoff? :-)

That depends on the error. For an error indicating receive buffer
overflow (which can only happen with multiple concurrent senders
on the userspace->kernel path) you can usually try to retransmit
immediately. As long as messages are queued one of the other
senders will stay in the kernel working off the netlink queue
(without being able to add any messages during that time),
so it shouldn't last long. Alternatively you can simply use blocking
IO. In the end there's no difference to setsockopt, you have
a maximum throughput and once you've hit it you run into problems.

> What if it's a RESTORE operation and we are in the middle of
> adding thousands of rules to the kernel. Should we then implement
> transactions and all successfully added rules be removed in the case of a
> fatal failure? (Allowing multiple alternate tables can solve it of
> course.)

We could do that. That's the price of a non-atomic interface (on
the ruleset level). And I think we've learned what the price of
an atomic one is :)
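
A toy sketch of what such a transaction could look like from the restoring 
side: add rules one by one and undo the ones already added if any of them 
fails. The rule table and the rule_add/rule_del_last helpers are invented 
for the example and stand in for per-rule messages to the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RULES 8

/* Toy rule table holding rule ids (hypothetical stand-in for the kernel). */
static int table[MAX_RULES];
static size_t table_len;

/* Add one rule; rejects negative ids and a full table. */
static int rule_add(int id)
{
    if (id < 0 || table_len == MAX_RULES)
        return -1;
    table[table_len++] = id;
    return 0;
}

/* Remove the most recently added rule. */
static void rule_del_last(void)
{
    assert(table_len > 0);
    table_len--;
}

/* Add all rules or none: on the first failure, remove every rule this
 * call has added so far and report the error. */
static int restore(const int *ids, size_t n)
{
    size_t added = 0;

    for (; added < n; added++)
        if (rule_add(ids[added]) != 0)
            break;
    if (added == n)
        return 0;
    while (added-- > 0)
        rule_del_last();
    return -1;
}
```

Between the first add and the rollback the partially restored ruleset is 
visible, which is exactly the non-atomic window being discussed.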

>>>Yes. And I'd add one more priciple:
>>>
>>>+ All parts must be designed to take into account rule- and
>>>  (match/target) state-replication between firewalls in active-active
>>>  setups.
>>
>>Agreed, that would be nice to have (and not very hard).
> 
> 
> + Match/target versioning support, i.e. be able to specify
>   version-dependent features. That might require OR operation
>   support at some level in one rule:
> 
> # Handle two versions of the same module:
> -m foo --foo-version 2 <flags 2> --OR --foo-version 1 <flags 1>
> # Handle new or possibly missing module
> -m bar --bar-version 1 <flags 1> --OR --false|true|fatal-error|ignore-rule

That's more of a userspace thing. Modules will always have to be
backwards compatible, so if new attributes are not understood,
use the old ones if the user is fine with it.


* Re: new ABI
  2006-08-24 10:15       ` Jozsef Kadlecsik
@ 2006-09-04 22:26         ` Massimiliano Hofer
  0 siblings, 0 replies; 31+ messages in thread
From: Massimiliano Hofer @ 2006-09-04 22:26 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Jozsef Kadlecsik

On Thursday 24 August 2006 12:15 pm, Jozsef Kadlecsik wrote:

> I think the current system is pushed to its limits and it is time for
> radical changes. Non-radical change proposals can of course explore the
> current problems and point to directions toward the proper solutions, so
> such efforts are not wasted.

I've been thinking of a few things we can do in an incremental way, but the 
reality is that the most important features require a drastically different 
interaction with userspace and one of the primary goals is to reduce the 
number of user space visible changes. This does not apply to development 
versions of course. There are lots of small incremental steps we can do 
in "developer mode".

> > > > + Allow grouping of rules in some way. Really large firewall needs
> > > > this.
> > >
> > > Sub-tables can probably help. It seems to me non-trivial how not to
> > > lose efficiency, though. Multiple toplevel tables (i.e active/passive,
> > > active/backup) should also be considered.
> >
> > Is this the same thing I'm calling expression chains? We could have an
> > expression table with user defined chains that system chains can use.
> > I don't think it would be inefficient. Just jump to a chain and see what
> > it returns. If we want to optimize it we could cache results so that
> > people can use sub-chains several times (in different combinations)
> > without worrying about it.
>
> Could you show an example?

iptables -t macro -N trustedclient
iptables -t macro -A trustedclient ... -j TRUE
...
iptables -t macro -A trustedclient ... -j FALSE

iptables -t macro -N trustedserver
...

iptables -I FORWARD --macro trustedclient --macro trustedserver -j ACCEPT

Of course, the real usefulness comes when you use multiple combinations.
Another use could be to have a single macro in FORWARD and in nat 
POSTROUTING. This poses the problem of what matches to allow in macro chains. 
The maximum common set? Ignore non applicable ones? Separate macros for every 
context?

We could cache the result of a macro the first time we run into it (for a 
single packet) and avoid traversing the chain when we reuse it.
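
A minimal sketch of that caching idea in plain userspace C, just to make it 
concrete. Everything here is hypothetical (struct packet, struct macro, 
macro_match() and the two sample predicates are illustration only, not 
existing netfilter code): each macro chain is reduced to one predicate, and 
a small per-packet array remembers whether a macro has already been 
evaluated for the current packet.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MACROS 32 /* hypothetical upper bound on macro chains */

enum macro_state { MACRO_UNKNOWN = 0, MACRO_FALSE, MACRO_TRUE };

/* Hypothetical stand-in for a packet: only what the example needs. */
struct packet {
    unsigned int src_port;
    unsigned int dst_port;
};

/* A whole macro chain, reduced to a single predicate for illustration. */
typedef bool (*macro_fn)(const struct packet *pkt);

struct macro {
    macro_fn eval;
    unsigned long traversals; /* counts real chain walks */
};

/* Per-packet cache; must be reset before each new packet. */
struct macro_cache {
    enum macro_state state[MAX_MACROS];
};

static void macro_cache_reset(struct macro_cache *c)
{
    memset(c->state, 0, sizeof(c->state));
}

/* Walk the macro chain at most once per packet, then reuse the verdict. */
static bool macro_match(struct macro *tbl, size_t id,
                        const struct packet *pkt, struct macro_cache *c)
{
    if (id >= MAX_MACROS)
        return false;
    if (c->state[id] == MACRO_UNKNOWN) {
        tbl[id].traversals++;
        c->state[id] = tbl[id].eval(pkt) ? MACRO_TRUE : MACRO_FALSE;
    }
    return c->state[id] == MACRO_TRUE;
}

/* Sample predicates standing in for the trustedclient/trustedserver chains. */
static bool trusted_client(const struct packet *pkt)
{
    return pkt->src_port >= 1024;
}

static bool trusted_server(const struct packet *pkt)
{
    return pkt->dst_port == 443;
}
```

A rule using --macro trustedclient twice (or two rules sharing it) would call 
macro_match() both times, but only the first call walks the chain. The cache 
is per packet, so matches with side effects inside a macro chain would run 
only once per packet, which may or may not be what the user expects.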

I have lots of firewalls where I use a script to generate combinations of 
rules. This feature would simplify all of them.

> Sub-tables, as I call them, are complete tables of the given type (filter,
> mangle, etc):
>
> # Define new, empty filter type of table foo:
> -t filter --new-table foo
> # Fill it up with rules, define default policies, etc
> -t foo -A ...
> ...
>
> # Then you could chain filter type of tables one after another
> # using the default as an always existing one:
> -t filter --prepend-table foo
> -t filter --append-table bar
> # Now we'd have foo, default, bar as filter tables to be evaluated

Similar goal, different style.
Why not just:
-t filter --evaluate-table foo

This way you could mix it with other rules and allow for more flexible mixing 
of more than 2 sub-tables.
Would you forbid {prepend|append|evaluate}-table in sub-tables in order to 
avoid infinite recursion or would you check for cyclicity at load time or run 
time?
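
For the load-time option, a rough sketch of what the check could look like, 
assuming table-to-table references are collected into a small adjacency 
list before the ruleset is committed. The names (table_ref_graph, add_ref(), 
tables_have_cycle()) and the fixed limits are made up for the example. It is 
an ordinary depth-first search with three visit states: reaching a table 
that is still "in progress" means the references loop back on themselves.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TABLES 16 /* hypothetical limits, for the sketch only */
#define MAX_REFS   8

/* refs[t][i] is the i-th table that table t evaluates. */
struct table_ref_graph {
    size_t ntables;
    size_t nrefs[MAX_TABLES];
    size_t refs[MAX_TABLES][MAX_REFS];
};

enum visit { UNVISITED = 0, IN_PROGRESS, DONE };

static void graph_init(struct table_ref_graph *g, size_t ntables)
{
    g->ntables = ntables < MAX_TABLES ? ntables : MAX_TABLES;
    for (size_t t = 0; t < MAX_TABLES; t++)
        g->nrefs[t] = 0;
}

/* Record "table `from` evaluates table `to`"; false if out of range. */
static bool add_ref(struct table_ref_graph *g, size_t from, size_t to)
{
    if (from >= g->ntables || to >= g->ntables ||
        g->nrefs[from] >= MAX_REFS)
        return false;
    g->refs[from][g->nrefs[from]++] = to;
    return true;
}

static bool dfs_has_cycle(const struct table_ref_graph *g, size_t t,
                          enum visit *mark)
{
    mark[t] = IN_PROGRESS;
    for (size_t i = 0; i < g->nrefs[t]; i++) {
        size_t next = g->refs[t][i];

        if (mark[next] == IN_PROGRESS)
            return true; /* back edge: evaluation would recurse forever */
        if (mark[next] == UNVISITED && dfs_has_cycle(g, next, mark))
            return true;
    }
    mark[t] = DONE;
    return false;
}

/* Returns true if any chain of table references leads back to itself. */
static bool tables_have_cycle(const struct table_ref_graph *g)
{
    enum visit mark[MAX_TABLES] = { UNVISITED };

    for (size_t t = 0; t < g->ntables; t++)
        if (mark[t] == UNVISITED && dfs_has_cycle(g, t, mark))
            return true;
    return false;
}
```

With a check like this the ruleset would simply be rejected at load time, so 
the packet path would not need a recursion counter. The cost is linear in 
the number of tables plus references.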

> > What level of runtime synchronization do you need? We have to allow some
> > delay between 2 firewalls.
>
> It depends on the active-active setup and the modules in question. If the
> TCP/UDP/etc. streams can pass any of the firewalls then strong
> syncronization is required. If the streams always pass the same firewall
> in both directions (except after failover), then loose syncronization
> can be enough. recent/set modules probably always require strong sync.

This is feasible only with ip_conntrack_tcp_be_liberal. I don't think you will 
ever keep a completely up to date mirror state down to the last sequence 
number unless you use some special and really fast means of communication.
Synchronous remote communication in the middle of packet processing is out of 
the question.
Do you have something specific in mind?
If you settle for loose synchronization we can certainly have a way to send 
kernel generated change events to userspace and then you could send them to 
the other members of the cluster.
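
To illustrate the loose variant, here is a small userspace C sketch of a 
non-blocking event queue. It is an assumption about how such events could be 
buffered, not a description of any existing kernel interface: state_event, 
event_queue and the evq_* functions are invented for the example, and a real 
implementation would need a proper kernel-to-userspace channel plus locking. 
The point it shows is that the producer (packet path) never waits: when the 
consumer lags, the event is dropped, and the gap in sequence numbers tells 
the receiver it must resynchronize.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EVQ_SIZE 4 /* tiny on purpose so overflow is easy to trigger */

/* Hypothetical change event emitted by a match/target module. */
struct state_event {
    uint64_t seq;       /* increases by one per generated event */
    uint32_t module_id; /* which module changed state */
    uint32_t key;       /* module-defined entry identifier */
    int64_t  value;     /* new value of that entry */
};

struct event_queue {
    struct state_event slot[EVQ_SIZE];
    size_t   head;     /* index of the oldest queued event */
    size_t   count;    /* number of queued events */
    uint64_t next_seq; /* sequence number for the next event */
    uint64_t dropped;  /* events lost because the queue was full */
};

static void evq_init(struct event_queue *q)
{
    q->head = 0;
    q->count = 0;
    q->next_seq = 0;
    q->dropped = 0;
}

/*
 * Producer side: never blocks. A sequence number is consumed even when
 * the event is dropped, so the consumer can detect the loss as a gap.
 */
static bool evq_push(struct event_queue *q, uint32_t module_id,
                     uint32_t key, int64_t value)
{
    uint64_t seq = q->next_seq++;

    if (q->count == EVQ_SIZE) {
        q->dropped++;
        return false;
    }
    size_t tail = (q->head + q->count) % EVQ_SIZE;
    q->slot[tail] = (struct state_event){ seq, module_id, key, value };
    q->count++;
    return true;
}

/* Consumer side: the daemon that forwards events to cluster peers. */
static bool evq_pop(struct event_queue *q, struct state_event *out)
{
    if (q->count == 0)
        return false;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % EVQ_SIZE;
    q->count--;
    return true;
}
```

A forwarding daemon would loop on evq_pop() and send each event to the other 
firewalls. If it sees a jump in seq, it knows some updates were lost and can 
ask for a full dump of that module's state instead.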

> > If we implement generic module data tables (with event notification to
> > userspace), this wouldn't be a kernel problem anymore.
>
> Yes, we just need such an infrastructure to back the module writers.

This is one of the few things we could add without breaking backward 
compatibility.

-- 
Saluti,
   Massimiliano Hofer
        Nucleus


Thread overview: 31+ messages
2006-08-14 21:12 new ABI Massimiliano Hofer
2006-08-15  0:00 ` Joakim Axelsson
2006-08-15  8:39   ` Amin Azez
2006-08-15 22:08   ` Massimiliano Hofer
2006-08-15 12:14 ` Simon Lodal
2006-08-15 22:57   ` Massimiliano Hofer
2006-08-18 14:14     ` Simon Lodal
2006-08-18 21:40       ` Massimiliano Hofer
2006-08-18 14:50     ` Amin Azez
2006-08-23 18:06     ` Sven Anders
2006-08-23 21:19       ` Massimiliano Hofer
2006-08-24  7:57         ` Sven Anders
2006-08-16 12:16 ` Joakim Axelsson
2006-08-16 12:29   ` Joakim Axelsson
2006-08-16 14:40   ` Joakim Axelsson
2006-08-18 13:06   ` Simon Lodal
2006-08-18 21:40     ` Massimiliano Hofer
2006-08-18 22:24   ` Massimiliano Hofer
2006-08-22  8:46   ` Jozsef Kadlecsik
2006-08-23  5:01     ` Patrick McHardy
2006-08-23 13:48       ` Joakim Axelsson
2006-08-24  9:20         ` Jozsef Kadlecsik
2006-08-24 13:48           ` Joakim Axelsson
2006-08-24  8:50       ` Jozsef Kadlecsik
2006-08-24 10:58         ` Massimiliano Hofer
2006-08-24 11:22           ` Jozsef Kadlecsik
2006-08-24 13:13             ` Massimiliano Hofer
2006-08-24 16:47         ` Patrick McHardy
2006-08-23 21:13     ` Massimiliano Hofer
2006-08-24 10:15       ` Jozsef Kadlecsik
2006-09-04 22:26         ` Massimiliano Hofer
