Xtables2 Netlink spec

* Xtables2 Netlink spec
@ 2010-11-24 22:29 Jan Engelhardt
  2010-11-25 11:42 ` Pablo Neira Ayuso
  2010-11-26 19:01 ` Jozsef Kadlecsik
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-24 22:29 UTC
  To: Netfilter Developer Mailing List, netfilter; +Cc: pablo

By request of Pablo, I am posting the Xtables2 Netlink interface 
specification for review. Additionally, further documentation and 
a toolchain around it are available through the temporary project page at

	http://jengelh.medozas.de/projects/xtables/

which currently includes

 * User Documentation Chapter 1: Architectural Differences

 * Developer Documentation Part 1: Netlink interface (WIP)
   This is copied below to facilitate inline replies

 * Runnable Linux source tree

 * Runnable userspace library (libnetfilter_xtables)
   with small test-and-debug program

--8<--

Netlink interface

1 General use

1.1 Socket

Xtables2 is usable through a Netlink socket of type 
NETLINK_XTABLES. No intermediate subsystem like nfnetlink is 
used, because the kernel's nfnetlink parser does not make all 
attributes available to (in-kernel) nfnetlink users.

#include <sys/socket.h>
#include <linux/netlink.h>
#define NETLINK_XTABLES 21

nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETLINK_XTABLES);

The NETLINK_XTABLES constant is defined in linux/netlink.h with 
the value 21.

1.2 Message format

All messages transmitted over the Netlink socket are to have the 
base struct nlmsghdr header, followed by a version tag to allow 
for the flexibility of data following it:

struct xtnetlink_genhdr {
        uint32_t version;
};

The version member is always 0 in the current implementation.

Following the genhdr can be any number of standard Netlink 
attributes (struct nlattr plus their payload).
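
A minimal sketch of the layout described above, for illustration only: it writes the nlmsghdr followed by the genhdr into a buffer and returns the offset at which the first attribute would start. The helper name nfxt_put_header is hypothetical, and the message type is left to the caller because this document does not assign numeric values to the NFXTM_* types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

struct xtnetlink_genhdr {
	uint32_t version;
};

/*
 * Hypothetical helper: write nlmsghdr + genhdr at the start of buf.
 * Returns the number of bytes used (the offset of the first
 * attribute), or 0 if buf is too small.
 */
static size_t nfxt_put_header(void *buf, size_t size, uint16_t type,
			      uint32_t seq)
{
	struct nlmsghdr nlh;
	struct xtnetlink_genhdr gh = { .version = 0 };
	size_t used = NLMSG_HDRLEN + NLMSG_ALIGN(sizeof(gh));

	if (size < used)
		return 0;
	memset(&nlh, 0, sizeof(nlh));
	nlh.nlmsg_len   = used;	/* must grow as attributes are appended */
	nlh.nlmsg_type  = type;
	nlh.nlmsg_flags = NLM_F_REQUEST;
	nlh.nlmsg_seq   = seq;
	memset(buf, 0, used);
	memcpy(buf, &nlh, sizeof(nlh));
	memcpy((char *)buf + NLMSG_HDRLEN, &gh, sizeof(gh));
	return used;
}
```

With a sufficiently large buffer the function returns 20: 16 bytes of nlmsghdr plus the 4-byte genhdr.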

Often, a logical tree structure is used to describe something, 
such as for example tables of chains of rules:

filter
 \__ INPUT
 |    \__ some rule
 \__ FORWARD
 |    \__ rule2
 |    \__ rule3
 |    \__ rule4
 \__ OUTPUT
      \__ rule5
      \__ rule6

For this document, child objects are always “nested” within a 
parent object, irrespective of the serialized encoding.

There are different ways to encode such a tree structure into a 
serialized stream. In many Netlink protocols, children attributes 
are encapsulated (a. k. a. “nested”, though we will avoid this 
term to avoid double-use) and treated as a whole as a parent's 
opaque data. We will call this format “Encapsulated Encoding”.

To encode an attribute's length, struct nlattr only has a 16-bit 
field, which means the attribute header plus payload is limited 
to 64 KB. This is easily exceeded with the encapsulated 
encoding, because a chain attribute would have to hold all the 
rules of that chain, for example. The problem is aggravated by the 
kernel's Netlink handler only allocating skbs a page size worth, 
which in the worst case means that the usable payload for 
attributes is around 3600 bytes only. In light of xt_u32's private 
data block being 1984 bytes already, that means that you won't be 
able to fit two -m u32 invocations nested in a single rule into a 
dump.

The Xtables2 Netlink protocol however encodes each node as a 
standalone attribute, to be called Flat Encoding, that is 
appended (a. k. a. “chained”) to the data stream. This makes it 
possible to split requests and dumps at a finer level than 
encapsulation would. Above all, it guarantees extensions a 
minimum size for their data blocks.

Since Netlink messages do have a 32-bit quantity to store the 
message length, rulesets of roughly up to 4 GB are possible, 
which is currently regarded as sufficient. The largest 
meaningful rulesets seen to date in the industry weighed in at 
approximately 150 MB.

Whereas attribute nesting automatically provided for boundaries, 
this is realized using a dummy attribute in the chained approach. 
Certain attributes can start such a flattened nesting, and 
NFXTA_STOP terminates it.
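
To illustrate how a receiver could recover the boundaries in Flat Encoding, here is a sketch that walks a chained attribute stream and checks that every nest-opening attribute is matched by an NFXTA_STOP. The function name nfxt_check_nesting is hypothetical, and the set of opening types is passed in by the caller rather than hard-coded.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

#define NFXTA_STOP 1	/* value from the attribute table */

/*
 * Walk a flat attribute stream and check that each nest-opening
 * attribute is closed by an NFXTA_STOP. Returns 0 if the nesting is
 * balanced, -1 if it is not or if an attribute is truncated.
 */
static int nfxt_check_nesting(const void *buf, size_t len,
			      const uint16_t *openers, size_t n_openers)
{
	const unsigned char *p = buf;
	size_t off = 0, step, i;
	int depth = 0;

	while (len - off >= (size_t)NLA_HDRLEN) {
		struct nlattr nla;

		memcpy(&nla, p + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > len - off)
			return -1;	/* truncated or corrupt attribute */
		if (nla.nla_type == NFXTA_STOP) {
			if (--depth < 0)
				return -1;	/* STOP without an opener */
		} else {
			for (i = 0; i < n_openers; ++i)
				if (nla.nla_type == openers[i])
					++depth;
		}
		/* The last attribute may lack trailing padding. */
		step = NLA_ALIGN(nla.nla_len);
		off += step < len - off ? step : len - off;
	}
	return depth == 0 ? 0 : -1;
}
```

A stream consisting of NFXTA_CHAIN followed by NFXTA_STOP passes the check; a lone NFXTA_CHAIN, or a surplus NFXTA_STOP, does not.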

2 Attributes

The meaning of attributes depends upon the nesting level in which 
they appear. Their type however remains the same, such that a 
single Netlink attribute validation policy object (struct 
nla_policy) is sufficient.

A table of all known attributes:

+--------+-----------------+---------------+----------------+
| Value  | Mnemonic        |    C type     | NLA type       |
+--------+-----------------+---------------+----------------+
+--------+-----------------+---------------+----------------+
|   1    | NFXTA_STOP      |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|   2    | NFXTA_ERRNO     |     int       | NLA_U32        |
+--------+-----------------+---------------+----------------+
|   3    | NFXTA_NAME      |   char []     | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
|   4    | NFXTA_CHAIN     |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|   5    | NFXTA_HOOKNUM   | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
|   6    | NFXTA_PRIORITY  |     int       | NLA_U32        |
+--------+-----------------+---------------+----------------+
|   7    | NFXTA_NFPROTO   |   uint8_t     | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_RULE      |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_OFFSET    | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_LENGTH    |    size_t     | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_VERDICT   | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_MATCH     |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_DATA      |               | NLA_BINARY     |
+--------+-----------------+---------------+----------------+
|        | NFXTA_TARGET    |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_JUMP      |   char []     | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
|        | NFXTA_GOTO      |   char []     | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
|        | NFXTA_REVISION  |   uint8_t     | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_SIZE      |    size_t     | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_HOOKMASK  | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+

The kernel ignores attributes with value 0 during validation, so 
it was left unused.

2.1 Nest level terminator<sub:nfxta_stop>

This attribute serves to denote the end of a nesting level as 
introduced by NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH or 
NFXTA_TARGET. It has no data portion.

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.2 Dump error code<sub:nfxta_errno>

Once a NLM_F_MULTI dump operation has been started, for example 
with the NFXTM_CHAIN_DUMP request, Netlink kernel users must 
always end it successfully with NLMSG_DONE. To convey an error 
during the dump, Xtables2 will emit a NFXTA_ERRNO attribute into 
the stream (if it can), emit no further attributes for the 
request, and cause the dump to stop.

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_ERRNO        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| int errno;                                                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.3 Match extension<sub:nfxta_match>

Invocation of a match is represented using the NFXTA_MATCH 
attribute which starts a nest level. A match attribute must 
contain two attributes:

• NFXTA_NAME: the name of the match extension

• NFXTA_DATA: data private to this instance of the extension

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_MATCH        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. name of the extension, e.g. "hashlimit"                       .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. e.g. struct xt_hashlimit_info                                 .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
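
The diagram above could be produced by an encoder along the following lines. This is a sketch under assumptions: the helper names are hypothetical, and the numeric values for NFXTA_MATCH and NFXTA_DATA are placeholders because the attribute table does not list them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* 1 and 3 come from the attribute table; 100 and 101 are
 * placeholders, since the table leaves those values blank. */
#define NFXTA_STOP   1
#define NFXTA_NAME   3
#define NFXTA_MATCH  100	/* hypothetical value */
#define NFXTA_DATA   101	/* hypothetical value */

/* Append one attribute at offset off. Returns the new offset, or 0
 * if it does not fit in the buffer or in the 16-bit nla_len. */
static size_t nfxt_put_attr(unsigned char *buf, size_t size, size_t off,
			    uint16_t type, const void *data, size_t len)
{
	struct nlattr nla;
	size_t total;

	if (len > (size_t)(UINT16_MAX - NLA_HDRLEN))
		return 0;
	total = NLA_ALIGN(NLA_HDRLEN + len);
	if (off > size || total > size - off)
		return 0;
	nla.nla_len  = NLA_HDRLEN + len;
	nla.nla_type = type;
	memset(buf + off, 0, total);	/* also zeroes the padding */
	memcpy(buf + off, &nla, sizeof(nla));
	if (len > 0)
		memcpy(buf + off + NLA_HDRLEN, data, len);
	return off + total;
}

/* Emit MATCH, NAME, DATA, STOP back to back (Flat Encoding). */
static size_t nfxt_put_match(unsigned char *buf, size_t size, size_t off,
			     const char *name, const void *data, size_t dlen)
{
	off = nfxt_put_attr(buf, size, off, NFXTA_MATCH, NULL, 0);
	if (off != 0)
		off = nfxt_put_attr(buf, size, off, NFXTA_NAME, name,
				    strlen(name) + 1);
	if (off != 0)
		off = nfxt_put_attr(buf, size, off, NFXTA_DATA, data, dlen);
	if (off != 0)
		off = nfxt_put_attr(buf, size, off, NFXTA_STOP, NULL, 0);
	return off;
}
```

For the name "hashlimit" and an 8-byte data block this yields 36 bytes: 4 for NFXTA_MATCH, 16 for the padded name, 12 for the data and 4 for NFXTA_STOP.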

2.4 Target extension<sub:nfxta_target>

Invocation of a target is represented using the NFXTA_TARGET 
attribute which starts a nest level. A target attribute must 
contain two attributes:

• NFXTA_NAME: the name of the target extension

• NFXTA_DATA: data private to this instance of the extension

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_TARGET       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. name of the extension, e.g. "TCPMSS"                          .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. e.g. struct xt_tcpmss_info                                    .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.5 Rule<sub:nfxta_rule>

A rule is started using the NFXTA_RULE attribute, which starts a 
nest level, and is ended with an NFXTA_STOP attribute. Rules can 
contain:

• Zero or more match extensions (NFXTA_MATCH..NFXTA_STOP).

• Zero or more target extensions (NFXTA_TARGET..NFXTA_STOP).

• Zero or one NFXTA_VERDICT attribute that specifies the rule's 
  verdict as data, which can either be NF_ACCEPT or NF_DROP. 
  (Non-normative notes: The supplied verdict is executed if no 
  target has reached a verdict on its own. Omission of the 
  verdict attribute counts as XT_CONTINUE.)

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_RULE         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. matches, targets, verdict                                     .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.6 Chain<sub:nfxta_chain>

A chain is started using the NFXTA_CHAIN attribute, which starts 
a nest level, and is ended with an NFXTA_STOP attribute. Chains 
can contain:

• Zero or one of this group of three (= specify all three, or 
  none at all), specifying that this chain is a base chain 
  hooking in at some point:

  – One NFXTA_HOOKNUM attribute for giving a hook number. This is 
    (unfortunately) dependent on the chosen nfproto, so it is 
    either NF_INET_*, NF_BR_* or NF_ARP_*.

  – One NFXTA_PRIORITY attribute.

  – One NFXTA_NFPROTO attribute that is NFPROTO_*.

• Zero or more rules (NFXTA_RULE..NFXTA_STOP).

Example of a fully populated chain:

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_CHAIN        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_HOOKNUM      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| hook number (0..7)                                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_PRIORITY     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| priority (-2147483648..2147483647)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_NFPROTO      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nfproto value (2=ipv4, 3=arp, 7=bridge, 10=ipv6, 12=decnet)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. rules                                                         .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
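
As a sketch of how the example above could be serialized, the following encodes a base chain without rules. The helper names are hypothetical, and storing the 32-bit payloads in host byte order is an assumption that this document does not spell out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Attribute values as listed in the attribute table. */
enum {
	NFXTA_STOP     = 1,
	NFXTA_CHAIN    = 4,
	NFXTA_HOOKNUM  = 5,
	NFXTA_PRIORITY = 6,
	NFXTA_NFPROTO  = 7,
};

/* Write a header-only (flag) attribute; returns bytes written. */
static size_t put_flag(unsigned char *p, uint16_t type)
{
	struct nlattr nla = { .nla_len = NLA_HDRLEN, .nla_type = type };

	memcpy(p, &nla, sizeof(nla));
	return NLA_HDRLEN;
}

/* Write an attribute with a 32-bit payload (host byte order assumed). */
static size_t put_u32(unsigned char *p, uint16_t type, uint32_t value)
{
	struct nlattr nla = { .nla_len = NLA_HDRLEN + sizeof(value),
			      .nla_type = type };

	memcpy(p, &nla, sizeof(nla));
	memcpy(p + NLA_HDRLEN, &value, sizeof(value));
	return NLA_HDRLEN + sizeof(value);
}

/* Encode an empty base chain; buf must hold at least 32 bytes. */
static size_t nfxt_put_base_chain(unsigned char *buf, uint32_t hooknum,
				  int32_t priority, uint32_t nfproto)
{
	size_t off = 0;

	off += put_flag(buf + off, NFXTA_CHAIN);
	off += put_u32(buf + off, NFXTA_HOOKNUM, hooknum);
	off += put_u32(buf + off, NFXTA_PRIORITY, (uint32_t)priority);
	off += put_u32(buf + off, NFXTA_NFPROTO, nfproto);
	/* rules (NFXTA_RULE..NFXTA_STOP) would be appended here */
	off += put_flag(buf + off, NFXTA_STOP);
	return off;
}
```

The result is 32 bytes long, matching the diagram with the rules part left empty: two 4-byte flag attributes and three 8-byte integer attributes.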

3 Message types

3.1 NFXTM_IDENTIFY: Identification

First and foremost a debug command. And to get something 
(table/chain-independent) that users can glare at (they love 
doing that).

Request:

• nlmsg_type = NFXTM_IDENTIFY;

Response:

• An NFXTA_NAME attribute contains the name and version 
  of the implementation/patchset.

• Zero or more attributes of type NFXTA_MATCH, terminated by 
  NFXTA_STOP, giving meta information about the loaded match 
  extensions. Per available match, a group of three attributes 
  follows:

  – One NFXTA_NAME attribute for the name of the extension

  – One NFXTA_REVISION attribute to denote the version of the 
    extension's parameter protocol

  – One NFXTA_SIZE attribute for the size of its per-instance 
    data block

• Zero or more attributes of type NFXTA_TARGET, terminated by 
  NFXTA_STOP, giving meta information about the loaded and 
  available target extensions:

  – same attributes as with NFXTA_MATCH above

3.2 NFXTM_CHAIN_NEW: Create new chain

Request:

• nlmsg_type = NFXTM_CHAIN_NEW;

• NFXTA_NAME attribute carrying the name of the new chain.

• Zero or one of this group of three:

  – NFXTA_HOOKNUM

  – NFXTA_PRIORITY

  – NFXTA_NFPROTO

Response:

• Standard ACK.

Remarks:

Right now, a chain can only be promoted to a base chain during 
creation (as far as the userspace view goes; when the kernel 
exactly installs the nf_hook_ops is not of concern to userspace), 
and it can only be demoted by deleting it. Should a 
NFXTM_CHAIN_PROMOTE be split off the NFXTM_CHAIN_NEW 
functionality?

3.3 NFXTM_CHAIN_DEL: Delete a chain

Request:

• nlmsg_type = NFXTM_CHAIN_DEL;

• NFXTA_NAME attribute carrying the name of the chain to delete

Response:

• Standard ACK.

3.4 NFXTM_CHAIN_MOVE: Rename a chain

Request:

• nlmsg_type = NFXTM_CHAIN_MOVE;

• Two NFXTA_NAME attributes (order is important):

  – First one specifies the current name of the chain

  – Second one specifies the new name of the chain

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_len = at least 24                                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_type = NFXTM_CHAIN_MOVE | nlmsg_flags = NLM_F_REQUEST   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_seq = whatever                                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_pid = whatever                                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = at least 4          | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. old name                                                      .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = at least 4          | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. new name                                                      .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.5 NFXTM_CHAIN_DUMP: Chain dump

Request:

• nlmsg_type = NFXTM_CHAIN_DUMP;

• NFXTA_NAME attribute specifying the name of the chain to dump

Response:

• Zero or one of this group of three:

  – NFXTA_HOOKNUM, NFXTA_PRIORITY, NFXTA_NFPROTO.

• Zero or more NFXTA_RULE attributes as per section 2.5.

Errors:

• If an error occurs during dump, an NFXTA_ERRNO attribute is 
  emitted into the stream and the dump will immediately terminate 
  with a standard NLMSG_DONE message. No NFXTA_STOP attributes 
  will be emitted if the dump stopped in the middle of a nesting 
  level.
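
A receiver therefore has to look for NFXTA_ERRNO while walking the dump and must not rely on balanced NFXTA_STOP attributes afterwards. The following sketch scans the attribute payload of one dump message; the function name nfxt_dump_errno and its return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

#define NFXTA_ERRNO 2	/* value from the attribute table */

/*
 * Scan the attributes of one dump message. Returns 1 and stores the
 * error code in *err if an NFXTA_ERRNO attribute is found, 0 if none
 * is present, and -1 if the stream itself is malformed.
 */
static int nfxt_dump_errno(const void *buf, size_t len, int32_t *err)
{
	const unsigned char *p = buf;
	size_t off = 0, step;

	while (len - off >= (size_t)NLA_HDRLEN) {
		struct nlattr nla;

		memcpy(&nla, p + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > len - off)
			return -1;
		if (nla.nla_type == NFXTA_ERRNO) {
			if (nla.nla_len != NLA_HDRLEN + sizeof(*err))
				return -1;
			memcpy(err, p + off + NLA_HDRLEN, sizeof(*err));
			return 1;	/* nothing meaningful follows */
		}
		/* The last attribute may lack trailing padding. */
		step = NLA_ALIGN(nla.nla_len);
		off += step < len - off ? step : len - off;
	}
	return 0;
}
```

On a return value of 1 the caller would discard any partially received chain or rule and simply wait for NLMSG_DONE.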

3.6 NFXTM_TABLE_DUMP: Table dump

Returns an atomic snapshot of the table.

Request:

• nlmsg_type = NFXTM_TABLE_DUMP;

Response:

• Zero or more NFXTA_CHAIN attributes as described in 
  section 2.6.

3.7 NFXTM_CHAIN_SPLICE: Add/delete rules

The NFXTM_CHAIN_SPLICE request does a bulk deletion of zero or 
more consecutive rules, followed by a bulk insertion of zero or 
more consecutive rules, all done in an atomic fashion. It 
operates similarly to Perl's splice function on arrays. The request 
message needs to have at least the first three attributes.

Request:

• NFXTA_NAME: Name of the chain to modify.

• NFXTA_OFFSET: Index of entry where operation should start.

• NFXTA_LENGTH: Number of entries starting from offset that 
  should be removed. May be zero or more.

• Zero or more NFXTA_RULE as per section 2.5.

Response:

• Standard ACK.

• Desired: detailed error code and origin of error (result of 
  running ->check in extensions)
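
The splice semantics of NFXTM_CHAIN_SPLICE can be modelled on a plain array. The following sketch is only such a model: it uses integers in place of rules, does not address atomicity, and treats an out-of-range offset or length as an error, which is an assumption about behaviour this document does not define. The name splice_rules is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Remove del_len entries starting at offset, then insert ins_len
 * entries from ins at the same position. rules has room for cap
 * entries, of which count are in use. Returns the new entry count,
 * or (size_t)-1 if the request is out of range or exceeds cap.
 */
static size_t splice_rules(int *rules, size_t count, size_t cap,
			   size_t offset, size_t del_len,
			   const int *ins, size_t ins_len)
{
	size_t tail;

	if (offset > count || del_len > count - offset)
		return (size_t)-1;
	if (count - del_len > cap || ins_len > cap - (count - del_len))
		return (size_t)-1;
	tail = count - offset - del_len;
	/* Move the surviving tail to its final position first. */
	memmove(rules + offset + ins_len, rules + offset + del_len,
		tail * sizeof(*rules));
	if (ins_len > 0)
		memcpy(rules + offset, ins, ins_len * sizeof(*rules));
	return count - del_len + ins_len;
}
```

With offset 1 and length 2, splicing the entries 9, 8, 7 into 1, 2, 3, 4, 5 gives 1, 9, 8, 7, 4, 5. A length of zero turns the call into a pure insertion, and an empty insertion list into a pure deletion.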

3.8 NFXTM_TABLE_REPLACE

Atomic exchange of an entire table.

Request:

• nlmsg_type = NFXTM_TABLE_REPLACE;

• Zero or more NFXTA_CHAIN attributes as per section 2.6.

Response:

• Standard ACK.

• Desired: detailed error code and origin of error (result of 
  running ->check in extensions)

* Re: Xtables2 Netlink spec
  2010-11-24 22:29 Xtables2 Netlink spec Jan Engelhardt
@ 2010-11-25 11:42 ` Pablo Neira Ayuso
  2010-11-25 13:35   ` Jan Engelhardt
  2010-11-26 19:01 ` Jozsef Kadlecsik
  1 sibling, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-25 11:42 UTC
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

Hi Jan,

I have trimmed netfilter from the CC; I don't think this deserves 
the attention of users, at least not yet.

Some quick impressions on your proposal:

On 24/11/10 23:29, Jan Engelhardt wrote:
>
> By request of Pablo, I am posting the Xtables2 Netlink interface
> specification for review. Additionally, further documentation and
> toolchain around it is available through the temporary project page at
>
> 	http://jengelh.medozas.de/projects/xtables/
>
> which currently includes
>
>   * User Documentation Chapter 1: Architectural Differences
>
>   * Developer Documentation Part 1: Netlink interface (WIP)
>     This is copied below to facilitate inline replies
>
>   * Runnable Linux source tree
>
>   * Runnable userspace library (libnetfilter_xtables)
>     with small test-and-debug program
>
> --8<--
>
> Netlink interface
>
> 1 General use
>
> 1.1 Socket
>
> Xtables2 is usable through a Netlink socket of type
> NETLINK_XTABLES. No intermediate subsystem like nfnetlink is
> used, because the kernel's nfnetlink parser does not make all
> attributes available to (in-kernel) nfnetlink users.
>
> #include<sys/socket.h>
> #include<linux/netlink.h>
> #define NETLINK_XTABLES 21
>
> nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETLINK_XTABLES);
>
> The NETLINK_XTABLES constant is defined in linux/netlink.h with
> the value 21.

This has to go on top of nfnetlink, like the other netfilter subsystems.

> 1.2 Message format
>
> All messages transmitted over the Netlink socket are to have the
> base struct nlmsghdr header, followed by a version tag to allow
> for the flexibility of data following it:
>
> struct xtnetlink_genhdr {
>          uint32_t version;
> };
>
> The version member is always 0 in the current implementation.
>
> Following the genhdr can be any number of standard Netlink
> attributes (struct nlattr plus their payload).
>
> Often, a logical tree structure is used to describe something,
> such as for example tables of chains of rules:
>
> filter
>   \__ INPUT
>   |    \__ some rule
>   \__ FORWARD
>   |    \__ rule2
>   |    \__ rule3
>   |    \__ rule4
>   \__ OUTPUT
>        \__ rule5
>        \__ rule6
>
> For this document, child objects are always “nested” within a
> parent object, irrespective of the serialized encoding.
>
> There are different ways to encode such a tree structure into a
> serialized stream. In many Netlink protocols, children attributes
> are encapsulated (a. k. a. “nested”, though we will avoid this
> term to avoid double-use) and treated as a whole as a parent's
> opaque data. We will call this format “Encapsulated Encoding”.
>
> To encode an attribute's length, struct nlattr only has a 16-bit
> field, which means the attribute header plus payload is limited
> to 64 KB. This is easily exceeded with the encapsulated
> encoding, because a chain attribute would have to hold all the
> rules of that chain, for example. The problem is aggravated by the
> kernel's Netlink handler only allocating skbs a page size worth,
> which in the worst case means that the usable payload for
> attributes is around 3600 bytes only. In light of xt_u32's private
> data block being 1984 bytes already, that means that you won't be
> able to fit two -m u32 invocations nested in a single rule into a
> dump.
>
> The Xtables2 Netlink protocol however encodes each node as a
> standalone attribute, to be called Flat Encoding, that is
> appended (a. k. a. “chained”) to the data stream. This makes it
> possible to split requests and dumps at a finer level than
> encapsulation would. Above all, it guarantees extensions a
> minimum size for their data blocks.
>
> Since Netlink messages do have a 32-bit quantity to store the
> message length, rulesets of roughly up to 4 GB are possible,
> which is currently regarded as sufficient. The largest
> meaningful rulesets seen to date in the industry weighed in at
> approximately 150 MB.

You can split data into several messages and avoid this limitation.

> Whereas attribute nesting automatically provided for boundaries,
> this is realized using a dummy attribute in the chained approach.
> Certain attributes can start such a flattened nesting, and
> NFXTA_STOP terminates it.

I don't like this trailing attribute, see below.

> 2 Attributes
>
> The meaning of attributes depends upon the nesting level in which
> they appear. Their type however remains the same, such that a
> single Netlink attribute validation policy object (struct
> nla_policy) is sufficient.
>
> A table of all known attributes:
>
> +--------+-----------------+---------------+----------------+
> | Value  | Mnemonic        |    C type     | NLA type       |
> +--------+-----------------+---------------+----------------+
> +--------+-----------------+---------------+----------------+
> |   1    | NFXTA_STOP      |               | NLA_FLAG       |
> +--------+-----------------+---------------+----------------+
> |   2    | NFXTA_ERRNO     |     int       | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |   3    | NFXTA_NAME      |   char []     | NLA_NUL_STRING |
> +--------+-----------------+---------------+----------------+
> |   4    | NFXTA_CHAIN     |               | NLA_FLAG       |
> +--------+-----------------+---------------+----------------+
> |   5    | NFXTA_HOOKNUM   | unsigned int  | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |   6    | NFXTA_PRIORITY  |     int       | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |   7    | NFXTA_NFPROTO   |   uint8_t     | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_RULE      |               | NLA_FLAG       |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_OFFSET    | unsigned int  | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_LENGTH    |    size_t     | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_VERDICT   | unsigned int  | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_MATCH     |               | NLA_FLAG       |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_DATA      |               | NLA_BINARY     |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_TARGET    |               | NLA_FLAG       |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_JUMP      |   char []     | NLA_NUL_STRING |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_GOTO      |   char []     | NLA_NUL_STRING |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_REVISION  |   uint8_t     | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_SIZE      |    size_t     | NLA_U32        |
> +--------+-----------------+---------------+----------------+
> |        | NFXTA_HOOKMASK  | unsigned int  | NLA_U32        |
> +--------+-----------------+---------------+----------------+
>
>
> The kernel ignores attributes with value 0 during validation, so
> it was left unused.
>
> 2.1 Nest level terminator<sub:nfxta_stop>
>
> This attribute serves to denote the end of a nesting level as
> introduced by NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH or
> NFXTA_TARGET. It has no data portion.
>
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_STOP         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

It's not a good idea to make assumptions on the order of the TLVs in a 
Netlink message. I mean, you should not assume that NFXTA_STOP comes 
after one specific attribute.

> 2.2 Dump error code<sub:nfxta_errno>
>
> Once a NLM_F_MULTI dump operation has been started, for example
> with the NFXTM_CHAIN_DUMP request, Netlink kernel users must
> always end it successfully with NLMSG_DONE. To convey an error
> during the dump, Xtables2 will emit a NFXTA_ERRNO attribute into
> the stream (if it can), emit no further attributes for the
> request, and cause the dump to stop.
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 8                   | nla_type = NFXTA_ERRNO        |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | int errno;                                                    |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Isn't struct nlmsgerr (NLMSG_ERROR) OK for your needs?

> 2.3 Match extension<sub:nfxta_match>
>
> Invocation of a match is represented using the NFXTA_MATCH
> attribute which starts a nest level. A match attribute must
> contain two attributes:
>
> • NFXTA_NAME: the name of the match extension
>
> • NFXTA_DATA: data private to this instance of the extension
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_MATCH        |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . name of the extension, e.g. "hashlimit"                       .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . e.g. struct xt_hashlimit_info                                 .

This is fine during some transition period, but Netlink protocols must 
not encapsulate structures in the payload of their TLVs.

> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_STOP         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> 2.4 Target extension<sub:nfxta_target>
>
> Invocation of a target is represented using the NFXTA_TARGET
> attribute which starts a nest level. A target attribute must
> contain two attributes:
>
> • NFXTA_NAME: the name of the target extension
>
> • NFXTA_DATA: data private to this instance of the extension
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_TARGET       |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . name of the extension, e.g. "TCPMSS"                          .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . e.g. struct xt_tcpmss_info                                    .

Same comment as above.

> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_STOP         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> 2.5 Rule<sub:nfxta_rule>
>
> A rule is started using the NFXTA_RULE attribute, which starts a
> nest level, and is ended with an NFXTA_STOP attribute. Rules can
> contain:
>
> • Zero or more match extensions (NFXTA_MATCH..NFXTA_STOP).
>
> • Zero or more target extensions (NFXTA_TARGET..NFXTA_STOP).
>
> • Zero or one NFXTA_VERDICT attribute that specifies the rule's
>    verdict as data, which can either be NF_ACCEPT or NF_DROP.
>    (Non-normative notes: The supplied verdict is executed if no
>    target has reached a verdict on its own. Omission of the
>    verdict attribute counts as XT_CONTINUE.)
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_RULE         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . matches, targets, verdict                                     .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_STOP         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> 2.6 Chain<sub:nfxta_chain>
>
> A chain is started using the NFXTA_CHAIN attribute, which starts
> a nest level, and is ended with an NFXTA_STOP attribute. Chains
> can contain:
>
> • Zero or one of this group of three (= specify all three, or
>    none at all), specifying that this chain is a base chain
>    hooking in at some point:
>
>    – One NFXTA_HOOKNUM attribute for giving a hook number. This is
>      (unfortunately) dependent on the chosen nfproto, so it is
>      either NF_INET_*, NF_BR_* or NF_ARP_*.
>
>    – One NFXTA_PRIORITY attribute.
>
>    – One NFXTA_NFPROTO attribute that is NFPROTO_*.
>
> • Zero or more rules (NFXTA_RULE..NFXTA_STOP).
>
> Example of a fully populated chain:
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_CHAIN        |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 8                   | nla_type = NFXTA_HOOKNUM      |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | hook number (0..7)                                            |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 8                   | nla_type = NFXTA_PRIORITY     |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | priority (-2147483648..2147483647)                            |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 8                   | nla_type = NFXTA_NFPROTO      |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nfproto value (2=ipv4, 3=arp, 7=bridge, 10=ipv6, 12=decnet)   |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . rules                                                         .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = 4                   | nla_type = NFXTA_STOP         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
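
The chain layout above can be sketched in C as a minimal illustration. The attribute numbers below are placeholders (the real values live in the Xtables2 headers, which are not shown here), and put_attr(), put_base_chain() and struct attr_hdr are hypothetical helpers, with attr_hdr laid out like struct nlattr. Values are written in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder attribute numbers, NOT the real Xtables2 values. */
enum { NFXTA_STOP = 1, NFXTA_CHAIN, NFXTA_HOOKNUM, NFXTA_PRIORITY, NFXTA_NFPROTO };

/* Same layout as struct nlattr in linux/netlink.h. */
struct attr_hdr {
	uint16_t nla_len;	/* header plus payload, without padding */
	uint16_t nla_type;
};

#define ATTR_ALIGN(n) (((n) + 3U) & ~3U)

/* Append one attribute at offset off; returns the new offset, or 0 if full. */
static size_t put_attr(unsigned char *buf, size_t cap, size_t off,
		       uint16_t type, const void *data, uint16_t len)
{
	struct attr_hdr h = { (uint16_t)(sizeof(h) + len), type };
	size_t end = off + ATTR_ALIGN(h.nla_len);

	if (end > cap)
		return 0;
	memset(buf + off, 0, end - off);	/* zero the padding too */
	memcpy(buf + off, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + off + sizeof(h), data, len);
	return end;
}

/* Emit an empty base chain: CHAIN, HOOKNUM, PRIORITY, NFPROTO, STOP. */
static size_t put_base_chain(unsigned char *buf, size_t cap,
			     uint32_t hooknum, int32_t priority, uint32_t nfproto)
{
	size_t off = put_attr(buf, cap, 0, NFXTA_CHAIN, NULL, 0);

	if (off)
		off = put_attr(buf, cap, off, NFXTA_HOOKNUM, &hooknum, sizeof(hooknum));
	if (off)
		off = put_attr(buf, cap, off, NFXTA_PRIORITY, &priority, sizeof(priority));
	if (off)
		off = put_attr(buf, cap, off, NFXTA_NFPROTO, &nfproto, sizeof(nfproto));
	if (off)
		off = put_attr(buf, cap, off, NFXTA_STOP, NULL, 0);
	return off;
}
```

A chain without rules occupies 4 + 8 + 8 + 8 + 4 = 32 bytes in this encoding; rules would be appended between NFXTA_NFPROTO and the closing NFXTA_STOP.
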
>
> 3 Message types
>
> 3.1 NFXTM_IDENTIFY: Identification
>
> First and foremost a debug command. And to get something
> (table/chain-independent) that users can glare at (they love
> doing that).
>
> Request:
>
> • nlmsg_type = NFXTM_IDENTIFY;
>
> Response:
>
> • An NFXTA_NAME attribute contains the name and version
>    of the implementation/patchset.
>
> • Zero or more attributes of type NFXTA_MATCH, terminated by
>    NFXTA_STOP, giving meta information about the loaded match
>    extensions. Per available match, a group of three attributes
>    follows:
>
>    – One NFXTA_NAME attribute for the name of the extension
>
>    – One NFXTA_REVISION attribute to denote the version of the
>      extension's parameter protocol
>
>    – One NFXTA_SIZE attribute for the size of its per-instance
>      data block

We can avoid this if structures are split into several TLVs. You can
add new attributes and obsolete old ones.

> • Zero or more attributes of type NFXTA_TARGET, terminated by
>    NFXTA_STOP, giving meta information about the loaded and
>    available target extensions:
>
>    – same attributes as with NFXTA_MATCH above
>
> 3.2 NFXTM_CHAIN_NEW: Create new chain
>
> Request:
>
> • nlmsg_type = NFXTM_CHAIN_NEW;
>
> • NFXTA_NAME attribute carrying the name of the new chain.
>
> • Zero or one of this group of three:
>
>    – NFXTA_HOOKNUM
>
>    – NFXTA_PRIORITY
>
>    – NFXTA_NFPROTO
>
> Response:
>
> • Standard ACK.
>
> Remarks:
>
> Right now, a chain can only be promoted to a base chain during
> creation (as far as the userspace view goes; when the kernel
> exactly installs the nf_hook_ops is not of concern to userspace),
> and it can only be demoted by deleting it. Should a
> NFXTM_CHAIN_PROMOTE be split off the NFXTM_CHAIN_NEW
> functionality?
>
> 3.3 NFXTM_CHAIN_DEL: Delete a chain
>
> Request:
>
> • nlmsg_type = NFXTM_CHAIN_DEL;
>
> • NFXTA_NAME attribute carrying the name of the chain to delete
>
> Response:
>
> • Standard ACK.
>
> 3.4 NFXTM_CHAIN_MOVE: Rename a chain
>
> Request:
>
> • nlmsg_type = NFXTM_CHAIN_MOVE;
>
> • Two NFXTA_NAME attributes (order is important):
>
>    – First one specifies the current name of the chain
>
>    – Second one specifies the new name of the chain
>
> 0                   1                   2                   3
> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nlmsg_len = at least 24                                       |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nlmsg_type = NFXTM_CHAIN_MOVE | nlmsg_flags = NLM_F_REQUEST   |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nlmsg_seq = whatever                                          |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nlmsg_pid = whatever                                          |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = at least 4          | nla_type = NFXTA_NAME         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . old name                                                      .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> | nla_len = at least 4          | nla_type = NFXTA_NAME         |
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> .                                                               .
> . new name                                                      .
> .                                                               .
> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> 3.5 NFXTM_CHAIN_DUMP: Chain dump
>
> Request:
>
> • nlmsg_type = NFXTM_CHAIN_DUMP;
>
> • NFXTA_NAME attribute specifying the name of the chain
>    to dump
>
> Response:
>
> • Zero or one of this group of three:
>
>    – NFXTA_HOOKNUM, NFXTA_PRIORITY, NFXTA_NFPROTO.
>
> • Zero or more NFXTA_RULE attributes as per section
>    [sub:nfxta_rule].
>
> Errors:
>
> • If an error occurs during dump, an NFXTA_ERRNO attribute is
>    emitted into the stream and the dump will immediately terminate
>    with a standard NLMSG_DONE message. No NFXTA_STOP attributes
>    will be emitted if the dump stopped in the middle of a nesting
>    level.
>
> 3.6 NFXTM_TABLE_DUMP: Table dump
>
> Returns an atomic snapshot of the table.
>
> Request:
>
> • nlmsg_type = NFXTM_TABLE_DUMP;
>
> Response:
>
> • Zero or more NFXTA_CHAIN attributes as described in
>    section [sub:nfxta_chain].
>
> 3.7 NFXTM_CHAIN_SPLICE: Add/delete rules
>
> The NFXTM_CHAIN_SPLICE request does a bulk deletion of zero or
> more consecutive rules, followed by a bulk insertion of zero or
> more consecutive rules, all done in an atomic fashion. It
> operates similar to Perl's splice function on arrays. The request
> message needs to have at least the first three attributes.
>
> Request:
>
> • NFXTA_NAME: Name of the chain to modify.
>
> • NFXTA_OFFSET: Index of entry where operation should
>    start.
>
> • NFXTA_LENGTH: Number of entries starting from
>    offset that should be removed. May be zero or more.
>
> • Zero or more NFXTA_RULE as per section [sub:nfxta_rule].
>
> Response:
>
> • Standard ACK.
>
> • Desired: detailed error code and origin of error (result of
>    running ->check in extensions)
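
For readers unfamiliar with Perl's splice, the delete-then-insert semantics of NFXTM_CHAIN_SPLICE can be modelled on a plain array. This is only a sketch: splice_rules() is a hypothetical helper, rules are stood in for by integers, atomicity is not modelled, and rejecting an out-of-range offset or length is an assumption, since the spec text does not say what happens in that case.

```c
#include <stddef.h>
#include <string.h>

/*
 * Remove `length` entries starting at `offset`, then insert `n_new`
 * entries at the same position. `rules` holds `count` entries and has
 * room for `cap`. Returns the new entry count, or -1 if the range is
 * invalid or the result would not fit.
 */
static long splice_rules(int *rules, size_t count, size_t cap,
			 size_t offset, size_t length,
			 const int *new_rules, size_t n_new)
{
	if (offset > count || length > count - offset)
		return -1;
	if (count - length + n_new > cap)
		return -1;
	/* Move the surviving tail to its final position first. */
	memmove(rules + offset + n_new, rules + offset + length,
		(count - offset - length) * sizeof(*rules));
	if (n_new > 0)
		memcpy(rules + offset, new_rules, n_new * sizeof(*rules));
	return (long)(count - length + n_new);
}
```

With offset = count and length = 0 this appends; with n_new = 0 it is a pure deletion; with length = 0 it is a pure insertion.
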
>
> 3.8 NFXTM_TABLE_REPLACE: Replace table
>
> Atomic exchange of an entire table.
>
> Request:
>
> • nlmsg_type = NFXTM_TABLE_REPLACE;
>
> • Zero or more NFXTA_CHAIN attributes as per section
>    [sub:nfxta_chain].
>
> Response:
>
> • Standard ACK.
>
> • Desired: detailed error code and origin of error (result of
>    running ->check in extensions)

That's all for now. Quite exhaustive, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Xtables2 Netlink spec
  2010-11-25 11:42 ` Pablo Neira Ayuso
@ 2010-11-25 13:35   ` Jan Engelhardt
  2010-11-25 14:21     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-25 13:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List


On Thursday 2010-11-25 12:42, Pablo Neira Ayuso wrote:
>>
>> nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETFILTER_XTABLES);
>
> This has to go upon nfnetlink as other netfilter subsystems.

Why so? It is not like Netlink protocols were limited to 32 AFAICS.
Also as told, nfnetlink is not fit for parsing netlink messages where
an attribute type appears more than once. If anything, I would look
into genetlink, though that also starts to look like it cannot do
that.

>> The Xtables2 Netlink protocol however encodes each node as a
>> standalone attribute, to be called Flat Encoding, that is
>> appended (a. k. a. “chained”) to the data stream. This makes it
>> possible to split requests and dumps at a finer level than
>> encapsulation would. Above all, it gets extensions the guarantee
>> to have data blocks of a minimum guaranteed size.
>>
>> Since Netlink messages do have a 32-bit quantity to store the
>> message length, rulesets of roughly up to 4 GB are possible,
>> which is currently regarded as sufficient. The largest (and
>> meaningful) rulesets seen to date in the industry weighed in at
>> approximately 150 MB.
>
>You can split data into several messages and avoid this limitation.

Netlink may have support for splitting messages, but not really
splitting data. So I am just splitting messages at attribute
boundaries like everyone else.

>> Whereas attribute nesting automatically provided for boundaries,
>> this is realized using a dummy attribute in the chained approach.
>> Certain attributes can start such a flattened nesting, and
>> NFXTA_STOP terminates it.
>
> I don't like this trailing attribute, see below.
>
>> This attribute serves to denote the end of a nesting level as
>> introduced by NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH or
>> NFXTA_TARGET. It has no data portion.
>>
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> | nla_len = 4                   | nla_type = NFXTA_STOP         |
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
>It's not a good idea to make assumptions on the order of the TLVs in
>a Netlink message. I mean, you should not assume that NFXTA_STOP
>comes after one specific attribute.

Ordering is a necessary constraint with flat encoding. Furthermore,
rules exhibit order, so even if I were to use encapsulated encoding,
there would be ordering requirements.
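
To illustrate why ordering matters with flat encoding, here is a minimal sketch of a receiver that recovers the nesting purely from the order of attributes. The attribute numbers are placeholders, not the real Xtables2 values, and check_nesting() and struct attr_hdr (laid out like struct nlattr) are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder attribute numbers, NOT the real Xtables2 values. */
enum { NFXTA_STOP = 1, NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH, NFXTA_TARGET,
       NFXTA_NAME, NFXTA_DATA };

/* Same layout as struct nlattr in linux/netlink.h. */
struct attr_hdr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define ATTR_ALIGN(n) (((n) + 3U) & ~3U)

static int opens_nest(uint16_t type)
{
	return type == NFXTA_CHAIN || type == NFXTA_RULE ||
	       type == NFXTA_MATCH || type == NFXTA_TARGET;
}

/*
 * Walk a flat attribute stream. Returns the deepest nesting level seen,
 * or -1 if an attribute is truncated, an NFXTA_STOP has no opener, or a
 * nest is still open at the end of the buffer.
 */
static int check_nesting(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int depth = 0, max_depth = 0;

	while (len - off >= sizeof(struct attr_hdr)) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.nla_len < sizeof(h) || h.nla_len > len - off)
			return -1;
		if (opens_nest(h.nla_type)) {
			if (++depth > max_depth)
				max_depth = depth;
		} else if (h.nla_type == NFXTA_STOP && --depth < 0) {
			return -1;
		}
		off += ATTR_ALIGN(h.nla_len);
		if (off > len)
			off = len;	/* last attribute may lack padding */
	}
	return (off == len && depth == 0) ? max_depth : -1;
}
```

Reordering the same attributes changes the result, which is exactly the property a parser that indexes attributes by type (and keeps only one per type) cannot express.
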

The Netlink RFC does not make any statements about what is to follow
nlmsghdr; unless I missed something, it does not mention ordering,
not even attributes at all. So XTNL is free to use what it chooses -
including an nlattr32 that is not compatible with nlattr16.

>> 2.2 Dump error code<sub:nfxta_errno>
>>
>> Once a NLM_F_MULTI dump operation has been started, for example
>> with the NFXTM_CHAIN_DUMP request, Netlink kernel users must
>> always end it successfully with NLMSG_DONE. To convey an error
>> during the dump, Xtables2 will emit a NFXTA_ERRNO attribute into
>> the stream (if it can), emit no further attributes for the
>> request, and cause the dump to stop.
>>
>> 0                   1                   2                   3
>> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> | nla_len = 8                   | nla_type = NFXTA_ERRNO        |
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> | int errno;                                                    |
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> Isn't nlmsg_err OK for your needs?

You cannot abort a dump from the kernel, which is why nlmsg_err
does not get used.

>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> | nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> .                                                               .
>> . e.g. struct xt_hashlimit_info
>
>This is fine during some transition period, but Netlink protocols
>must not encapsulate structures in the payload of their TLVs.

I did not see such a requirement in the Netlink RFC.
Of course it is for existing extensions.

>We can avoid this if structures are split into several TLVs. You
>can add new attributes and obsolete old ones.

Yes, but not at this stage. Complete architectural rewrites of
everything at once come with plenty of problems. Linux evolution has
shown that small incremental reviewable patches are the credo.

Do not worry, I left room in XTNL for attribute upgrades.

* Re: Xtables2 Netlink spec
  2010-11-25 13:35   ` Jan Engelhardt
@ 2010-11-25 14:21     ` Pablo Neira Ayuso
  2010-11-25 21:46       ` Jan Engelhardt
  0 siblings, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-25 14:21 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 25/11/10 14:35, Jan Engelhardt wrote:
>
> On Thursday 2010-11-25 12:42, Pablo Neira Ayuso wrote:
>>>
>>> nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETFILTER_XTABLES);
>>
>> This has to go upon nfnetlink as other netfilter subsystems.
>
> Why so? It is not like Netlink protocols were limited to 32 AFAICS.
> Also as told, nfnetlink is not fit for parsing netlink messages where
> an attribute type appears more than once. If anything, I would look
> into genetlink, though that also starts to look like it cannot do
> that.

All netfilter subsystems must go over nfnetlink, dot.

If you are repeating the same attribute in one message, it means that 
you have to split your data into several messages.

>>> The Xtables2 Netlink protocol however encodes each node as a
>>> standalone attribute, to be called Flat Encoding, that is
>>> appended (a. k. a. “chained”) to the data stream. This makes it
>>> possible to split requests and dumps at a finer level than
>>> encapsulation would. Above all, it gets extensions the guarantee
>>> to have data blocks of a minimum guaranteed size.
>>>
>>> Since Netlink messages do have a 32-bit quantity to store the
>>> message length, rulesets of roughly up to 4 GB are possible,
>>> which is currently regarded as sufficient. The largest (and
>>> meaningful) rulesets seen to date in the industry weighed in at
>>> approximately 150 MB.
>>
>> You can split data into several messages and avoid this limitation.
>
> Netlink may have support for splitting messages, but not really
> splitting data. So I am just splitting messages at attribute
> boundaries like everyone else.
>
>>> Whereas attribute nesting automatically provided for boundaries,
>>> this is realized using a dummy attribute in the chained approach.
>>> Certain attributes can start such a flattened nesting, and
>>> NFXTA_STOP terminates it.
>>
>> I don't like this trailing attribute, see below.
>>
>>> This attribute serves to denote the end of a nesting level as
>>> introduced by NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH or
>>> NFXTA_TARGET. It has no data portion.
>>>
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> | nla_len = 4                   | nla_type = NFXTA_STOP         |
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>
>> It's not a good idea to make assumptions on the order of the TLVs in
>> a Netlink message. I mean, you should not assume that NFXTA_STOP
>> comes after one specific attribute.
>
> Ordering is a necessary constraint with flat encoding. Furthermore,
> rules exhibit order, so even if I were to use encapsulated encoding,
> there would be ordering requirements.
>
> The Netlink RFC does not make any statements about what is to follow
> nlmsghdr; unless I missed something, it does not mention ordering,
> not even attributes at all. So XTNL is free to use what it chooses -
> including an nlattr32 that is not compatible with nlattr16.

Just because the Netlink RFC doesn't make any statement doesn't mean
that you can make assumptions. Moreover, that RFC doesn't cover
everything in Netlink; the document needs lots of updates, or several
more RFCs, to specify the many undocumented aspects of Netlink.

BTW, you may want to read this:
http://1984.lsi.us.es/~pablo/docs/spae.pdf

It still misses lots of aspects, including this one, but at least we
have some newer documentation. It's not an RFC; it aims to be a tutorial.

>>> 2.2 Dump error code<sub:nfxta_errno>
>>>
>>> Once a NLM_F_MULTI dump operation has been started, for example
>>> with the NFXTM_CHAIN_DUMP request, Netlink kernel users must
>>> always end it successfully with NLMSG_DONE. To convey an error
>>> during the dump, Xtables2 will emit a NFXTA_ERRNO attribute into
>>> the stream (if it can), emit no further attributes for the
>>> request, and cause the dump to stop.
>>>
>>> 0                   1                   2                   3
>>> 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> | nla_len = 8                   | nla_type = NFXTA_ERRNO        |
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> | int errno;                                                    |
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>
>> Isn't nlmsg_err OK for your needs?
>
> You cannot abort a dump from the kernel, which is why nlmsg_err
> does not get used.

What error can cause a dump from the kernel to be aborted? If we really 
need this, the point would be to add it to netlink instead of 
introducing some ad-hoc facility.

>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> | nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
>>> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>> .                                                               .
>>> . e.g. struct xt_hashlimit_info
>>
>> This is fine during some transition period, but Netlink protocols
>> must not encapsulate structures in the payload of their TLVs.
>
> I did not see such a requirement in the Netlink RFC.
> Of course it is for existing extensions.

Again, the RFC is a useless argument for this, look for a better one. 
Encapsulating structures into TLVs is a *really bad practise* since you 
have to stick to the structure layout, which is indeed the problem that 
we have faced in iptables for 10 years, and that many other interfaces 
in the Linux kernel have.

Supporting the encapsulation of the structure during some time (during 
the transition) may be OK, but it's definitely not the way to go in the 
long run.

Remember that the revision field in iptables is a workaround, and the
result is quite dirty code. The aim at the time we added it was to find
some temporary solution until we could provide an extensible interface
for iptables.

Moreover, if we support Netlink on the wire in the future, you'll have 
problems with encapsulated structures.
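
A minimal sketch of what the one-TLV-per-field approach buys: the receiver picks out the attributes it knows, applies defaults for absent ones, and skips unknown ones, so fields can be added later without a revision number. Everything here is hypothetical (the DEMO_* attribute types, struct demo_limit, the default burst of 5, demo_parse()); it is not the layout of any real extension.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-field attribute types of an imaginary extension. */
enum { DEMO_RATE = 1, DEMO_BURST = 2 };

struct demo_limit {
	uint32_t rate;
	uint32_t burst;
};

/* Same layout as struct nlattr in linux/netlink.h. */
struct attr_hdr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define ATTR_ALIGN(n) (((n) + 3U) & ~3U)

/* Returns 0 on success, -1 on a malformed attribute stream. */
static int demo_parse(const unsigned char *buf, size_t len,
		      struct demo_limit *out)
{
	size_t off = 0;

	out->rate = 0;		/* defaults for absent attributes */
	out->burst = 5;
	while (len - off >= sizeof(struct attr_hdr)) {
		struct attr_hdr h;
		const unsigned char *payload = buf + off + sizeof(h);
		size_t plen;

		memcpy(&h, buf + off, sizeof(h));
		if (h.nla_len < sizeof(h) || h.nla_len > len - off)
			return -1;
		plen = h.nla_len - sizeof(h);
		switch (h.nla_type) {
		case DEMO_RATE:
			if (plen != sizeof(out->rate))
				return -1;
			memcpy(&out->rate, payload, plen);
			break;
		case DEMO_BURST:
			if (plen != sizeof(out->burst))
				return -1;
			memcpy(&out->burst, payload, plen);
			break;
		default:
			break;	/* unknown attribute: skip it */
		}
		off += ATTR_ALIGN(h.nla_len);
		if (off > len)
			off = len;
	}
	return off == len ? 0 : -1;
}
```

With an encapsulated struct, by contrast, sender and receiver must agree on the exact size and layout of the whole block.
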

>> We can avoid this if structures are split into several TLVs. You
>> can add new attributes and obsolete old ones.
>
> Yes, but not at this stage. Complete architectural rewrites of
> everything at once comes with plenty of problems. Linux evolution has
> shown that small incremental reviewable patches are the credo.
>
> Do not worry, I left room in XTNL for attributes upgrades.

BTW, I didn't look at your protocol in depth yet, but I'd suggest the
following basis to rework it: one netlink message, one rule operation.

* Re: Xtables2 Netlink spec
  2010-11-25 14:21     ` Pablo Neira Ayuso
@ 2010-11-25 21:46       ` Jan Engelhardt
  2010-11-26  8:25         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-25 21:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List

On Thursday 2010-11-25 15:21, Pablo Neira Ayuso wrote:

> On 25/11/10 14:35, Jan Engelhardt wrote:
>>
>> On Thursday 2010-11-25 12:42, Pablo Neira Ayuso wrote:
>>>>
>>>> nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETFILTER_XTABLES);
>>>
>>> This has to go upon nfnetlink as other netfilter subsystems.
>>
>> Why so? It is not like Netlink protocols were limited to 32 AFAICS.
>> Also as told, nfnetlink is not fit for parsing netlink messages where
>> an attribute type appears more than once. If anything, I would look
>> into genetlink, though that also starts to look like it cannot do
>> that.
>
>All netfilter subsystems must go over nfnetlink, dot.

Sorry, I don't quite buy that argument just yet. There ought to be some 
technical arguments that justify the use of an extra layer like 
nfnetlink — other than the version field in struct nfgenmsg.

In fact, why don't we just use genetlink for new code instead?

>BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>following basis to rework it: one netlink message, one rule operation.

I can agree with that suggestion, so I will be doing that.

>Because the Netlink RFC doesn't make any statement, it doesn't mean 
>that you can make assumptions.

On the other extreme, Perl5 has shown that, when there is no full
specification, the code fills in. As it stands, af_netlink.c does
support attribute ordering :-)

>What error can cause a dump from the kernel to be aborted? If we really need
>this, the point would be to add it to netlink instead of introducing some
>ad-hoc facility.

Some memory needs to be allocated and stored, right before 
netlink_dump_start is called. It cannot be stored however, because 
nlk->cb->cb_args is inaccessible from outside of the dump function. So 
the lookup and allocation is currently done inside the dump function 
during its first iteration, and can return ENOENT, EINVAL, ENOMEM and 
all that.

>Again, the RFC is a useless argument for this, look for a better one. 
>Encapsulating structures into TLVs is a *really bad practise* since you 
>have to stick to the structure layout, which is indeed the problem that 
>we have faced in iptables for 10 years, and that many other interfaces 
>in the Linux kernel have.

And yet, when Dave was presented with two alternative proposals for
64-bit interface counters, he preferred to have struct rtnl_link_stats64 
used for IFLA_STATS64 rather than one-attribute-per-entity.

Something is rotten in the state of net/. ;-)

>Supporting the encapsulation of the structure during some time (during the
>transition) may be OK, but it's definitely not the way to go in the long run.
>Moreover, if we support Netlink on the wire in the future, you'll
>have problems with encapsulated structures.

Future, yes. Requiring doing that now — increasing the patch queue depth 
and time-to-kernel — is not productive. A developer's (digital) health 
is downed by todo lists, and replenished by successful merges/patch 
application.

thanks,
Jan

* Re: Xtables2 Netlink spec
  2010-11-25 21:46       ` Jan Engelhardt
@ 2010-11-26  8:25         ` Pablo Neira Ayuso
  2010-11-26 13:59           ` Jan Engelhardt
                             ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-26  8:25 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 25/11/10 22:46, Jan Engelhardt wrote:
> 
> On Thursday 2010-11-25 15:21, Pablo Neira Ayuso wrote:
> 
>> On 25/11/10 14:35, Jan Engelhardt wrote:
>>>
>>> On Thursday 2010-11-25 12:42, Pablo Neira Ayuso wrote:
>>>>>
>>>>> nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETFILTER_XTABLES);
>>>>
>>>> This has to go upon nfnetlink as other netfilter subsystems.
>>>
>>> Why so? It is not like Netlink protocols were limited to 32 AFAICS.
>>> Also as told, nfnetlink is not fit for parsing netlink messages where
>>> an attribute type appears more than once. If anything, I would look
>>> into genetlink, though that also starts to look like it cannot do
>>> that.
>>
>> All netfilter subsystems must go over nfnetlink, dot.
> 
> Sorry, I don't quite buy that argument just yet. There ought to be some 
> technical arguments that justify the use of an extra layer like 
> nfnetlink — other than the version field in struct nfgenmsg.

It's a design argument, so it's indeed technical. Nfnetlink provides
multiplexing on top of one of the Netlink buses for our netfilter
subsystems. It was introduced into the Linux kernel tree a bit before
GeNetlink, although it had already been out-of-tree for quite some time
before it was applied.

> In fact, why don't we just use genetlink for new code instead?

Genetlink is similar. The main difference is that the family ID number
and multicast groups for each subsystem are not fixed; they are
registered at runtime. This means that you have to do the "family name
resolution", i.e. send a message to resolve the family ID number and
multicast groups before doing any operation.
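
For reference, the resolution step amounts to one extra request to the generic netlink controller. The sketch below only builds that request in a buffer; sending it and parsing the reply (which carries CTRL_ATTR_FAMILY_ID) are left out. build_getfamily() is a hypothetical helper, and the family name passed in is whatever the subsystem would register; Xtables2 does not register one. The constants and macros come from linux/netlink.h and linux/genetlink.h.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <string.h>

/*
 * Fill buf (which must be suitably aligned for struct nlmsghdr) with a
 * CTRL_CMD_GETFAMILY request asking for the ID of family `name`.
 * Returns the message length, or 0 if buf is too small.
 */
static size_t build_getfamily(void *buf, size_t cap, const char *name,
			      unsigned int seq)
{
	size_t name_len = strlen(name) + 1;	/* include the NUL */
	size_t total = NLMSG_LENGTH(GENL_HDRLEN) +
		       NLA_ALIGN(NLA_HDRLEN + name_len);
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *genl;
	struct nlattr *nla;

	if (total > cap)
		return 0;
	memset(buf, 0, total);
	nlh->nlmsg_len = total;
	nlh->nlmsg_type = GENL_ID_CTRL;		/* the controller's fixed ID */
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	genl = NLMSG_DATA(nlh);
	genl->cmd = CTRL_CMD_GETFAMILY;
	genl->version = 1;

	nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	nla->nla_len = NLA_HDRLEN + name_len;
	nla->nla_type = CTRL_ATTR_FAMILY_NAME;
	memcpy((char *)nla + NLA_HDRLEN, name, name_len);
	return total;
}
```

The cost is one request/reply round trip per socket before the first real operation; the resolved ID can be cached afterwards.
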

Another reason is consistency, it's a good idea to use the mechanism
that other netfilter subsystems already use.

>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>> following basis to rework it: one netlink message, one rule operation.
> 
> I can agree with that suggestion, so I will be doing that.

Great. This approach requires more memory because you spend one netlink
header for every rule, but the cost is worth it since it provides
flexibility.
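
As a rough back-of-the-envelope sketch of that cost: struct nlmsghdr is 16 bytes, and the sketch assumes each per-rule message also repeats the 4-byte xtnetlink_genhdr from section 1.2. It ignores any attributes (such as the chain name) that would have to be repeated per message. split_overhead() is a hypothetical helper.

```c
#include <stddef.h>

enum {
	NLMSG_HDR_BYTES = 16,	/* sizeof(struct nlmsghdr) */
	XT_GENHDR_BYTES = 4,	/* struct xtnetlink_genhdr: one uint32_t */
};

/* Extra header bytes when n_rules travel in n_rules messages, not one. */
static size_t split_overhead(size_t n_rules)
{
	if (n_rules <= 1)
		return 0;
	return (n_rules - 1) * (NLMSG_HDR_BYTES + XT_GENHDR_BYTES);
}
```

Under these assumptions, a ruleset of one million rules would carry a little under 20 MB of additional headers.
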

>> Because the Netlink RFC doesn't make any statement, it doesn't mean 
>> that you can make assumptions.
> 
> On the other extreme, Perl5 has shown that, when there is no full
> specification, the code fills in. As it stands, af_netlink.c does
> support attribute ordering :-)

I agree, it would be great to have some more specifications. However, 1)
someone would have to enjoy doing that, and 2) the Linux kernel evolves
so quickly that documenting it remains a daunting task. Anyway, I am not
throwing in the towel on documentation; actually, I'd like to do that.

You are quite prolific at writing documentation. If I write some drafts,
let me know whether you would be interested in contributing to or
reviewing them. Or let me know if you decide to write something; I'd be
pleased to contribute, of course.

>> What error can cause a dump from the kernel to be aborted? If we really need
>> this, the point would be to add it to netlink instead of introducing some
>> ad-hoc facility.
> 
> Some memory needs to be allocated and stored, right before 
> netlink_dump_start is called. It cannot be stored however, because 
> nlk->cb->cb_args is inaccessible from outside of the dump function. So 
> the lookup and allocation is currently done inside the dump function 
> during its first iteration, and can return ENOENT, EINVAL, ENOMEM and 
> all that.

What is that initial data handling in dumps for?

>> Again, the RFC is a useless argument for this, look for a better one. 
>> Encapsulating structures into TLVs is a *really bad practise* since you 
>> have to stick to the structure layout, which is indeed the problem that 
>> we have faced in iptables for 10 years, and that many other interfaces 
>> in the Linux kernel have.
> 
> And yet, when Dave was presented with two alternating proposals for 
> 64-bit interface counters, he preferred to have struct rtnl_link_stats64 
> used for IFLA_STATS64 rather than one-attribute-per-entity.
> 
> Something is rotten in the state of net/. ;-)

If Dave decides that, it's because he's sure that such a structure will
not suffer any changes in the future. This assumption has been proven to
be wrong in iptables, since matches/targets may be extended in a
backward-compatible way to include new features.

>> Supporting the encapsulation of the structure during some time (during the
>> transition) may be OK, but it's definitely not the way to go in the long run.
>> Moreover, if we support Netlink on the wire in the future, you'll
>> have problems with encapsulated structures.
> 
> Future, yes. Requiring doing that now — increasing the patch queue depth 
> and time-to-kernel — is not productive. A developer's (digital) health 
> is downed by todo lists, and replenished by successful merges/patch 
> application.

Indeed, I understand that it will be a lot of work to split every single
structure for the first patchset. For that reason, I support the idea of
encapsulating structures in the short run. We can later split them.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26  8:25         ` Pablo Neira Ayuso
@ 2010-11-26 13:59           ` Jan Engelhardt
  2010-11-26 19:48             ` Jozsef Kadlecsik
  2010-11-27 11:10             ` Pablo Neira Ayuso
  2010-11-26 15:27           ` Jan Engelhardt
  2010-12-03 21:03           ` Jan Engelhardt
  2 siblings, 2 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-26 13:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List


On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
>
>> In fact, why don't we just use genetlink for new code instead?
>
>Genetlink is similar. The main difference is that the ID family number
>and multicast groups for each subsystem is not fixed, it's registered in
>runtime. This means that you have to make the "family name resolution",
>ie. to send a message to resolve the ID family number and multicast
>groups before doing any operation.

That does not sound like a showstopper, though.

>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>>> following basis to rework it: one netlink message, one rule operation.
>> 
>> I can agree with that suggestion, so I will be doing that.
>
>Great, this approach requires more memory because you spend one netlink
>header for every rule, but the cost is worth since it provides flexibility.

As far as I can see, it won't cost me more memory; char buf[MNL_BUFFER_SIZE]
remains, and the kernel part does not have to keep more state than before
anyway, so it's a zero-sum game.

>> On the other extreme, Perl5 has shown that, when there is no full
>> specification, the code fills in. As it stands, af_netlink.c does
>> support attribute ordering :-)
>
>I agree, it would be great to have some more specifications. However, 1)
>someone would have to like doing that, and 2) Linux kernel evolves so
>quick that documenting aspects remains a daunting task. Anyway, I don't
>throw the towel on documentation, actually I'd like to do that.
>
>You are quite prolific in writing documentation, let me know if you are
>interested if I write some drafts, in case that you want to
>contribute/review. Or let me know if you decide to write something, I'd
>be pleased to contribute of course.

I have a Netlink e-book in the pipeline, though it's more of a large
howto (like "Writing Netfilter Modules") than an academically dry RFC.

>>[...] memory needs to be allocated and stored, right before
>>netlink_dump_start is called. [But] because nlk->cb->cb_args is
>>inaccessible from outside[...], the lookup and allocation is
>>currently done inside the dump function[...]
>
>What is that initial data handling in dumps for?

Making an atomic snapshot/copy of the table. A userspace client
could take almost indefinitely on retrieving a table, so it is
possible that something else changes tables meanwhile.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26  8:25         ` Pablo Neira Ayuso
  2010-11-26 13:59           ` Jan Engelhardt
@ 2010-11-26 15:27           ` Jan Engelhardt
  2010-11-27 12:25             ` Pablo Neira Ayuso
  2010-12-03 21:03           ` Jan Engelhardt
  2 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-26 15:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List

On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:

>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>>> following basis to rework it: one netlink message, one rule operation.
>>
>> I can agree with that suggestion, so I will be doing that.
>
>Great, this approach requires more memory because you spend one netlink
>header for every rule, but the cost is worth since it provides flexibility.

Hm, I remembered a problem with that. With "allow same attribute type
multiple times", it is possible to send a single TABLE_REPLACE
request message (even if it is 150 MB in size) that the kernel part
can then work on. Without it, and instead using per-rule ops, it
would mean that I would have to keep a per-fd state (which seems not
possible) and make use of the NETLINK_URELEASE notification handler
to kill said state when the client goes away unexpectedly.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-24 22:29 Xtables2 Netlink spec Jan Engelhardt
  2010-11-25 11:42 ` Pablo Neira Ayuso
@ 2010-11-26 19:01 ` Jozsef Kadlecsik
  2010-12-09 12:08   ` Pablo Neira Ayuso
  2010-12-15  4:55   ` Jan Engelhardt
  1 sibling, 2 replies; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-26 19:01 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

Hi Jan,

On Wed, 24 Nov 2010, Jan Engelhardt wrote:

> By request of Pablo, I am posting the Xtables2 Netlink interface 
> specification for review. Additionally, further documentation and 
> toolchain around it is available through the temporary project page at
> 
> 	http://jengelh.medozas.de/projects/xtables/
> 
> which currently includes
> 
>  * User Documentation Chapter 1: Architectural Differences
> 
>  * Developer Documentation Part 1: Netlink interface (WIP)
>    This is copied below to facilitate inline replies
> 
>  * Runnable Linux source tree
> 
>  * Runnable userspace library (libnetfilter_xtables)
>    with small test-and-debug program
[...]

Please add fine-grained error reporting to the protocol: in my opinion the 
main shortcoming of the current kernel-userspace xtables protocol is the 
lack of proper error reporting. I mean, the new protocol should be 
able to carry back which rule caused the error, whether it was 
a general kind of error (ENOMEM) or a table, chain, match or target error, 
and exactly what the error was at table/chain/match/target level.

Say, the TCPMSS target should be able to report back that it cannot be 
used outside of FORWARD, OUTPUT and POSTROUTING. Or that the rule must 
match TCP SYN packets.
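
As a rough illustration of what such an error record could carry, here is a minimal sketch in C. Every name in it (the nfxt_ prefix, the enum, the struct fields, the message layout) is hypothetical and not part of the posted specification; it only shows one way to combine a rule index, an error level, an errno value and a free-form detail string.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: which layer of the ruleset rejected the request. */
enum nfxt_err_level {
	NFXT_ERR_GENERAL,	/* e.g. ENOMEM, not tied to one object */
	NFXT_ERR_TABLE,
	NFXT_ERR_CHAIN,
	NFXT_ERR_MATCH,
	NFXT_ERR_TARGET,
};

/* Hypothetical error record the kernel side could send back. */
struct nfxt_rule_error {
	unsigned int rule_index;	/* which rule caused the error */
	enum nfxt_err_level level;	/* where the error was detected */
	int code;			/* positive errno value */
	const char *name;		/* table/chain/extension name, or NULL */
	const char *detail;		/* human-readable explanation */
};

/* Render the record as one line; returns what snprintf() returns. */
static int nfxt_format_error(const struct nfxt_rule_error *e,
                             char *buf, size_t len)
{
	static const char *const level_name[] = {
		"general", "table", "chain", "match", "target",
	};

	return snprintf(buf, len, "rule %u: %s %s: %s (%s)",
	                e->rule_index, level_name[e->level],
	                e->name != NULL ? e->name : "-",
	                e->detail, strerror(e->code));
}
```

With such a record, the TCPMSS case above could be reported as a target-level error on a specific rule instead of a bare EINVAL.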

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 13:59           ` Jan Engelhardt
@ 2010-11-26 19:48             ` Jozsef Kadlecsik
  2010-11-26 19:55               ` Jan Engelhardt
  2010-11-27 11:10             ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-26 19:48 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Fri, 26 Nov 2010, Jan Engelhardt wrote:

> On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
> >
> >>[...] memory needs to be allocated and stored, right before
> >>netlink_dump_start is called. [But] because nlk->cb->cb_args is
> >>inaccessible from outside[...], the lookup and allocation is
> >>currently done inside the dump function[...]
> >
> >What is that initial data handling in dumps for?
> 
> Making an atomic snapshot/copy of the table. A userspace client
> could take almost indefinitely on retrieving a table, so it is
> possible that something else changes tables meanwhile.

Why don't you lock the tables during dumping? That way the tables won't 
change, however long the dump takes. Snapshotting the table looks like a 
waste of memory and time.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 19:48             ` Jozsef Kadlecsik
@ 2010-11-26 19:55               ` Jan Engelhardt
  2010-11-26 20:05                 ` Jozsef Kadlecsik
  2010-11-29 12:23                 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-26 19:55 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List


On Friday 2010-11-26 20:48, Jozsef Kadlecsik wrote:
>> >
>> >What is that initial data handling in dumps for?
>> 
>> Making an atomic snapshot/copy of the table. A userspace client
>> could take almost indefinitely on retrieving a table, so it is
>> possible that something else changes tables meanwhile.
>
>Why don't you lock the tables during dumping? That way the tables won't 
>change, whatever long time the dump takes. Snapshotting the table looks as 
>wasting memory and time.

For that to work, I would have to use a locking primitive that can be
held across returns to userspace, which leaves semaphores as the only
option and, ya, I didn't quite feel like using _that_. Also sounds a
bit like a killer if an admin cannot update a table just because he
forgot some dumper process in the background in suspended state. :-/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 19:55               ` Jan Engelhardt
@ 2010-11-26 20:05                 ` Jozsef Kadlecsik
  2010-11-26 21:33                   ` Jan Engelhardt
  2010-11-29 12:23                 ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-26 20:05 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Fri, 26 Nov 2010, Jan Engelhardt wrote:

> On Friday 2010-11-26 20:48, Jozsef Kadlecsik wrote:
> >> >
> >> >What is that initial data handling in dumps for?
> >> 
> >> Making an atomic snapshot/copy of the table. A userspace client
> >> could take almost indefinitely on retrieving a table, so it is
> >> possible that something else changes tables meanwhile.
> >
> >Why don't you lock the tables during dumping? That way the tables won't 
> >change, whatever long time the dump takes. Snapshotting the table looks as 
> >wasting memory and time.
> 
> For that to work, I would have to use a locking primitive that can be
> held across returns to userspace, which leaves semaphores as the only
> option and, ya, I didn't quite feel like using _that_. Also sounds a
> bit like a killer if an admin cannot update a table just because he
> forgot some dumper process in the background in suspended state. :-/

There's already an internal mutex there (cb_mutex), to serialize dumping. 
(It's unfortunate that cb_mutex cannot be accessed, for other purposes.) 
And the kernel->userspace messaging is asynchronous, so I think a suspended 
dumping process won't hurt: the messages will wait in the queue, that's 
all.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 20:05                 ` Jozsef Kadlecsik
@ 2010-11-26 21:33                   ` Jan Engelhardt
       [not found]                     ` <alpine.DEB.2.00.1011270951330.20431@blackhole.kfki.hu>
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-26 21:33 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List


On Friday 2010-11-26 21:05, Jozsef Kadlecsik wrote:
>> >
>> >Why don't you lock the tables during dumping? That way the tables won't 
>> >change, whatever long time the dump takes. Snapshotting the table looks as 
>> >wasting memory and time.
>> 
>> For that to work, I would have to use a locking primitive that can be
>> held across returns to userspace, which leaves semaphores as the only
>> option and, ya, I didn't quite feel like using _that_. Also sounds a
>> bit like a killer if an admin cannot update a table just because he
>> forgot some dumper process in the background in suspended state. :-/
>
>There's already an internal mutex there (cb_mutex), to serialize dumping.
>(It's unfortunate that cb_mutex cannot be accessed, for other purposes.) 
>And the kernel->userspace messaging is asynchronous, so I think suspended 
>dumping process won't hurt: the messages will wait in the queue, that's 
>all.

It's not about serialization of dumps, but when a writer shows up:

- process A starts a read operation dump
- process A gets forgotten in whatever way
- process B tries to do a _write_ operation on the table

Locking writers out because a reader does not want to finish sounds bad.
But if A does not take a lock, A can get back a non-atomic snapshot.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 13:59           ` Jan Engelhardt
  2010-11-26 19:48             ` Jozsef Kadlecsik
@ 2010-11-27 11:10             ` Pablo Neira Ayuso
  1 sibling, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-27 11:10 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 26/11/10 14:59, Jan Engelhardt wrote:
> 
> On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
>>
>>> In fact, why don't we just use genetlink for new code instead?
>>
>> Genetlink is similar. The main difference is that the ID family number
>> and multicast groups for each subsystem is not fixed, it's registered in
>> runtime. This means that you have to make the "family name resolution",
>> ie. to send a message to resolve the ID family number and multicast
>> groups before doing any operation.
> 
> That does not sound like a showstopper, though.
>
>>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>>>> following basis to rework it: one netlink message, one rule operation.
>>>
>>> I can agree with that suggestion, so I will be doing that.
>>
>> Great, this approach requires more memory because you spend one netlink
>> header for every rule, but the cost is worth since it provides flexibility.
> 
> As far as I can see, it won't cost me more memory; char buf[MNL_BUFFER_SIZE]
> remains, and the kernel part does not have to keep more state than before
> anyway, so it's a zero-sum game.

I meant that less rule-set data will fit into one buffer, but that extra
memory consumption gives you flexibility in return.

>>> On the other extreme, Perl5 has shown that, when there is no full
>>> specification, the code fills in. As it stands, af_netlink.c does
>>> support attribute ordering :-)
>>
>> I agree, it would be great to have some more specifications. However, 1)
>> someone would have to like doing that, and 2) Linux kernel evolves so
>> quick that documenting aspects remains a daunting task. Anyway, I don't
>> throw the towel on documentation, actually I'd like to do that.
>>
>> You are quite prolific in writing documentation, let me know if you are
>> interested if I write some drafts, in case that you want to
>> contribute/review. Or let me know if you decide to write something, I'd
>> be pleased to contribute of course.
> 
> I have a Netlink e-book in the pipeline, though it's more of a large
> howto (like "Writing Netfilter Modules") than an academically dry RFC.

I have a 25-page document that looks like a HOWTO for libmnl and
Netlink sockets in general that I'll release soon. I didn't publish it
yet because I wanted to release the library first. It comes after my
article.

But you are free not to cooperate and do the "lone rider" thing, of course.

>>> [...] memory needs to be allocated and stored, right before
>>> netlink_dump_start is called. [But] because nlk->cb->cb_args is
>>> inaccessible from outside[...], the lookup and allocation is
>>> currently done inside the dump function[...]
>>
>> What is that initial data handling in dumps for?
> 
> Making an atomic snapshot/copy of the table. A userspace client
> could take almost indefinitely on retrieving a table, so it is
> possible that something else changes tables meanwhile.

Why don't you lock the table during the dump?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 15:27           ` Jan Engelhardt
@ 2010-11-27 12:25             ` Pablo Neira Ayuso
  0 siblings, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-27 12:25 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 26/11/10 16:27, Jan Engelhardt wrote:
> 
> On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
> 
>>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>>>> following basis to rework it: one netlink message, one rule operation.
>>>
>>> I can agree with that suggestion, so I will be doing that.
>>
>> Great, this approach requires more memory because you spend one netlink
>> header for every rule, but the cost is worth since it provides flexibility.
> 
> Hm, I remembered a problem with that. With "allow same attribute type
> multiple times", it is possible to send a single TABLE_REPLACE
> request message (even if it is 150 MB in size) that the kernel part
> can then work on. Without it, and instead using per-rule ops, it
> would mean that I would have to keep a per-fd state (which seems not
> possible) and make use of the NETLINK_URELEASE notification handler
> to kill said state when the client goes away unexpectedly.

You can lock the table during the dump to prevent anyone from modifying
the rule-set (we can return EAGAIN to the one trying to add some rule,
so it can retry).
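
A minimal userspace sketch of that idea follows, assuming a single "busy" flag per table that a dump holds from start to finish. The struct, the function names and the use of a C11 atomic_flag as a try-lock are all illustrative assumptions, not kernel code and not part of the proposed interface.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table: each character of `entries` stands for one rule. */
struct table {
	atomic_flag busy;	/* set while a dump or an update owns the table */
	char entries[32];
};

static void table_init(struct table *t, const char *entries)
{
	atomic_flag_clear(&t->busy);
	snprintf(t->entries, sizeof(t->entries), "%s", entries);
}

/* A dump takes the table for its whole duration. */
static int dump_start(struct table *t)
{
	return atomic_flag_test_and_set(&t->busy) ? -EAGAIN : 0;
}

static void dump_done(struct table *t)
{
	atomic_flag_clear(&t->busy);
}

/* A writer never blocks: it fails with -EAGAIN and is expected to retry. */
static int rule_append(struct table *t, char rule)
{
	size_t len;
	int ret = 0;

	if (atomic_flag_test_and_set(&t->busy))
		return -EAGAIN;	/* someone else (e.g. a dump) owns the table */

	len = strlen(t->entries);
	if (len + 1 >= sizeof(t->entries)) {
		ret = -ENOSPC;
	} else {
		t->entries[len] = rule;
		t->entries[len + 1] = '\0';
	}
	atomic_flag_clear(&t->busy);
	return ret;
}
```

In this model a dumper that never calls dump_done() keeps every writer at -EAGAIN indefinitely, which is exactly the situation debated in this thread.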

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
       [not found]                     ` <alpine.DEB.2.00.1011270951330.20431@blackhole.kfki.hu>
@ 2010-11-27 13:39                       ` Jan Engelhardt
  2010-11-27 17:04                         ` Jozsef Kadlecsik
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-27 13:39 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Saturday 2010-11-27 10:06, Jozsef Kadlecsik wrote:
>
>a suspended process could cause any problem, we faced it already.

* let there be a list of 6 entries of data, ABCDEF
* proc A starts a dump and reads, say, 1 of 6 entries (A)
* proc B adds a new entry at the start of the list, Z,
  and deletes an entry, A
  (and does these two actions atomically)
* proc A reads the rest

Proc A reads the sequence ACDEF. In case of interfaces or ct entries,
that does not matter, because they are pretty much independent of
one another.
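
The kind of torn read meant here can be reproduced with a small userspace model. The sketch below is a simplified variant, not the exact sequence listed above: it assumes the reader resumes by entry index and that the writer only deletes the head entry between two reads. The table layout and all function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

/* Hypothetical table: each character of `entries` stands for one entry. */
struct table {
	char entries[MAX_ENTRIES + 1];
};

/*
 * Resumable dump step: copy up to `batch` entries starting at *cursor
 * into `out`, advance the cursor, and return the number copied.
 */
static size_t dump_batch(const struct table *t, size_t *cursor,
                         size_t batch, char *out)
{
	size_t len = strlen(t->entries), n = 0;

	assert(batch <= MAX_ENTRIES);
	while (*cursor < len && n < batch)
		out[n++] = t->entries[(*cursor)++];
	out[n] = '\0';
	return n;
}

/* Writer: delete the first entry; all later entries move up one slot. */
static void delete_head(struct table *t)
{
	size_t len = strlen(t->entries);

	if (len > 0)
		memmove(t->entries, t->entries + 1, len); /* moves the NUL too */
}

/* Read one entry, let the writer run, then read the remainder. */
static void torn_dump(char *result)
{
	struct table t = { "ABCDEF" };
	size_t cursor = 0;
	char chunk[MAX_ENTRIES + 1];

	result[0] = '\0';
	dump_batch(&t, &cursor, 1, chunk);	/* reader sees "A" */
	strcat(result, chunk);
	delete_head(&t);			/* table is now "BCDEF" */
	while (dump_batch(&t, &cursor, MAX_ENTRIES, chunk) > 0)
		strcat(result, chunk);		/* resumes at index 1: "CDEF" */
}
```

Under these assumptions the reader ends up with ACDEF, a list that never existed in the table at any single point in time.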

However, a ruleset's rules are not independent in this way. Suppose:

   Z = -A INPUT -m conntrack --ctstate ESTABLISHED
   A = -A INPUT -p icmp -j ACCEPT
   B = -A INPUT -p tcp --dport 22 -j ACCEPT
   C = -A INPUT -j DROP
   D = -A OUTPUT -p icmp -j ACCEPT
   E = -A OUTPUT -p tcp --dport 22 -j ACCEPT
   F = -A OUTPUT -j DROP

One would want to either see ABCDEF, or ZBCDEF, because
presenting ACDEF to the user,

	(a) -A INPUT -p icmp -j ACCEPT
	(c) -A INPUT -j DROP

would mean to him that SSH cannot be accessed, which is not the case.
We need locks, yeah.

The entire ruleset and chains are RCU-protected, and the actual
packet processing holds rcu_read_lock for the entire period, so it is
already guaranteed to see either ABCDEF or ZBCDEF, which is fine.

When the kernel dumps however, and the skb is full, and it returns to
userspace, no rcu and no mutex may be held, which gives away the
guarantee of an atomic view. The chain might go away in between,
unless I hold the writer lock. To hold the lock across the entire
dump would require some semaphore, and it does not seem like a good
idea to block users across returns to userspace either.

Currently, the writer lock (mutex) is held as short as possible, to
make a snapshot, or to actually do the replacement. It sort of
works..

Does that help?
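
For illustration, the "hold the writer lock only long enough to copy" approach can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: a pthread mutex stands in for the kernel mutex, the table is a plain string, and every name is hypothetical rather than taken from the Xtables2 source.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table: each character of `entries` stands for one rule. */
struct table {
	pthread_mutex_t lock;	/* writer lock */
	char *entries;		/* NUL-terminated, heap-allocated */
};

/* Private copy that one dump iterates over. */
struct snapshot {
	char *entries;
	size_t cursor;
};

static char *dup_entries(const char *src)
{
	size_t len = strlen(src) + 1;
	char *copy = malloc(len);

	if (copy != NULL)
		memcpy(copy, src, len);
	return copy;
}

static int table_init(struct table *t, const char *entries)
{
	t->entries = dup_entries(entries);
	if (t->entries == NULL)
		return -ENOMEM;
	pthread_mutex_init(&t->lock, NULL);
	return 0;
}

/* Start a dump: the lock is held only while the copy is made. */
static int dump_begin(struct table *t, struct snapshot *s)
{
	pthread_mutex_lock(&t->lock);
	s->entries = dup_entries(t->entries);
	pthread_mutex_unlock(&t->lock);
	s->cursor = 0;
	return s->entries != NULL ? 0 : -ENOMEM;
}

/* Copy up to `batch` entries from the snapshot; returns the count. */
static size_t dump_next(struct snapshot *s, size_t batch, char *out)
{
	size_t n = 0;

	while (s->entries[s->cursor] != '\0' && n < batch)
		out[n++] = s->entries[s->cursor++];
	out[n] = '\0';
	return n;
}

static void dump_end(struct snapshot *s)
{
	free(s->entries);
	s->entries = NULL;
}

/* Writer: swap in a whole new rule list under the same lock. */
static int table_replace(struct table *t, const char *entries)
{
	char *fresh = dup_entries(entries), *old;

	if (fresh == NULL)
		return -ENOMEM;
	pthread_mutex_lock(&t->lock);
	old = t->entries;
	t->entries = fresh;
	pthread_mutex_unlock(&t->lock);
	free(old);
	return 0;
}
```

A reader that started before table_replace() keeps seeing the old list in full, and the writer is never blocked by a stalled reader; the cost is one copy of the table per dump, and dump_begin() can fail with ENOMEM.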

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-27 13:39                       ` Jan Engelhardt
@ 2010-11-27 17:04                         ` Jozsef Kadlecsik
  2010-11-27 17:35                           ` Jan Engelhardt
  0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-27 17:04 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Sat, 27 Nov 2010, Jan Engelhardt wrote:

> On Saturday 2010-11-27 10:06, Jozsef Kadlecsik wrote:
> >
> >a suspended process could cause any problem, we faced it already.
> 
> * let there be a list of 6 entries of data, ABCDEF
> * proc A starts a dump and reads, say, 1 of 6 entries (A)
> * proc B adds a new entry at the start of the list, Z,
>   and deletes an entry, A
>   (and does these two actions atomically)
> * proc A reads the rest
[...] 
> When the kernel dumps however, and the skb is full, and it returns to
> userspace, no rcu and no mutex may be held, which gives away the
> guarantee of an atomic view. The chain might go away in between,
> unless I hold the writer lock. To hold the lock across the entire
> dump would require some semaphore, and it does not seem like a good
> idea to block users across returns to userspace either.

AFAIK when the kernel dumps and the skb is full, it's not returned 
directly to the userspace but first enqueued. And if the userspace 
listener is too slow/not ready to receive the netlink messages from the 
queue, then the queue can get full and messages will be lost. So I think 
the steps are:

* let there be a list of 6 entries of data, ABCDEF
* proc A starts a dump, and kernel enqueues the messages, which
  cover all entries. From the kernel's point of view the dumping is done.
  At the same time Proc A is receiving the messages from the queue...
* proc B adds a new entry at the start of the list, Z,
  and deletes an entry, A (and does these two actions atomically)
* ...proc A reads the rest from the queue.
  If messages are not lost due to the slow userspace handling, then the 
  received state is correct and corresponds to the one when the dump
  was initiated.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-27 17:04                         ` Jozsef Kadlecsik
@ 2010-11-27 17:35                           ` Jan Engelhardt
  2010-11-27 20:42                             ` Jozsef Kadlecsik
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-27 17:35 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
>
>AFAIK when the kernel dumps and the skb is full, it's not returned 
>directly to the userspace but first enqueued.

I don't recognize that inside the code however.

In netlink_dump(), there is the cb->dump call. There are no loops
inside this function. Neither are there in the two parents,
netlink_dump_start() and netlink_recvmsg().

The queueing behavior you mention I only see when the kernel
autonomously sends packets that have not been explicitly
requested by userspace.

>* let there be a list of 6 entries of data, ABCDEF
>* proc A starts a dump, and kernel enqueues the messages, which
>  cover all entries.

Given a big enough ruleset, it won't cover all of them.

>And if the userspace listener is too slow/not ready to receive the netlink
>messages from the queue, then the queue can get full and messages will be
>lost.

If that was the case, we would have seen truncated ct dumps already.
Because the standard queue is just something like 124KB or so, that
would be a mere ~40 fully-stuffed packets in the queue, and people
certainly have more than 40, for safety, even more than 1000
connections live at a time, so somebody has already run into a
queue overflow, iff things were queued in the first place.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-27 17:35                           ` Jan Engelhardt
@ 2010-11-27 20:42                             ` Jozsef Kadlecsik
  2010-11-29 12:30                               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-27 20:42 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Pablo Neira Ayuso, Netfilter Developer Mailing List

On Sat, 27 Nov 2010, Jan Engelhardt wrote:

> On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
> >
> >AFAIK when the kernel dumps and the skb is full, it's not returned 
> >directly to the userspace but first enqueued.
> 
> I don't recognize that inside the code however.
> 
> In netlink_dump(), there is the cb->dump call. There are no loops
> inside this function. Neither are there in the two parents,
> netlink_dump_start() and netlink_recvmsg().

In netlink_dump() after the call to cb->dump, you can see the call to 
skb_queue_tail. So the message is queued.

Where the looping happens, I do not know. Some socket magic?
 
Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 19:55               ` Jan Engelhardt
  2010-11-26 20:05                 ` Jozsef Kadlecsik
@ 2010-11-29 12:23                 ` Pablo Neira Ayuso
  1 sibling, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-29 12:23 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List

On 26/11/10 20:55, Jan Engelhardt wrote:
>
> On Friday 2010-11-26 20:48, Jozsef Kadlecsik wrote:
>>>>
>>>> What is that initial data handling in dumps for?
>>>
>>> Making an atomic snapshot/copy of the table. A userspace client
>>> could take almost indefinitely on retrieving a table, so it is
>>> possible that something else changes tables meanwhile.
>>
>> Why don't you lock the tables during dumping? That way the tables won't
>> change, whatever long time the dump takes. Snapshotting the table looks as
>> wasting memory and time.
>
> For that to work, I would have to use a locking primitive that can be
> held across returns to userspace, which leaves semaphores as the only
> option and, ya, I didn't quite feel like using _that_.

Abusing the Netlink protocol to overcome the "supposed to be" limitation 
does not seem to me the way to go. Moreover, if we ever have more than X 
bytes rule-sets (I don't remember that limit that you have previously 
mentioned), you'll have to add some locking strategy anyway.

The locking is the way to go.

> Also sounds a
> bit like a killer if an admin cannot update a table just because he
> forgot some dumper process in the background in suspended state. :-/

He will notice that he did that because he hits EAGAIN, so he can kill 
the process in background and retry.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-27 20:42                             ` Jozsef Kadlecsik
@ 2010-11-29 12:30                               ` Pablo Neira Ayuso
  2010-11-29 12:39                                 ` Jozsef Kadlecsik
  0 siblings, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-29 12:30 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jan Engelhardt, Netfilter Developer Mailing List

On 27/11/10 21:42, Jozsef Kadlecsik wrote:
> On Sat, 27 Nov 2010, Jan Engelhardt wrote:
>
>> On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
>>>
>>> AFAIK when the kernel dumps and the skb is full, it's not returned
>>> directly to the userspace but first enqueued.
>>
>> I don't recognize that inside the code however.
>>
>> In netlink_dump(), there is the cb->dump call. There are no loops
>> inside this function. Neither are there in the two parents,
>> netlink_dump_start() and netlink_recvmsg().
>
> In netlink_dump() after the call to cb->dump, you can see the call to
> skb_queue_tail. So the message is queued.
>
> Where the looping happens, I do not know. Some socket magic?

1) you send a NLM_F_DUMP request.
2) the kernel fills one skb and enqueues it into the socket buffer.
3) the process invokes recvmsg(), it gets the datagram, then goes back to 
step 2).

Thus, the dump only consumes 1 memory page per recv() invocation. That's 
the magic.
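
The three steps can be modelled in plain C as a resumable callback. This is only a userspace sketch of the control flow, not the kernel implementation: SKB_SIZE, struct dump_cb and both functions are hypothetical, and the resume cursor is merely modelled loosely on the per-dump arguments that a dump callback keeps between invocations.

```c
#include <stddef.h>
#include <string.h>

#define SKB_SIZE 4	/* hypothetical: one "skb" holds four 1-byte entries */

/* State carried from one dump step to the next. */
struct dump_cb {
	const char *table;	/* data being dumped, one char per entry */
	long args[1];		/* args[0]: index of the next entry to emit */
};

/*
 * Step 2: fill one buffer and remember where to resume.
 * Returns the number of bytes placed in `skb`; 0 means the dump is done.
 */
static size_t dump_step(struct dump_cb *cb, char *skb)
{
	size_t len = strlen(cb->table), n = 0;

	while ((size_t)cb->args[0] < len && n < SKB_SIZE)
		skb[n++] = cb->table[cb->args[0]++];
	return n;
}

/*
 * Step 3: each loop iteration stands for one recvmsg() call, which in
 * turn triggers the next dump step. Returns the number of such calls.
 */
static unsigned int recv_all(struct dump_cb *cb, char *out, size_t outlen)
{
	char skb[SKB_SIZE];
	size_t n, used = 0;
	unsigned int calls = 0;

	while ((n = dump_step(cb, skb)) > 0 && used + n < outlen) {
		memcpy(out + used, skb, n);
		used += n;
		calls++;
	}
	out[used] = '\0';
	return calls;
}
```

Only one SKB_SIZE buffer is in flight at a time in this model; nothing further is produced until the reader asks again, which is why a stalled reader leaves the dump half finished.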

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-29 12:30                               ` Pablo Neira Ayuso
@ 2010-11-29 12:39                                 ` Jozsef Kadlecsik
  2010-11-29 12:55                                   ` Pablo Neira Ayuso
  2010-11-29 13:49                                   ` Pablo Neira Ayuso
  0 siblings, 2 replies; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-11-29 12:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Jan Engelhardt, Netfilter Developer Mailing List

On Mon, 29 Nov 2010, Pablo Neira Ayuso wrote:

> On 27/11/10 21:42, Jozsef Kadlecsik wrote:
> > On Sat, 27 Nov 2010, Jan Engelhardt wrote:
> > 
> > > On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
> > > > 
> > > > AFAIK when the kernel dumps and the skb is full, it's not returned
> > > > directly to the userspace but first enqueued.
> > > 
> > > I don't recognize that inside the code however.
> > > 
> > > In netlink_dump(), there is the cb->dump call. There are no loops
> > > inside this function. Neither are there in the two parents,
> > > netlink_dump_start() and netlink_recvmsg().
> > 
> > In netlink_dump() after the call to cb->dump, you can see the call to
> > skb_queue_tail. So the message is queued.
> > 
> > Where the looping happens, I do not know. Some socket magic?
> 
> 1) you send a NLM_F_DUMP request.
> 2) the kernel fills one skb and enqueue it into the socket buffer.
> 3) the process invokes recvmsg(), it gets the datagram, then go back to step
> 2).
> 
> Thus, the dump only consumes 1 memory page per recv() invocation. That's the
> magic.

So Jan has got it right: if the process which initiated the dumping is 
suspended and locking is used, then the suspended process locks out all 
other processes.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-29 12:39                                 ` Jozsef Kadlecsik
@ 2010-11-29 12:55                                   ` Pablo Neira Ayuso
  2010-11-29 13:26                                     ` Jan Engelhardt
  2010-11-29 13:49                                   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-29 12:55 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jan Engelhardt, Netfilter Developer Mailing List

On 29/11/10 13:39, Jozsef Kadlecsik wrote:
> On Mon, 29 Nov 2010, Pablo Neira Ayuso wrote:
>
>> On 27/11/10 21:42, Jozsef Kadlecsik wrote:
>>> On Sat, 27 Nov 2010, Jan Engelhardt wrote:
>>>
>>>> On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
>>>>>
>>>>> AFAIK when the kernel dumps and the skb is full, it's not returned
>>>>> directly to the userspace but first enqueued.
>>>>
>>>> I don't recognize that inside the code however.
>>>>
>>>> In netlink_dump(), there is the cb->dump call. There are no loops
>>>> inside this function. Neither are there in the two parents,
>>>> netlink_dump_start() and netlink_recvmsg().
>>>
>>> In netlink_dump() after the call to cb->dump, you can see the call to
>>> skb_queue_tail. So the message is queued.
>>>
>>> Where the looping happens, I do not know. Some socket magic?
>>
>> 1) you send a NLM_F_DUMP request.
>> 2) the kernel fills one skb and enqueue it into the socket buffer.
>> 3) the process invokes recvmsg(), it gets the datagram, then go back to step
>> 2).
>>
>> Thus, the dump only consumes 1 memory page per recv() invocation. That's the
>> magic.
>
> So Jan has got right: if the process which initiated the dumping is
> suspended and locking is used, then the suspended process locks out all
> other processes.

Suspending the process in the middle of the dump is a strange scenario. 
If you did this accidentally, you would notice it because you would hit 
EAGAIN over and over again, and you could then kill the dumper manually.

A user-space process that stops in the middle of a dump must be 
considered buggy; you have to fix your user-space program.

This does not affect packet processing, only user-space interaction.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-29 12:55                                   ` Pablo Neira Ayuso
@ 2010-11-29 13:26                                     ` Jan Engelhardt
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-11-29 13:26 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List


On Monday 2010-11-29 13:55, Pablo Neira Ayuso wrote:
>
>Suspending the process in the middle of the dump is a strange scenario. If you
>did this accidentally, then you can notice it since you hit EAGAIN over and
>over again, so you can kill the dumper manually.

Though since frontend programs like iptables automatically retry on -EAGAIN,
the user won't see it.

>A user-space process that stops in the middle of a dump must be considered
>buggy, you have to fix your user-space program.

Short of misscheduling of processes on behalf of the kernel ;-)

So barring processes by means of -EAGAIN it is, then.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-29 12:39                                 ` Jozsef Kadlecsik
  2010-11-29 12:55                                   ` Pablo Neira Ayuso
@ 2010-11-29 13:49                                   ` Pablo Neira Ayuso
  1 sibling, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-11-29 13:49 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Jan Engelhardt, Netfilter Developer Mailing List

On 29/11/10 13:39, Jozsef Kadlecsik wrote:
> On Mon, 29 Nov 2010, Pablo Neira Ayuso wrote:
> 
>> On 27/11/10 21:42, Jozsef Kadlecsik wrote:
>>> On Sat, 27 Nov 2010, Jan Engelhardt wrote:
>>>
>>>> On Saturday 2010-11-27 18:04, Jozsef Kadlecsik wrote:
>>>>>
>>>>> AFAIK when the kernel dumps and the skb is full, it's not returned
>>>>> directly to the userspace but first enqueued.
>>>>
>>>> I don't recognize that inside the code however.
>>>>
>>>> In netlink_dump(), there is the cb->dump call. There are no loops
>>>> inside this function. Neither are there in the two parents,
>>>> netlink_dump_start() and netlink_recvmsg().
>>>
>>> In netlink_dump() after the call to cb->dump, you can see the call to
>>> skb_queue_tail. So the message is queued.
>>>
>>> Where the looping happens, I do not know. Some socket magic?
>>
>> 1) you send a NLM_F_DUMP request.
>> 2) the kernel fills one skb and enqueue it into the socket buffer.
>> 3) the process invokes recvmsg(), it gets the datagram, then go back to step
>> 2).
>>
>> Thus, the dump only consumes 1 memory page per recv() invocation. That's the
>> magic.
> 
> So Jan has got right: if the process which initiated the dumping is 
> suspended and locking is used, then the suspended process locks out all 
> other processes.

We could also use an optimistic locking approach:

* We assume that there's an ID for every table.
* That ID is increased if you perform some modification in the rule-set
of that table.
* That ID has to be included as an attribute.
* If the ID changes in the middle of one dump, you restart the dump of
that table from the beginning.
* Once you start receiving information from a different table, you can
consider that the previous table has been fully dumped. For the last
table, you can take the NLM_F_DONE as trailing.

The user-space application has to keep the entries in a list until that
table has been fully dumped. If it notices that the ID has increased, it
releases the previous entries and fetches new ones.

This means that a netlink-based iptables-save command does not write
the entries to disk right away; instead it keeps the rules for that
table in the list until the dump is finished. Then it writes them to
disk (so we make sure there are no duplicated entries).

Optimistic approaches have one problem: if the rule-set is modified
often enough during the dump, the dump may keep restarting indefinitely.
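
A minimal sketch of the user-space bookkeeping this scheme would need,
assuming each dumped entry carries the table's generation ID as an
attribute. The struct and function names are hypothetical and not part
of the spec; attribute parsing and the entry list itself are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-table collector; none of these names come from the spec. */
struct table_dump {
	uint32_t gen_id;	/* generation ID seen on the first entry */
	size_t n_entries;	/* entries buffered so far */
	int started;		/* nonzero once the first entry arrived */
	unsigned int restarts;	/* how often the buffer was thrown away */
};

enum dump_action { DUMP_KEEP, DUMP_RESTART };

/*
 * Feed the generation ID of one dumped entry to the collector.
 * Returns DUMP_RESTART when the ID differs from the one the dump began
 * with: the caller should then free its buffered entries and request
 * the table again from the beginning.
 */
static enum dump_action table_dump_feed(struct table_dump *d, uint32_t gen_id)
{
	assert(d != NULL);
	if (!d->started) {
		d->started = 1;
		d->gen_id = gen_id;
		d->n_entries = 1;
		return DUMP_KEEP;
	}
	if (gen_id != d->gen_id) {
		/* Rule-set changed mid-dump: forget everything collected. */
		d->started = 0;
		d->n_entries = 0;
		d->restarts++;
		return DUMP_RESTART;
	}
	d->n_entries++;
	return DUMP_KEEP;
}
```

A caller could cap `restarts` to bound the "dumping indefinitely" case
mentioned above; what to do once the cap is hit is a policy decision
this sketch does not make.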

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26  8:25         ` Pablo Neira Ayuso
  2010-11-26 13:59           ` Jan Engelhardt
  2010-11-26 15:27           ` Jan Engelhardt
@ 2010-12-03 21:03           ` Jan Engelhardt
  2010-12-07  7:49             ` Pablo Neira Ayuso
  2 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-03 21:03 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List


On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
>
>> In fact, why don't we just use genetlink for new code instead?
>
>Genetlink is similar. The main difference is that the ID family number
>and multicast groups for each subsystem is not fixed, it's registered in
>runtime. This means that you have to make the "family name resolution",
>ie. to send a message to resolve the ID family number and multicast
>groups before doing any operation.
>
>Another reason is consistency, it's a good idea to use the mechanism
>that other netfilter subsystems already use.

"Look, iptables uses ioctl! Let's use ioctl again for xt2."

I am skeptical about shrinkfitting something onto an older
interface (nfnetlink) when there is genetlink..

>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the 
>>> following basis to rework it: one netlink message, one rule operation.
>> 
>> I can agree with that suggestion, so I will be doing that.

Something else that came to mind -- if ordering of nlattrs is not
guaranteed inside nlmsg, we could just pack all the data into a
single attribute and mark it binary, which means potential relays (if
nl ever gets that far!) won't reorder it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-03 21:03           ` Jan Engelhardt
@ 2010-12-07  7:49             ` Pablo Neira Ayuso
  2010-12-07 13:30               ` Jan Engelhardt
  0 siblings, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-07  7:49 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 03/12/10 22:03, Jan Engelhardt wrote:
>
> On Friday 2010-11-26 09:25, Pablo Neira Ayuso wrote:
>>
>>> In fact, why don't we just use genetlink for new code instead?
>>
>> Genetlink is similar. The main difference is that the ID family number
>> and multicast groups for each subsystem is not fixed, it's registered in
>> runtime. This means that you have to make the "family name resolution",
>> ie. to send a message to resolve the ID family number and multicast
>> groups before doing any operation.
>>
>> Another reason is consistency, it's a good idea to use the mechanism
>> that other netfilter subsystems already use.
>
> "Look, iptables uses ioctl! Let's use ioctl again for xt2."

It's up to you to use an interface from the stone age.

> I am skeptical about shrinkfitting something onto an older
> interface (nfnetlink) when there is genetlink..

That's an empty argument. Tell me one feature that nfnetlink does not 
have but genetlink does.

>>>> BTW, I didn't look at your protocol in deep yet but I'd suggest the
>>>> following basis to rework it: one netlink message, one rule operation.
>>>
>>> I can agree with that suggestion, so I will be doing that.
>
> Something else that came to mind -- if ordering of nlattrs is not
> guaranteed inside nlmsg, we could just pack all the data into a
> single attribute and mark it binary, which means potential relays (if
> nl ever gets that far!) won't reorder it.

That's an abuse of netlink.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-07  7:49             ` Pablo Neira Ayuso
@ 2010-12-07 13:30               ` Jan Engelhardt
  2010-12-08 11:36                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-07 13:30 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Netfilter Developer Mailing List

On Tuesday 2010-12-07 08:49, Pablo Neira Ayuso wrote:
>
>> I am skeptical about shrinkfitting something onto an older
>> interface (nfnetlink) when there is genetlink..
>
> That's an empty argument. Tell me one feature that nfnetlink does not have
> but genetlink does.

I was thinking about the name resolution of netlink subsubsystems. In
plain netlink and nfnetlink, subsys pointers are kept in a static
array, which means the more subsystems are defined, the more memory
is used (for n -> infinity) even if only one subsys is used.
Would that be a reasonable concern?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-07 13:30               ` Jan Engelhardt
@ 2010-12-08 11:36                 ` Pablo Neira Ayuso
  0 siblings, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-08 11:36 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Netfilter Developer Mailing List

On 07/12/10 14:30, Jan Engelhardt wrote:
> 
> On Tuesday 2010-12-07 08:49, Pablo Neira Ayuso wrote:
>>
>>> I am skeptical about shrinkfitting something onto an older
>>> interface (nfnetlink) when there is genetlink..
>>
>> That's an empty argument. Tell me one feature that nfnetlink does not have
>> but genetlink does.
> 
> I was thinking about the name resolution of netlink subsubsystems. In
> plain netlink and nfnetlink, subsys pointers are kept in a static
> array, which means the more subsystems are defined, the more memory
> is used (for n -> infinity) even if only one subsys is used.
> Would that be a reasonable concern?

The number of nfnetlink subsystems is small, and I guess it will remain
small over time. I would not spend time on such an optimization.

Anyway, the static subsys ID number is a good thing IMO.

The floating genetlink ID allows out-of-tree subsystems, which is
something that I don't like. Moreover, you have to send an initial
message to resolve the ID number and subscribe to possible changes in it.
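
For reference, the initial resolution message looks roughly like the
sketch below, which builds a CTRL_CMD_GETFAMILY request by hand using
only definitions from linux/netlink.h and linux/genetlink.h. The
function name and the version value are assumptions made for
illustration; sending the buffer and parsing the CTRL_ATTR_FAMILY_ID
reply are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Build a request asking the genetlink controller for the numeric ID of
 * the family called `name`. Returns the message length, or -1 if the
 * name is too long or buf is too small. buf must be aligned suitably
 * for struct nlmsghdr.
 */
static int genl_build_getfamily(void *buf, size_t buflen,
				const char *name, uint32_t seq)
{
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *genl;
	struct nlattr *nla;
	size_t namelen, total;

	assert(buf != NULL && name != NULL);
	namelen = strlen(name) + 1;	/* include the trailing NUL */
	if (namelen > GENL_NAMSIZ)
		return -1;
	total = NLMSG_HDRLEN + GENL_HDRLEN + NLA_HDRLEN + NLA_ALIGN(namelen);
	if (total > buflen)
		return -1;
	memset(buf, 0, total);

	nlh->nlmsg_len = total;
	nlh->nlmsg_type = GENL_ID_CTRL;	/* the controller's ID is fixed */
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	genl = (struct genlmsghdr *)((char *)buf + NLMSG_HDRLEN);
	genl->cmd = CTRL_CMD_GETFAMILY;
	genl->version = 1;		/* assumed value for this sketch */

	nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	nla->nla_type = CTRL_ATTR_FAMILY_NAME;
	nla->nla_len = NLA_HDRLEN + namelen;
	memcpy((char *)nla + NLA_HDRLEN, name, namelen);

	return (int)total;
}
```

With a static nfnetlink subsystem ID this round trip is not needed,
which is the difference being discussed.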

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 19:01 ` Jozsef Kadlecsik
@ 2010-12-09 12:08   ` Pablo Neira Ayuso
  2010-12-14  2:01     ` Jan Engelhardt
  2010-12-15  4:55   ` Jan Engelhardt
  1 sibling, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-09 12:08 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Jan Engelhardt, Netfilter Developer Mailing List, netfilter

On 26/11/10 20:01, Jozsef Kadlecsik wrote:
> Hi Jan,
> 
> On Wed, 24 Nov 2010, Jan Engelhardt wrote:
> 
>> By request of Pablo, I am posting the Xtables2 Netlink interface 
>> specification for review. Additionally, further documentation and 
>> toolchain around it is available through the temporary project page at
>>
>> 	http://jengelh.medozas.de/projects/xtables/
>>
>> which currently includes
>>
>>  * User Documentation Chapter 1: Architectural Differences
>>
>>  * Developer Documentation Part 1: Netlink interface (WIP)
>>    This is copied below to facilitate inline replies
>>
>>  * Runnable Linux source tree
>>
>>  * Runnable userspace library (libnetfilter_xtables)
>>    with small test-and-debug program
> [...]
> 
> Please add fine-grained error reporting to the protocol: in my opinion the 
> main shortcoming of the current kernel-userspace xtables protocol is the 
> lack of the proper error reporting. I mean, the new protocol should be 
> able to carry back which rule caused the error, in the rule whether it was 
> a general kind of error (ENOMEM), or a table, chain, match or target error 
> and exactly what was the error at table/chain/match/target level.
> 
> Say, the TCPMSS target should be able to report back that it cannot be 
> used outside of FORWARD, OUTPUT and POSTROUTING. Or that the rule must 
> match TCP SYN packets.

If we follow the one message per rule basis, you can put several
messages into one batch with different sequence numbers. Thus, you can
know what message in the batch has triggered the error and the reason.
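
A rough sketch of that idea, assuming one request per rule and a
contiguous range of sequence numbers for the batch. The message type
XT_EXAMPLE_RULE_ADD and both helper names are hypothetical; real
requests would carry attributes after the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical message type for "add one rule"; the spec does not define it. */
#define XT_EXAMPLE_RULE_ADD (NLMSG_MIN_TYPE + 1)

/*
 * Append one header-only request to the batch in buf at offset *off,
 * giving it its own sequence number. Returns 0, or -1 if buf is full.
 */
static int batch_append(void *buf, size_t buflen, size_t *off, uint32_t seq)
{
	size_t len = NLMSG_SPACE(0);
	struct nlmsghdr *nlh;

	assert(buf != NULL && off != NULL);
	if (*off > buflen || buflen - *off < len)
		return -1;
	nlh = (struct nlmsghdr *)((char *)buf + *off);
	memset(nlh, 0, len);
	nlh->nlmsg_len = NLMSG_LENGTH(0);
	nlh->nlmsg_type = XT_EXAMPLE_RULE_ADD;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	nlh->nlmsg_seq = seq;
	*off += len;
	return 0;
}

/*
 * Map an NLMSG_ERROR reply back to the position of the failed request
 * in the batch. Returns -1 for plain ACKs and for other message types.
 */
static int batch_failed_index(struct nlmsghdr *ack, uint32_t first_seq)
{
	struct nlmsgerr *err;

	assert(ack != NULL);
	if (ack->nlmsg_type != NLMSG_ERROR ||
	    ack->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
		return -1;
	err = NLMSG_DATA(ack);
	if (err->error == 0)
		return -1;	/* error 0 is an acknowledgement */
	return (int)(ack->nlmsg_seq - first_seq);
}
```

The negative errno in `err->error` then gives the reason, and the
index tells the tool which rule of the batch to blame.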

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-09 12:08   ` Pablo Neira Ayuso
@ 2010-12-14  2:01     ` Jan Engelhardt
  2010-12-14  2:16       ` James Nurmi
  2010-12-15 13:54       ` Pablo Neira Ayuso
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-14  2:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List, netfilter

On Thursday 2010-12-09 13:08, Pablo Neira Ayuso wrote:
>
>If we follow the one message per rule basis, you can put several
>messages into one batch with different sequence numbers. Thus, you can
>know what message in the batch has triggered the error and the reason.

/* The unwritten laws of netlink */

Normally, the sequence number of a response message is simply
the one from the request message. But in a dump where there
can be multiple messages, do they all share the sequence number?

Must the response sequence numbers match at all, or is it like TCP
where each side has its own set?

BTW, can response messages - all those leading up to NLMSG_DONE -
have different nlmsg_type, or not? Does the nlmsg_type need
to match the request type?

Jan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-14  2:01     ` Jan Engelhardt
@ 2010-12-14  2:16       ` James Nurmi
  2010-12-14  3:46         ` Jan Engelhardt
  2010-12-15 13:54       ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: James Nurmi @ 2010-12-14  2:16 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter

On Mon, Dec 13, 2010 at 6:01 PM, Jan Engelhardt <jengelh@medozas.de> wrote:
>
> On Thursday 2010-12-09 13:08, Pablo Neira Ayuso wrote:
>>
>>If we follow the one message per rule basis, you can put several
>>messages into one batch with different sequence numbers. Thus, you can
>>know what message in the batch has triggered the error and the reason.
>
> /* The unwritten laws of netlink */
>
> Normally, the sequence number of a response message is simply
> the one from the request message. But in a dump where there
> can be multiple messages, do they all share the sequence number?
>
> Must the response sequence numbers match at all, or is it like TCP
> where each side has its own set?
>
> BTW, can response messages - all those leading up to NLMSG_DONE -
> have different nlmsg_type, or not? Does the nlmsg_type need
> to match the request type?
>

There's an upper-layer netlink convention (NLM_F_MULTI, NLMSG_DONE,
etc.) that's usually used in NETLINK_ROUTE, etc.

Essentially, you set the NLM_F_MULTI flag on each part of a multi-part
message and terminate the set with a message of type NLMSG_DONE;
clients are expected to collect the parts and respond.

In those cases, the sequence number does NOT change intra-messageset
in my experience; instead it is assumed they may be received out of
order.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-14  2:16       ` James Nurmi
@ 2010-12-14  3:46         ` Jan Engelhardt
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-14  3:46 UTC (permalink / raw)
  To: James Nurmi
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter


On Tuesday 2010-12-14 03:16, James Nurmi wrote:
>On Mon, Dec 13, 2010 at 6:01 PM, Jan Engelhardt <jengelh@medozas.de> wrote:
>>
>> Normally, the sequence number of a response message is simply
>> the one from the request message. But in a dump where there
>> can be multiple messages, do they all share the sequence number?
>
>In those cases, the sequence number does NOT change intra-messageset
>in my experience, instead it is assumed they may be received out of
>order.

That would defeat the purpose (as seen from over here) of multipart
messages in the first place.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-11-26 19:01 ` Jozsef Kadlecsik
  2010-12-09 12:08   ` Pablo Neira Ayuso
@ 2010-12-15  4:55   ` Jan Engelhardt
  2010-12-15  8:51     ` Jozsef Kadlecsik
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-15  4:55 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Friday 2010-11-26 20:01, Jozsef Kadlecsik wrote:
>On Wed, 24 Nov 2010, Jan Engelhardt wrote:
>
>> By request of Pablo, I am posting the Xtables2 Netlink interface 
>> specification for review. Additionally, further documentation and 
>> toolchain around it is available through the temporary project page at
>> 
>> 	http://jengelh.medozas.de/projects/xtables/
>>  * Runnable userspace library (libnetfilter_xtables)
>>    with small test-and-debug program
>[...]
>
>Please add fine-grained error reporting to the protocol: in my opinion the 
>main shortcoming of the current kernel-userspace xtables protocol is the 
>lack of the proper error reporting. I mean, the new protocol should be 
>able to carry back which rule caused the error, in the rule whether it was 
>a general kind of error (ENOMEM), or a table, chain, match or target error 
>and exactly what was the error at table/chain/match/target level.

That should not be a problem. However, what do we do with the general
kind of error that is overly general? A.k.a. the dreaded EINVAL.

Say a user requested jumping to a chain, but did not give a chain name.
Normally that would be EINVAL, but EINVAL is overused already.

What would you like? Comprehensive error numbers (sort of like the
ones Windows is said to use) aka. NFXTE_NO_CHAIN_NAME_GIVEN,
or a human-readable string; or something else?

>Say, the TCPMSS target should be able to report back that it cannot be 
>used outside of FORWARD, OUTPUT and POSTROUTING.

NFXTE_HOOKMASK_NOT_ADHERED or string?

>Or that the rule must match TCP SYN packets.

TCPMSS doing a rule-search for -p tcp is pretty ugly (it must understand
the data structures, and that is sort of a backwards shot); Patrick
once suggested IIRC that TCPMSS should just silently skip non-SYNs.

Maybe both error numbers, and providing extensions with the
possibility to send strings? It is impossible to provision error
numbers for out-of-tree extensions, so having a way for an extension
to return some NFXTA_ERRSTR "One of my parameters is wrong!" seems
required at least.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-15  4:55   ` Jan Engelhardt
@ 2010-12-15  8:51     ` Jozsef Kadlecsik
  2010-12-16  9:57       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-12-15  8:51 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Wed, 15 Dec 2010, Jan Engelhardt wrote:

> On Friday 2010-11-26 20:01, Jozsef Kadlecsik wrote:
> >On Wed, 24 Nov 2010, Jan Engelhardt wrote:
> >
> >> By request of Pablo, I am posting the Xtables2 Netlink interface 
> >> specification for review. Additionally, further documentation and 
> >> toolchain around it is available through the temporary project page at
> >> 
> >> 	http://jengelh.medozas.de/projects/xtables/
> >>  * Runnable userspace library (libnetfilter_xtables)
> >>    with small test-and-debug program
> >[...]
> >
> >Please add fine-grained error reporting to the protocol: in my opinion the 
> >main shortcoming of the current kernel-userspace xtables protocol is the 
> >lack of the proper error reporting. I mean, the new protocol should be 
> >able to carry back which rule caused the error, in the rule whether it was 
> >a general kind of error (ENOMEM), or a table, chain, match or target error 
> >and exactly what was the error at table/chain/match/target level.
> 
> That should not be a problem. However, what do we do with the general
> kind of error that is overly general? A.k.a. the dreaded EINVAL.
> 
> Say a user requested jumping to a chain, but did not give a chain name.
> Normally that would be EINVAL, but EINVAL is overused already.

When I wrote general error I meant the ones where there is no point in 
(or no way of) specifying the nature of the error exactly. Like in the 
example, ENOMEM: it's needless to report which new data field could not be 
allocated. But if the specified chain does not exist, that must not be 
masked by a general EINVAL. The user must be alerted that the chain with 
the given name does not exist.

> What would you like? Comprehensive error numbers (sort of like the
> ones Windows is said to use) aka. NFXTE_NO_CHAIN_NAME_GIVEN,
> or a human-readable string; or something else?

Yes, use comprehensive error codes. And it's the responsibility of the 
userspace tool to translate them to proper error messages.

> >Say, the TCPMSS target should be able to report back that it cannot be 
> >used outside of FORWARD, OUTPUT and POSTROUTING.
> 
> NFXTE_HOOKMASK_NOT_ADHERED or string?

The former, i.e. error code.

> >Or that the rule must match TCP SYN packets.
> 
> TCPMSS doing a rule-search for -p tcp is pretty ugly (it must understand
> the data structures, and that is sort of a backwards shot); Patrick
> once suggested IIRC that TCPMSS should just silently skip non-SYNs.

For the clarity of the rules I'd prefer the current solution, i.e. check 
and enforce that the rule matches TCP SYN packets. If we relax it, next 
time someone could complain why TCPMSS is restricted to FORWARD, OUTPUT, 
POSTROUTING, why can't the system simply skip the target when called at 
non-appropriate hooks.

But I just picked the TCPMSS target as an example of errors which are 
currently expressed as EINVAL.

> Maybe both error numbers, and providing extensions with the
> possibility to send strings? It is impossible to provision error
> numbers for out-of-tree extensions, so having a way for an extension
> to return some NFXTA_ERRSTR "One of my parameters is wrong!" seems
> required at least.

I have got a three-level error coding in my mind: general, standard error 
codes (ENOMEM, EPERM, etc.), general netfilter specific ones (like 
NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific ones.

But I do realize that it's much easier (and therefore quite tempting) to 
construct the full error message in kernel space and just send it back.
However, that'd make it quite hard to support internationalization.
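
As a minimal sketch of how a userspace tool could translate such
three-level codes: the NFXTE_* names are the ones floated in this
thread, while their numeric values, the level enum and the function
name are purely hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical encoding of the three levels described above. */
enum nfxt_err_level {
	NFXT_LVL_ERRNO,		/* standard codes: ENOMEM, EPERM, ... */
	NFXT_LVL_NETFILTER,	/* generic netfilter-specific codes */
	NFXT_LVL_EXTENSION,	/* table/match/target specific codes */
};

/* Names from the discussion; the values are made up for this sketch. */
enum {
	NFXTE_NO_CHAIN_NAME_GIVEN = 1,
	NFXTE_HOOKMASK_NOT_ADHERED = 2,
};

struct nfxt_error {
	enum nfxt_err_level level;
	int code;
};

/*
 * Translate an error to English text. A real tool would pass the
 * result through its message catalog for internationalization.
 */
static const char *nfxt_strerror(const struct nfxt_error *e)
{
	assert(e != NULL);
	switch (e->level) {
	case NFXT_LVL_ERRNO:
		return strerror(e->code);
	case NFXT_LVL_NETFILTER:
		switch (e->code) {
		case NFXTE_NO_CHAIN_NAME_GIVEN:
			return "no chain name given";
		case NFXTE_HOOKMASK_NOT_ADHERED:
			return "extension used in a hook it does not support";
		}
		return "unknown netfilter error";
	case NFXT_LVL_EXTENSION:
		/* Would be delegated to the extension's userspace plugin. */
		return "extension-specific error";
	}
	return "unknown error level";
}
```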

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-14  2:01     ` Jan Engelhardt
  2010-12-14  2:16       ` James Nurmi
@ 2010-12-15 13:54       ` Pablo Neira Ayuso
  2010-12-16 14:05         ` Thomas Graf
  1 sibling, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-15 13:54 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List, netfilter

On 14/12/10 03:01, Jan Engelhardt wrote:
> 
> On Thursday 2010-12-09 13:08, Pablo Neira Ayuso wrote:
>>
>> If we follow the one message per rule basis, you can put several
>> messages into one batch with different sequence numbers. Thus, you can
>> know what message in the batch has triggered the error and the reason.
> 
> /* The unwritten laws of netlink */
>
> Normally, the sequence number of a response message is simply
> the one from the request message. But in a dump where there
> can be multiple messages, do they all share the sequence number?

Yes, because they are the result of one request. Responses use the
sequence number of the original request. Thus, you can identify what
messages come as part of what requests.

> Must the response sequence numbers match at all, or is it like TCP
> where each side has its own set?

They have to match.

> BTW, can response messages - all those leading up to NLMSG_DONE -
> have different nlmsg_type, or not?

They all have the same type.

> Does the nlmsg_type need to match the request type?

No.
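
Put together, a receiver could check one recv() buffer of a dump reply
roughly as follows. This is a sketch under the rules stated above (all
parts carry the request's sequence number, NLMSG_DONE ends the set);
the function name is hypothetical and attribute parsing is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/netlink.h>

/*
 * Walk the messages in one received buffer. Returns the number of data
 * parts seen, or -1 if a message has the wrong sequence number or is
 * an error report. Sets *done to 1 when NLMSG_DONE is reached. The
 * nlmsg_type of the parts is not compared with the request type.
 */
static int dump_scan(struct nlmsghdr *nlh, int len, uint32_t seq, int *done)
{
	int parts = 0;

	assert(nlh != NULL && done != NULL);
	for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_seq != seq)
			return -1;	/* does not belong to our request */
		if (nlh->nlmsg_type == NLMSG_ERROR)
			return -1;
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;
			break;
		}
		parts++;
	}
	return parts;
}
```

If `*done` is still 0 after the call, the caller invokes recv() again
and scans the next buffer with the same sequence number.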

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Xtables2 Netlink spec
  2010-12-15  8:51     ` Jozsef Kadlecsik
@ 2010-12-16  9:57       ` Jesper Dangaard Brouer
  2010-12-16 12:51         ` Error reporting in Netlink (Re: Xtables2 Netlink spec) Jan Engelhardt
  0 siblings, 1 reply; 55+ messages in thread
From: Jesper Dangaard Brouer @ 2010-12-16  9:57 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Jan Engelhardt, Netfilter Developer Mailing List, netfilter,
	Pablo Neira Ayuso, tgraf


Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
ideas on netlink error-codes and strings during NetConf 2010, see:

  http://vger.kernel.org/netconf2010.html
  http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp


On Wed, 15 Dec 2010, Jozsef Kadlecsik wrote:
> On Wed, 15 Dec 2010, Jan Engelhardt wrote:
>> On Friday 2010-11-26 20:01, Jozsef Kadlecsik wrote:
>>> On Wed, 24 Nov 2010, Jan Engelhardt wrote:
>>>
>>>> By request of Pablo, I am posting the Xtables2 Netlink interface
>>>> specification for review. Additionally, further documentation and
>>>> toolchain around it is available through the temporary project page at
>>>>
>>>> 	http://jengelh.medozas.de/projects/xtables/
>>>>  * Runnable userspace library (libnetfilter_xtables)
>>>>    with small test-and-debug program
>>> [...]
>>>
>>> Please add fine-grained error reporting to the protocol: in my opinion the
>>> main shortcoming of the current kernel-userspace xtables protocol is the
>>> lack of the proper error reporting. I mean, the new protocol should be
>>> able to carry back which rule caused the error, in the rule whether it was
>>> a general kind of error (ENOMEM), or a table, chain, match or target error
>>> and exactly what was the error at table/chain/match/target level.
>>
>> That should not be a problem. However, what do we do with the general
>> kind of error that is overly general? A.k.a. the dreaded EINVAL.
>>
>> Say a user requested jumping to a chain, but did not give a chain name.
>> Normally that would be EINVAL, but EINVAL is overused already.
>
> When I wrote general error I meant the ones where there is no point (or
> cannot be) to specify the nature of the error exactly. Like in the
> example, ENOMEM: it's needless to report which new data field could not be
> allocated. But if the specified chain does not exist, that must not be
> masked by a general EINVAL. The user must be alerted that the chain with
> the given name does not exist.
>
>> What would you like? Comprehensive error numbers (sort of like the
>> ones Windows is said to use) aka. NFXTE_NO_CHAIN_NAME_GIVEN,
>> or a human-readable string; or something else?
>
> Yes, use comprehensive error codes. And it's the responsibility of the
> userspace tool to translate them to proper error messages.
>
>>> Say, the TCPMSS target should be able to report back that it cannot be
>>> used outside of FORWARD, OUTPUT and POSTROUTING.
>>
>> NFXTE_HOOKMASK_NOT_ADHERED or string?
>
> The former, i.e. error code.
>
>>> Or that the rule must match TCP SYN packets.
>>
>> TCPMSS doing a rule-search for -p tcp is pretty ugly (it must understand
>> the data structures, and that is sort of a backwards shot); Patrick
>> once suggested IIRC that TCPMSS should just silently skip non-SYNs.
>
> For the clarity of the rules I'd prefer the current solution, i.e. check
> and enforce that the rule matches TCP SYN packets. If we relax it, next
> time someone could complain why TCPMSS is restricted to FORWARD, OUTPUT,
> POSTROUTING, why can't the system simply skip the target when called at
> non-appropriate hooks.
>
> But I just picked TCPMSS target for errors which currently expressed
> in EINVAL.
>
>> Maybe both error numbers, and providing extensions with the
>> possibility to send strings? It is impossible to provision error
>> numbers for out-of-tree extensions, so having a way for an extension
>> to return some NFXTA_ERRSTR "One of my parameters is wrong!" seems
>> required at least.

I like this.

> I have got a three-level error coding in my mind: general, standard error
> codes (ENOMEM, EPERM, etc.), general netfilter specific ones (like
> NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific ones.
>
> But I do realize that it's much easier (and therefore quite tempting) to
> construct the full error message in kernel space and just send it back.
> However, that'd make quite hard to support internationalization.

To support internationalization, we could just add an error-number-code 
in front of the constructed error message?

I'm a fan of the full error message system from the kernel, because it's 
much easier to maintain, as we don't need to update iptables userspace 
each time we introduce a new error code/message.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16  9:57       ` Jesper Dangaard Brouer
@ 2010-12-16 12:51         ` Jan Engelhardt
  2010-12-16 13:43           ` Thomas Graf
  2010-12-16 23:23           ` Patrick McHardy
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-16 12:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List, netfilter,
	Pablo Neira Ayuso, tgraf

On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:

>Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
>ideas on netlink error-codes and strings during NetConf 2010, see:
>
>http://vger.kernel.org/netconf2010.html 
>http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp

The idea of appending an error string is OK for Netlink as a protocol
(specification-wise), but the size constraints of the skbuffs in
Linux may make its practical implementation a little harder. "Half of
the packet" is already used for the original request message, and
cramming in an extra error string may bust the space.
It also does not look very netlinky to not use nlattrs ;-)

On Wed, 15 Dec 2010, Jozsef Kadlecsik came about with:
>
>>I have got a three-level error coding in my mind: general, standard 
>>error codes (ENOMEM, EPERM, etc.), general netfilter specific ones 
>>(like NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific 
>>ones.
>>
>>But I do realize that it's much easier (and therefore quite tempting) 
>>to construct the full error message in kernel space and just send it 
>>back. However, that'd make quite hard to support internationalization.

It's not like those strings change all that much.

>To support internationalization, we could just add an error-number-code 
>in front of the constructed error message?

Buying us what? If you change the string, but the gettext lookup is 
based upon a number, you will get an outdated translation, which is not 
nice either. IMHO: Better an English message than an inaccurate message.


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 12:51         ` Error reporting in Netlink (Re: Xtables2 Netlink spec) Jan Engelhardt
@ 2010-12-16 13:43           ` Thomas Graf
  2010-12-16 13:51             ` Jan Engelhardt
                               ` (2 more replies)
  2010-12-16 23:23           ` Patrick McHardy
  1 sibling, 3 replies; 55+ messages in thread
From: Thomas Graf @ 2010-12-16 13:43 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Thu, 2010-12-16 at 13:51 +0100, Jan Engelhardt wrote: 
> On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:
> 
> >Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
> >ideas on netlink error-codes and strings during NetConf 2010, see:
> >
> >http://vger.kernel.org/netconf2010.html 
> >http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp
> 
> The idea is appending an error string is ok for Netlink as a protocol
> (specification-wise), but the size constraints of the skbuffs in the
> Linux may make its practical implementation a little harder. "Half of
> the packet" is already used for the original request message, and
> cramming an extra error string may bust the space.
> It also does not look very netlinky to not use nlattrs ;-)

Why not use nlattr to encode the error string? It would make error
messages easier to extend in the future. At some point we might want
to add an offset field which points into the original netlink message
describing the attribute which caused the failure.
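
A minimal sketch of what such an attribute-encoded error report could look like, assuming two hypothetical attribute types (XERR_MSG for the string, XERR_OFFSET for the byte offset into the original request). These names and the helper functions are illustrative only and are not part of any existing Netlink header. The parser skips attribute types it does not know, which is what makes the format extensible later on.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical attribute types for an extended error report. */
enum { XERR_UNSPEC, XERR_MSG, XERR_OFFSET };

/* Append one attribute at buf + off; returns the new offset, or -1 if it
 * does not fit into cap bytes. */
static int xerr_put(char *buf, int off, int cap, uint16_t type,
		    const void *data, uint16_t len)
{
	struct nlattr nla;
	int total = NLA_ALIGN(NLA_HDRLEN + len);

	if (off < 0 || off + total > cap)
		return -1;
	nla.nla_len = NLA_HDRLEN + len;
	nla.nla_type = type;
	memset(buf + off, 0, total);		/* zero the padding bytes too */
	memcpy(buf + off, &nla, sizeof(nla));
	memcpy(buf + off + NLA_HDRLEN, data, len);
	return off + total;
}

struct xerr_report {
	const char *msg;	/* NUL-terminated string, or NULL if absent */
	uint32_t offset;	/* offset of the offending attribute */
	int has_offset;
};

/* Walk the attribute stream; unknown attribute types are skipped. */
static void xerr_parse(const char *buf, int len, struct xerr_report *r)
{
	int off = 0;

	memset(r, 0, sizeof(*r));
	while (len - off >= NLA_HDRLEN) {
		struct nlattr nla;
		const char *payload = buf + off + NLA_HDRLEN;
		int plen;

		memcpy(&nla, buf + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > len - off)
			break;			/* malformed attribute */
		plen = nla.nla_len - NLA_HDRLEN;
		if (nla.nla_type == XERR_MSG && plen > 0 &&
		    payload[plen - 1] == '\0') {
			r->msg = payload;
		} else if (nla.nla_type == XERR_OFFSET &&
			   plen == (int)sizeof(uint32_t)) {
			memcpy(&r->offset, payload, sizeof(uint32_t));
			r->has_offset = 1;
		}
		off += NLA_ALIGN(nla.nla_len);
	}
}
```

A sender would call xerr_put() once per attribute after the struct nlmsgerr; a receiver that only understands the errno can ignore the attributes entirely.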

> On Wed, 15 Dec 2010, Jozsef Kadlecsik came about with:
> >
> >>I have got a three-level error coding in my mind: general, standard 
> >>error codes (ENOMEM, EPERM, etc.), general netfilter specific ones 
> >>(like NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific 
> >>ones.
> >>
> >>But I do realize that it's much easier (and therefore quite tempting) 
> >>to construct the full error message in kernel space and just send it 
> >>back. However, that'd make quite hard to support internationalization.

Thinking of netlink protocols in general and netfilter in particular,
maintaining a list of reserved error codes for each subsystem/target/
module will result in an unbearable pain if the error codes are not
separated into individual namespaces for each module.

That would in turn require each module to define a unique number or
some form of namespace resolution mechanism which does not help to keep
things simple.

This is the main reason why I advocate the use of error strings.

> It's not like those strings change all that much.
> 
> 
> >To support internationalization, we could just add an error-number-code 
> >in front of the constructed error message?
> 
> Buying us what? If you change the string, but the gettext lookup is 
> based upon a number, you will get an outdated translation, which is not 
> nice either. IMHO: Better an English message than an inaccurate message.

Do we *really* need internationalization for error messages on this
level? Primarily userspace should be in charge of checking for all kinds
of erroneous user input and produce meaningful context based,
translatable error messages. Error messages produced by the kernel
should be the exception and not a substitute for proper userspace error
handling.


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 13:43           ` Thomas Graf
@ 2010-12-16 13:51             ` Jan Engelhardt
  2010-12-16 14:19               ` Thomas Graf
  2010-12-16 14:47             ` Jozsef Kadlecsik
  2010-12-16 23:31             ` Patrick McHardy
  2 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-16 13:51 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso


On Thursday 2010-12-16 14:43, Thomas Graf wrote:
>
>>It also does not look very netlinky to not use nlattrs
>
>Why not use nlattr to encode the error string? It would make error
>messages easier to extend in the future. At some point we might want
>to add an offset field which points into the original netlink
>message describing the attribute which caused the failure.

Is that a yes or a no?


>>>>But I do realize that it's much easier (and therefore quite
>>>>tempting) to construct the full error message in kernel space and
>>>>just send it back. However, that'd make quite hard to support
>>>>internationalization.
>
>Thinking of netlink protocols in general and netfilter in specific,
>maintaining a list of reserved error codes for each
>subsystem/target/ module will result in an unbearable pain if the
>error codes are not separated into individual namespaces for each
>module.

I did not plan on giving extensions a numeric namespace; here it's
largely strings only.


>> It's not like those strings change all that much.
>> 
>> >To support internationalization, we could just add an error-number-code 
>> >in front of the constructed error message?
>> 
>> Buying us what? If you change the string, but the gettext lookup is 
>> based upon a number, you will get an outdated translation, which is not 
>> nice either. IMHO: Better an English message than an inaccurate message.
>
>Do we *really* need internationalization for error messages on this
>level? Primarily userspace should be in charge of checking for all kinds
>of erroneous user input and produce meaningful context based,
>translatable error messages.

Let's see, why does iproute2 not just do that, then? Because some
things only the kernel knows about, so even if the parameters are
correct from the userspace point of view, the kernel may reject the
request nevertheless.


* Re: Xtables2 Netlink spec
  2010-12-15 13:54       ` Pablo Neira Ayuso
@ 2010-12-16 14:05         ` Thomas Graf
  2010-12-16 14:22           ` Jan Engelhardt
  2010-12-17  9:55           ` Pablo Neira Ayuso
  0 siblings, 2 replies; 55+ messages in thread
From: Thomas Graf @ 2010-12-16 14:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jan Engelhardt, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter

On Wed, Dec 15, 2010 at 02:54:26PM +0100, Pablo Neira Ayuso wrote:
> > BTW, can response messages - all those leading up to NLMSG_DONE -
> > have different nlmsg_type, or not?
> 
> They all have the same type.

This is not a MUST. It is perfectly legal to do, for example:

 -> FOO_GET (seq=1, NLM_F_REQUEST)
 <- FOO_DEL (seq=1, NLM_F_MULTI)
 <- FOO_ADD (seq=1, NLM_F_MULTI)
 <- NLMSG_DONE (seq=1)
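
A receiver that copes with the exchange above can be sketched as follows, assuming the hypothetical FOO_* message types from the example (they do not exist in any header). The loop never looks at nlmsg_type to decide when to stop; it only stops at NLMSG_DONE, so the parts of the reply are free to carry different types. foo_put() is a demo helper for building header-only messages.

```c
#include <string.h>
#include <linux/netlink.h>

/* Hypothetical message types for the FOO protocol in the example above. */
enum { FOO_GET = NLMSG_MIN_TYPE, FOO_ADD, FOO_DEL };

/* Append one header-only message to buf at *off (demo helper). */
static void foo_put(char *buf, int *off, unsigned short type,
		    unsigned short flags, unsigned int seq)
{
	struct nlmsghdr nlh;

	memset(&nlh, 0, sizeof(nlh));
	nlh.nlmsg_len = NLMSG_LENGTH(0);
	nlh.nlmsg_type = type;
	nlh.nlmsg_flags = flags;
	nlh.nlmsg_seq = seq;
	memcpy(buf + *off, &nlh, sizeof(nlh));
	*off += NLMSG_ALIGN(nlh.nlmsg_len);
}

/*
 * Walk a receive buffer holding one multipart reply. The type of each
 * payload message is stored in types[] (at most max entries are written).
 * Returns the number of payload messages seen before NLMSG_DONE, or -1 if
 * the buffer ends without NLMSG_DONE (more data must be received).
 */
static int foo_walk_multipart(void *buf, int len, unsigned short *types,
			      int max)
{
	struct nlmsghdr *nlh = buf;
	int n = 0;

	for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE)
			return n;
		if (n < max)
			types[n] = nlh->nlmsg_type;
		n++;
	}
	return -1;
}
```

In a real program the buffer would come from recv() on the Netlink socket, and a return value of -1 would mean "call recv() again and keep parsing".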


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 13:51             ` Jan Engelhardt
@ 2010-12-16 14:19               ` Thomas Graf
  2010-12-17 10:00                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Thomas Graf @ 2010-12-16 14:19 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Thomas Graf, Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Thu, Dec 16, 2010 at 02:51:07PM +0100, Jan Engelhardt wrote:
> >Why not use nlattr to encode the error string? It would make error
> >messages easier to extend in the future. At some point we might want
> >to add an offset field which points into the original netlink
> >message describing the attribute which caused the failure.
> 
> Is that a yes or a no?

The proposed solution at netconf involved appending the error string
directly. Inspired by your comment, I realized that encoding the
error string as an nlattr, which allows for additional attributes,
would be a better implementation.

As for size limitations, even though most netlink protocols do it, I
don't see the point in appending the whole request message in an error
message. The header would be completely sufficient for all request/reply
based protocols. It is no problem for userspace to keep a copy of the
last request sent.

> Let's see, why does iproute2 not just do that, then? Because some
> things only the kernel knows about, so even if the parameters are
> correct from the userspace side of view, the kernel may reject the
> request nevertheless.

You are not the first to come up with this but it is still a pretty
lazy excuse. iproute2 could do a lot better than it does today and
it has been improved a lot over time. The main reason for the current
situation is the atomic nature of routing netlink request handlers.
Checking for errors in an atomic context where no interface disappears
and no route can be removed while the request is being processed simply
makes checking for errors a lot easier.

This does not mean that userspace should have a carte blanche for
sending bogus combinations of netlink attributes and expect the kernel
to always come up with a perfect verbose error message.

The kernel can still send ENODEV to indicate the specified network
device does not exist but it should only cover the case where the
interface disappeared between the userspace checking for its existence
and the request being processed.


* Re: Xtables2 Netlink spec
  2010-12-16 14:05         ` Thomas Graf
@ 2010-12-16 14:22           ` Jan Engelhardt
  2010-12-17  7:25             ` Thomas Graf
  2010-12-17  9:55           ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-16 14:22 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter

On Thursday 2010-12-16 15:05, Thomas Graf wrote:

>On Wed, Dec 15, 2010 at 02:54:26PM +0100, Pablo Neira Ayuso wrote:
>> > BTW, can response messages - all those leading up to NLMSG_DONE -
>> > have different nlmsg_type, or not?
>> 
>> They all have the same type.
>
>This is not a MUST. It is perfectly legal to f.e.:
>
> -> FOO_GET (seq=1, NLM_F_REQUEST)
> <- FOO_DEL (seq=1, NLM_F_MULTI)
> <- FOO_ADD (seq=1, NLM_F_MULTI)
> <- NLMSG_DONE (seq=1)

Oh great, now the confusion is complete. One person says this, another 
says something else. Best of all, the Netlink RFC leaves it unspecified, 
so it's all hearsay, beliefs and Perl5-style ("Source acts as normative 
reference") referencing. I guess we are doomed until the original 
Netlink3549 authors step up and tell us their intentions.

As I see it, we need a discussion to specify what is to be done with 
unspecified parts, with 3549 as an origin.


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 13:43           ` Thomas Graf
  2010-12-16 13:51             ` Jan Engelhardt
@ 2010-12-16 14:47             ` Jozsef Kadlecsik
  2010-12-16 15:09               ` Jan Engelhardt
  2010-12-16 23:31             ` Patrick McHardy
  2 siblings, 1 reply; 55+ messages in thread
From: Jozsef Kadlecsik @ 2010-12-16 14:47 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jan Engelhardt, Jesper Dangaard Brouer,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Thu, 16 Dec 2010, Thomas Graf wrote:

> On Thu, 2010-12-16 at 13:51 +0100, Jan Engelhardt wrote: 
> > On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:
> > 
> > >Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
> > >ideas on netlink error-codes and strings during NetConf 2010, see:
> > >
> > >http://vger.kernel.org/netconf2010.html 
> > >http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp
> > 
> > The idea is appending an error string is ok for Netlink as a protocol
> > (specification-wise), but the size constraints of the skbuffs in the
> > Linux may make its practical implementation a little harder. "Half of
> > the packet" is already used for the original request message, and
> > cramming an extra error string may bust the space.
> > It also does not look very netlinky to not use nlattrs ;-)
> 
> Why not use nlattr to encode the error string? It would make error
> messages easier to extend in the future. At some point we might want
> to add an offset field which points into the original netlink message
> describing the attribute which caused the failure.

I use full netlink messages with respect to the buffers, taking into account 
the size of struct nlmsgerr in case of an error. If there was an error 
string, how much space should be reserved for that?

> > On Wed, 15 Dec 2010, Jozsef Kadlecsik came about with:
> > >
> > >>I have got a three-level error coding in my mind: general, standard 
> > >>error codes (ENOMEM, EPERM, etc.), general netfilter specific ones 
> > >>(like NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific 
> > >>ones.
> > >>
> > >>But I do realize that it's much easier (and therefore quite tempting) 
> > >>to construct the full error message in kernel space and just send it 
> > >>back. However, that'd make quite hard to support internationalization.
> 
> Thinking of netlink protocols in general and netfilter in specific,
> maintaining a list of reserved error codes for each subsystem/target/
> module will result in an unbearable pain if the error codes are not
> separated into individual namespaces for each module.
> 
> That would in turn require each module to define a unique number or
> some form of namespace resolution mechanism which does not help to keep
> things simple.
> 
> This is the main reason why I advocate the use of error strings.

But why do error codes look complicated? Netlink already supports them and 
it's simple to separate them:
	errcode < 256	generic errors
256 <=	errcode < 512	generic netfilter specific errors
512 <=	errcode		table/match/module specific errors

An error pointer is also required, pointing to the table/match/module
that triggered the error.
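
The split above can be sketched as a small classifier. The enum and function names are hypothetical and only mirror the three proposed ranges; nothing like them exists in the kernel headers.

```c
/* Hypothetical classes mirroring the three proposed error-code ranges. */
enum xt_err_class {
	XT_ERR_GENERIC,		/* errcode < 256: ENOMEM, EPERM, ...       */
	XT_ERR_NETFILTER,	/* 256 <= errcode < 512: netfilter-wide    */
	XT_ERR_EXTENSION	/* 512 <= errcode: table/match/module      */
};

/* Map a non-negative error code to the range it falls into. */
static enum xt_err_class xt_err_classify(unsigned int errcode)
{
	if (errcode < 256)
		return XT_ERR_GENERIC;
	if (errcode < 512)
		return XT_ERR_NETFILTER;
	return XT_ERR_EXTENSION;
}
```

For codes in the last class the number alone is ambiguous, which is why the accompanying pointer to the originating table/match/module is needed to interpret it.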
 
> > It's not like those strings change all that much.
> > 
> > 
> > >To support internationalization, we could just add an error-number-code 
> > >in front of the constructed error message?
> > 
> > Buying us what? If you change the string, but the gettext lookup is 
> > based upon a number, you will get an outdated translation, which is not 
> > nice either. IMHO: Better an English message than an inaccurate message.
> 
> Do we *really* need internationalization for error messages on this
> level? Primarily userspace should be in charge of checking for all kinds
> of erroneous user input and produce meaningful context based,
> translatable error messages. Error messages produced by the kernel
> should be the exception and not a substitute for proper userspace error
> handling.

Netlink based protocol is the path to internationalization. Netlink based 
protocol leads to usable library. Library leads to gui. Gui leads to 
internationalization. ;-)

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@mail.kfki.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 14:47             ` Jozsef Kadlecsik
@ 2010-12-16 15:09               ` Jan Engelhardt
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-16 15:09 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Thomas Graf, Jesper Dangaard Brouer,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso


On Thursday 2010-12-16 15:47, Jozsef Kadlecsik wrote:
>> 
>> Thinking of netlink protocols in general and netfilter in specific,
>> maintaining a list of reserved error codes for each subsystem/target/
>> module will result in an unbearable pain if the error codes are not
>> separated into individual namespaces for each module.
>> 
>> That would in turn require each module to define a unique number or
>> some form of namespace resolution mechanism which does not help to keep
>> things simple.
>> 
>> This is the main reason why I advocate the use of error strings.
>
>But why error codes looks complicated? Netlink already supports it and 
>it's simple to separate them:
>	errcode < 256	generic errors
>256 <=	errcode < 512	generic netfilter specific errors
>512 <=	errcode		table/match/module specific errors

Actually, <4096 is reserved for the generic system errors (Exxx).

The specific issue here however is that you cannot easily delegate
the space >=512 to modules, so it's probably best if we don't try.

>>Do we *really* need internationalization for error messages on this
>>level? Primarily userspace should be in charge of checking for all kinds
>>of erroneous user input and produce meaningful context based,
>>translatable error messages. Error messages produced by the kernel
>>should be the exception and not a substitute for proper userspace error
>>handling.
>
>Netlink based protocol is the path to internationalization. Netlink based 
>protol leads to usable library. Library leads to gui. Gui leads to 
>internationalization. ;-)

On the contrary:
"Users lead to guis, guis lead to requests, requests lead to libraries,
libraries lead to netlink.
Doesn't make sense? Ask SUN!"


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 12:51         ` Error reporting in Netlink (Re: Xtables2 Netlink spec) Jan Engelhardt
  2010-12-16 13:43           ` Thomas Graf
@ 2010-12-16 23:23           ` Patrick McHardy
  2010-12-17 10:02             ` Pablo Neira Ayuso
  1 sibling, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2010-12-16 23:23 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso,
	tgraf

Am 16.12.2010 13:51, schrieb Jan Engelhardt:
> On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:
> 
>> Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
>> ideas on netlink error-codes and strings during NetConf 2010, see:
>>
>> http://vger.kernel.org/netconf2010.html 
>> http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp
> 
> The idea is appending an error string is ok for Netlink as a protocol
> (specification-wise), but the size constraints of the skbuffs in the
> Linux may make its practical implementation a little harder. "Half of
> the packet" is already used for the original request message, and
> cramming an extra error string may bust the space.
> It also does not look very netlinky to not use nlattrs ;-)

I agree, error strings don't look like a viable solution to me,
they are basically impossible to interpret by an application,
you run into localization issues and so on.

I'd suggest to return an errno value and the attribute causing
the error, possibly also the value (we append the original message
anyways, but in case of lists it might be hard to locate the specific
attribute). The harder cases are when a combination of multiple
attributes are responsible for the error, but still, the application
has to understand the kernel interpretation anyways, so I'd simply
return the errno and all attributes responsible. Leave interpretation
up to userspace.

> On Wed, 15 Dec 2010, Jozsef Kadlecsik came about with:
>>
>>> I have got a three-level error coding in my mind: general, standard 
>>> error codes (ENOMEM, EPERM, etc.), general netfilter specific ones 
>>> (like NFXTE_HOOKMASK_NOT_ADHERED) and table/match/target specific 
>>> ones.
>>>
>>> But I do realize that it's much easier (and therefore quite tempting) 
>>> to construct the full error message in kernel space and just send it 
>>> back. However, that'd make quite hard to support internationalization.
> 
> It's not like those strings change all that much.
> 
> 
>> To support internationalization, we could just add an error-number-code 
>> in front of the constructed error message?
> 
> Buying us what? If you change the string, but the gettext lookup is 
> based upon a number, you will get an outdated translation, which is not 
> nice either. IMHO: Better an English message than an inaccurate message.

Forget about strings, any error returned by the kernel should not only
be suitable for interpretation by a human, but also by an application.


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 13:43           ` Thomas Graf
  2010-12-16 13:51             ` Jan Engelhardt
  2010-12-16 14:47             ` Jozsef Kadlecsik
@ 2010-12-16 23:31             ` Patrick McHardy
  2010-12-17  6:58               ` Thomas Graf
  2 siblings, 1 reply; 55+ messages in thread
From: Patrick McHardy @ 2010-12-16 23:31 UTC (permalink / raw)
  To: tgraf
  Cc: Jan Engelhardt, Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

Am 16.12.2010 14:43, schrieb Thomas Graf:
> Thinking of netlink protocols in general and netfilter in specific,
> maintaining a list of reserved error codes for each subsystem/target/
> module will result in an unbearable pain if the error codes are not
> separated into individual namespaces for each module.
> 
> That would in turn require each module to define a unique number or
> some form of namespace resolution mechanism which does not help to keep
> things simple.
> 
> This is the main reason why I advocate the use of error strings.

I completely disagree. As I said previously, userspace has to have
knowledge of the kernel interpretation anyways.

We already have libc calls which define complex errors like:

strtod(): if val == HUGE_VAL && errno == ERANGE: positive overflow
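
The libc convention referred to here is that of strtod(), which reports overflow through the combination of its return value and errno. A minimal sketch (the wrapper name parse_overflow is made up for illustration):

```c
#include <errno.h>
#include <math.h>
#include <stdlib.h>

/*
 * Parse s as a double into *out. Returns 1 on positive overflow,
 * -1 on negative overflow and 0 otherwise. Neither the return value
 * of strtod() nor errno is conclusive on its own; only the
 * combination of the two identifies the overflow case.
 */
static int parse_overflow(const char *s, double *out)
{
	errno = 0;
	*out = strtod(s, NULL);
	if (errno == ERANGE && *out == HUGE_VAL)
		return 1;
	if (errno == ERANGE && *out == -HUGE_VAL)
		return -1;
	return 0;
}
```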

I see no reason why we can't define combination of attributes
and errno values for netlink messages.

Something like:

[IFLA_VLAN_ID] == NULL && errno == EINVAL: missing attribute
[IFLA_VLAN_LINK] && errno == ENODEV: lower link does not exist

and so on.
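
Such combinations could be documented, or even shipped in a library, as a small lookup table. The sketch below uses made-up attribute constants (DEMO_VLAN_ID, DEMO_VLAN_LINK) standing in for the two attributes named above, and assumes the kernel reports back both the errno and the attribute concerned.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute numbers standing in for the VLAN attributes. */
enum { DEMO_VLAN_ID = 1, DEMO_VLAN_LINK = 2 };

struct err_rule {
	int attr;		/* attribute the error refers to       */
	int present;		/* was the attribute in the request?   */
	int err;		/* errno value returned by the kernel  */
	const char *meaning;	/* interpretation, owned by userspace  */
};

static const struct err_rule err_rules[] = {
	{ DEMO_VLAN_ID,   0, EINVAL, "missing attribute" },
	{ DEMO_VLAN_LINK, 1, ENODEV, "lower link does not exist" },
};

/* Look up the meaning of an (attribute, presence, errno) combination. */
static const char *err_meaning(int attr, int present, int err)
{
	size_t i;

	for (i = 0; i < sizeof(err_rules) / sizeof(err_rules[0]); i++)
		if (err_rules[i].attr == attr &&
		    err_rules[i].present == present &&
		    err_rules[i].err == err)
			return err_rules[i].meaning;
	return NULL;	/* unknown combination: fall back to strerror() */
}
```

Because the strings live in userspace, they can be translated or reworded without touching the kernel.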



* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 23:31             ` Patrick McHardy
@ 2010-12-17  6:58               ` Thomas Graf
  0 siblings, 0 replies; 55+ messages in thread
From: Thomas Graf @ 2010-12-17  6:58 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jan Engelhardt, Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, Pablo Neira Ayuso

On Fri, 2010-12-17 at 00:31 +0100, Patrick McHardy wrote: 
> I completely disagree. As I said previously, userspace has to have
> knowledge of the kernel interpretation anyways.
> 
> We already have libc calls which define complex errors like:
> 
> stdtod(): if val == HUGE_VAL && errno == ERANGE: positive overflow
> 
> I see no reason why we can't define combination of attributes
> and errno values for netlink messages.
> 
> Something like:
> 
> [IFLA_VLAN_ID] == NULL && errno == EINVAL: missing attribute
> [IFLA_VLAN_LINK] && errno == ENODEV: lower link does not exist
> 
> and so on.

Using strings would still involve including an errno as we currently
do and as I pointed out in my other mail I am also very positive
about including an offset pointer or an attribute type to specify the
attribute that caused the failure.

Maybe I am thinking too much about other netlink protocols where
errors often occur due to complicated combinations of missing attributes
and specific attribute values having special meanings. Those cases
would benefit a lot from error strings.

I really don't want to see:

return nl_errstring(ENOMEM, "Out of memory");

I am aiming at a verbose error or status string which acts as an
additional helping point in case of complicated errors. I really
don't want to replace the method of using errno to report errors
in general.

I thought about using attrs to specify the error and the reason I
did choose strings was that when we introduce new errors in the
kernel we have to update all applications. Which is no problem if
everyone uses libraries, so it is probably trivial for netfilter but
will be less trivial for netlink users as a group.

That said, I completely understand your point of view as well.


* Re: Xtables2 Netlink spec
  2010-12-16 14:22           ` Jan Engelhardt
@ 2010-12-17  7:25             ` Thomas Graf
  2010-12-17  9:35               ` Jan Engelhardt
  0 siblings, 1 reply; 55+ messages in thread
From: Thomas Graf @ 2010-12-17  7:25 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter

On Thu, Dec 16, 2010 at 03:22:07PM +0100, Jan Engelhardt wrote:
> On Thursday 2010-12-16 15:05, Thomas Graf wrote:
> >
> > -> FOO_GET (seq=1, NLM_F_REQUEST)
> > <- FOO_DEL (seq=1, NLM_F_MULTI)
> > <- FOO_ADD (seq=1, NLM_F_MULTI)
> > <- NLMSG_DONE (seq=1)
> 
> Oh great, now the confusion is complete. One person says this, another 
> says something else. Best of all, the Netlink RFC leaves it unspecified, 
> so it's all hearsay, beliefs and Perl5-style ("Source acts as normative 
> reference") referencing. I guess we are doomed until the original 
> Netlink3549 authors step up and tell us their intentions.
> 
> As I see it, we need a discussion to specify what is to be done with 
> unspecified parts, with 3549 as an origin.

The RFC was not written prior to the implementation but after it has
been around for a while.

The current netlink code implementation defines the standard. It is the
standard because we have not been breaking it and will never do.

Netlink is very open minded, it does not care if individual protocols
define their own semantics. Most will never make use of the above but
it is perfectly legal to do so.

NLM_F_MULTI && NLMSG_DONE is simply a way to have the receiver
continue receiving and parsing. The flag states "Wait, be patient,
my reply consists of multiple messages" and nothing more.


* Re: Xtables2 Netlink spec
  2010-12-17  7:25             ` Thomas Graf
@ 2010-12-17  9:35               ` Jan Engelhardt
  2010-12-17  9:50                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-17  9:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter


On Friday 2010-12-17 08:25, Thomas Graf wrote:
>On Thu, Dec 16, 2010 at 03:22:07PM +0100, Jan Engelhardt wrote:
>> 
>>Oh great, now the confusion is complete. One person says this,
>>another says something else. Best of all, the Netlink RFC leaves it
>>unspecified, so it's all hearsay, beliefs and Perl5-style ("Source
>>acts as normative reference") referencing. I guess we are doomed
>>until the original Netlink3549 authors step up and tell us their
>>intentions.
>> 
>>As I see it, we need a discussion to specify what is to be done
>>with unspecified parts, with 3549 as an origin.
>
>The current netlink code implementation defines the standard. It is
>the standard because we have not been breaking it and will never do.
>
>Netlink is very open minded, it does not care if individual
>protocols define their own semantics.

So in fact, it does allow for preservation of attribute order and 
support for multiple attributes appearing with the same type, since that 
is part of my subprotocol anyway, right?

Cf. http://marc.info/?l=netfilter-devel&m=129068531114996&w=2
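
To make the difference concrete, here is a sketch of two ways to consume the same attribute stream. attr_order() walks it sequentially and therefore sees every attribute in wire order, including repeats of the same type. attr_last_offsets() mimics a per-type lookup table, where a later attribute of a given type replaces an earlier one. Both function names are made up for illustration.

```c
#include <string.h>
#include <linux/netlink.h>

/*
 * Sequential walk: record the type of every attribute in wire order.
 * Returns the number of attributes found, or -1 on a malformed stream.
 */
static int attr_order(const char *buf, int len, unsigned short *types,
		      int max)
{
	int off = 0, n = 0;

	while (len - off >= NLA_HDRLEN) {
		struct nlattr nla;

		memcpy(&nla, buf + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > len - off)
			return -1;
		if (n < max)
			types[n] = nla.nla_type;
		n++;
		off += NLA_ALIGN(nla.nla_len);
	}
	return n;
}

/*
 * Table-style parse: tb[type] holds the offset of the LAST attribute of
 * that type, or -1 if none was seen. Order and repeats are lost.
 */
static void attr_last_offsets(const char *buf, int len, int *tb,
			      int maxtype)
{
	int off = 0, i;

	for (i = 0; i <= maxtype; i++)
		tb[i] = -1;
	while (len - off >= NLA_HDRLEN) {
		struct nlattr nla;

		memcpy(&nla, buf + off, sizeof(nla));
		if (nla.nla_len < NLA_HDRLEN || nla.nla_len > len - off)
			return;
		if (nla.nla_type <= maxtype)
			tb[nla.nla_type] = off;
		off += NLA_ALIGN(nla.nla_len);
	}
}
```

A protocol that relies on order or on repeated types has to be parsed with the first style on both sides of the socket.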


* Re: Xtables2 Netlink spec
  2010-12-17  9:35               ` Jan Engelhardt
@ 2010-12-17  9:50                 ` Pablo Neira Ayuso
  0 siblings, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-17  9:50 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Thomas Graf, Jozsef Kadlecsik, Netfilter Developer Mailing List,
	netfilter

On 17/12/10 10:35, Jan Engelhardt wrote:
> 
> On Friday 2010-12-17 08:25, Thomas Graf wrote:
>> On Thu, Dec 16, 2010 at 03:22:07PM +0100, Jan Engelhardt wrote:
>>>
>>> Oh great, now the confusion is complete. One person says this,
>>> another says something else. Best of all, the Netlink RFC leaves it
>>> unspecified, so it's all hearsay, beliefs and Perl5-style ("Source
>>> acts as normative reference") referencing. I guess we are doomed
>>> until the original Netlink3549 authors step up and tell us their
>>> intentions.
>>>
>>> As I see it, we need a discussion to specify what is to be done
>>> with unspecified parts, with 3549 as an origin.
>>
>> The current netlink code implementation defines the standard. It is
>> the standard because we have not been breaking it and will never do.
>>
>> Netlink is very open minded, it does not care if individual
>> protocols define their own semantics.
> 
> So in fact, it does allow for preservation of attribute order and 
> support for multiple attributes appearing with the same type, since that 
> is part of my subprotocol anyway, right?
> 
> Cf. http://marc.info/?l=netfilter-devel&m=129068531114996&w=2

As Thomas said, Netlink allows whatever protocol upon it, but I already told
you: making assumptions on the order of the attributes is not a good
practise because you'll have to stick to a certain message layout.

That's the opposite of what we aim for, which is to provide protocols that
can be easily extended in the future. If you assume that we cannot
change the attribute ordering, that's a rule that everybody will have to
live with forever.

Again, it will be valid, yes, but it's a poorly designed protocol.

Moreover, the reason why you want that attribute trailer is because of
the supposed-to-be limitations that you're trying to avoid. And, again,
I have to tell you that, avoiding the limitation by introducing
assumptions in the protocol is not a good idea.


* Re: Xtables2 Netlink spec
  2010-12-16 14:05         ` Thomas Graf
  2010-12-16 14:22           ` Jan Engelhardt
@ 2010-12-17  9:55           ` Pablo Neira Ayuso
  2010-12-17 14:56             ` Jan Engelhardt
  1 sibling, 1 reply; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-17  9:55 UTC (permalink / raw)
  To: Jan Engelhardt, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter

On 16/12/10 15:05, Thomas Graf wrote:
> On Wed, Dec 15, 2010 at 02:54:26PM +0100, Pablo Neira Ayuso wrote:
>>> BTW, can response messages - all those leading up to NLMSG_DONE -
>>> have different nlmsg_type, or not?
>>
>> They all have the same type.
> 
> This is not a MUST. It is perfectly legal to f.e.:
> 
>  -> FOO_GET (seq=1, NLM_F_REQUEST)
>  <- FOO_DEL (seq=1, NLM_F_MULTI)
>  <- FOO_ADD (seq=1, NLM_F_MULTI)
>  <- NLMSG_DONE (seq=1)

What realistic situation will require this?


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 14:19               ` Thomas Graf
@ 2010-12-17 10:00                 ` Pablo Neira Ayuso
  0 siblings, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-17 10:00 UTC (permalink / raw)
  To: Jan Engelhardt, Thomas Graf, Jesper Dangaard Brouer,
	Jozsef Kadlecsik, Netfilter 

On 16/12/10 15:19, Thomas Graf wrote:
> On Thu, Dec 16, 2010 at 02:51:07PM +0100, Jan Engelhardt wrote:
>>> Why not use nlattr to encode the error string? It would make error
>>> messages easier to extend in the future. At some point we might want
>>> to add an offset field which points into the original netlink
>>> message describing the attribute which caused the failure.
>>
>> Is that a yes or a no?
> 
> The proposed solution at netconf involved appending the error string
> directly. Inspired by your comment, I realized that encoding the
> error string as an nlattr, which allows for additional attributes,
> would be a better implementation.

Indeed, I'd prefer adding an extra netlink header with a new type like
NLM_ERROR2, followed by attributes such as the string (or an int) that
contains the error.

This can also be added without breaking existing apps and libraries IIRC.

> As for size limitations, even though most netlink protocols do it, I
> don't see the point in appending the whole request message in an error
> message. The header would be completely sufficient for all request/reply
> based protocols. It is no problem for userspace to keep a copy of the
> last request sent.

At least, we require the netlink header of the request message in error
messages. This is useful for message batches that are sent to
kernel-space, to know which one triggered the error.
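
As a minimal sketch of why the echoed header matters for batches: struct nlmsgerr (from linux/netlink.h) carries the errno followed by the header of the offending request, so userspace can read the sequence number back and map it to the batch entry that failed. The function name demo_failed_request is a hypothetical helper for this example, and the sketch assumes every message in the batch was sent with its own nlmsg_seq.

```c
#include <stddef.h>
#include <string.h>
#include <errno.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Inspect one received message.
 * Returns the positive errno and stores the sequence number of the
 * offending request in *seq if nlh is an NLMSG_ERROR reporting a failure,
 * 0 if it is a plain ACK (error == 0),
 * -1 if it is not a well-formed NLMSG_ERROR message.
 */
static int demo_failed_request(const struct nlmsghdr *nlh, unsigned int *seq)
{
        struct nlmsgerr err;

        if (nlh->nlmsg_type != NLMSG_ERROR ||
            nlh->nlmsg_len < (size_t)NLMSG_HDRLEN + sizeof(err))
                return -1;

        /* The payload starts with the errno, then the request's header. */
        memcpy(&err, (const unsigned char *)nlh + NLMSG_HDRLEN, sizeof(err));
        if (err.error == 0)
                return 0;

        *seq = err.msg.nlmsg_seq;
        return -err.error;      /* the kernel reports negative errno values */
}
```

With only the header echoed back, the error message stays small, and the sender can still look up the full request in its own copy of the batch by sequence number.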


* Re: Error reporting in Netlink (Re: Xtables2 Netlink spec)
  2010-12-16 23:23           ` Patrick McHardy
@ 2010-12-17 10:02             ` Pablo Neira Ayuso
  0 siblings, 0 replies; 55+ messages in thread
From: Pablo Neira Ayuso @ 2010-12-17 10:02 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jan Engelhardt, Jesper Dangaard Brouer, Jozsef Kadlecsik,
	Netfilter Developer Mailing List, netfilter, tgraf

On 17/12/10 00:23, Patrick McHardy wrote:
> Am 16.12.2010 13:51, schrieb Jan Engelhardt:
>> On Thursday 2010-12-16 10:57, Jesper Dangaard Brouer wrote:
>>
>>> Cc.ed Thomas Graf (tgraf@redhat.com), Thomas presented some interesting 
>>> ideas on netlink error-codes and strings during NetConf 2010, see:
>>>
>>> http://vger.kernel.org/netconf2010.html 
>>> http://vger.kernel.org/netconf2010_slides/tgraf_netconf10.odp
>>
>> The idea of appending an error string is ok for Netlink as a protocol
>> (specification-wise), but the size constraints of the skbuffs in
>> Linux may make its practical implementation a little harder. "Half of
>> the packet" is already used for the original request message, and
>> cramming an extra error string may bust the space.
>> It also does not look very netlinky to not use nlattrs ;-)
> 
> I agree, error strings don't look like a viable solution to me,
> they are basically impossible to interpret by an application,
> you run into localization issues and so on.
> 
> I'd suggest to return an errno value and the attribute causing
> the error, possibly also the value (we append the original message
> anyways, but in case of lists it might be hard to locate the specific
> attribute). The harder cases are when a combination of multiple
> attributes is responsible for the error, but still, the application
> has to understand the kernel interpretation anyways, so I'd simply
> return the errno and all attributes responsible. Leave interpretation
> up to userspace.

I'd also prefer an int. We can define the meaning of the error numbers
in the protocol header file.
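
A rough sketch of what that could look like, with the caveat that none of the names or values below come from the thread or from any existing header: the protocol header would define the numeric codes, and turning a code into text (and localizing it) would be left entirely to userspace.

```c
#include <stddef.h>

/* Hypothetical error codes; names and values are assumptions. */
enum demo_xt_error {
        DEMO_XTERR_NONE,
        DEMO_XTERR_NO_SUCH_CHAIN,
        DEMO_XTERR_UNKNOWN_EXTENSION,
        DEMO_XTERR_BAD_REVISION,
};

/* Userspace-side interpretation of a numeric code sent by the kernel. */
static const char *demo_xt_strerror(unsigned int code)
{
        static const char *const text[] = {
                [DEMO_XTERR_NONE]              = "success",
                [DEMO_XTERR_NO_SUCH_CHAIN]     = "no such chain",
                [DEMO_XTERR_UNKNOWN_EXTENSION] = "unknown extension",
                [DEMO_XTERR_BAD_REVISION]      = "unsupported revision",
        };

        /* Codes added by a newer kernel are reported generically. */
        if (code >= sizeof(text) / sizeof(text[0]))
                return "unknown error";
        return text[code];
}
```

An application that only needs to branch on the failure can compare the integer directly and never touch the strings.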


* Re: Xtables2 Netlink spec
  2010-12-17  9:55           ` Pablo Neira Ayuso
@ 2010-12-17 14:56             ` Jan Engelhardt
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Engelhardt @ 2010-12-17 14:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Netfilter Developer Mailing List, netfilter

On Friday 2010-12-17 10:55, Pablo Neira Ayuso wrote:

>On 16/12/10 15:05, Thomas Graf wrote:
>> On Wed, Dec 15, 2010 at 02:54:26PM +0100, Pablo Neira Ayuso wrote:
>>>> BTW, can response messages - all those leading up to NLMSG_DONE -
>>>> have different nlmsg_type, or not?
>>>
>>> They all have the same type.
>> 
>> This is not a MUST. It is perfectly legal to do, for example:
>> 
>>  -> FOO_GET (seq=1, NLM_F_REQUEST)
>>  <- FOO_DEL (seq=1, NLM_F_MULTI)
>>  <- FOO_ADD (seq=1, NLM_F_MULTI)
>>  <- NLMSG_DONE (seq=1)
>
>What realistic situation will require this?

This does:

-> NFXTM_CHAIN_DUMP<NFXTA_NAME>
<- NFXTM_RULE_START<>
<- NFXTM_EMATCH<NFXTA_NAME,NFXTA_REVISION,NFXTA_DATA>
<- NFXTM_EMATCH<NFXTA_NAME,NFXTA_REVISION,NFXTA_DATA>
<- NFXTM_ETARGET<NFXTA_NAME,NFXTA_REVISION,NFXTA_DATA>
<- NFXTM_ETARGET<NFXTA_NAME,NFXTA_REVISION,NFXTA_DATA>
<- NFXTM_RULE_END<>
<- NFXTM_RULE_START<>
<- NFXTM_ETARGET<NFXTA_VERDICT>
<- NFXTM_RULE_END<>
<- NLMSG_DONE

This is 9 messages with answers related to the ruleset.

If only a single nlmsg_type was possible for NLM_F_MULTI replies,
this is probably how things would have looked:

-> CHAIN_DUMP<NFXTA_NAME>
<- CHAIN_DUMP<NFXTA_RULE_START>
<- CHAIN_DUMP<NFXTA_MATCH_START>
<- CHAIN_DUMP<NFXTA_NAME><NFXTA_REVISION><NFXTA_DATA>
<- CHAIN_DUMP<NFXTA_MATCH_END>
<- CHAIN_DUMP<NFXTA_MATCH_START>
<- CHAIN_DUMP<NFXTA_NAME><NFXTA_REVISION><NFXTA_DATA>
<- CHAIN_DUMP<NFXTA_MATCH_END>
<- CHAIN_DUMP<NFXTA_TARGET_START>
<- CHAIN_DUMP<NFXTA_NAME><NFXTA_REVISION><NFXTA_DATA>
<- CHAIN_DUMP<NFXTA_TARGET_END>
<- CHAIN_DUMP<NFXTA_TARGET_START>
<- CHAIN_DUMP<NFXTA_NAME><NFXTA_REVISION><NFXTA_DATA>
<- CHAIN_DUMP<NFXTA_TARGET_END>
<- CHAIN_DUMP<NFXTA_RULE_END>
<- CHAIN_DUMP<NFXTA_RULE_START>
<- CHAIN_DUMP<NFXTA_TARGET_START>
<- CHAIN_DUMP<NFXTA_VERDICT>
<- CHAIN_DUMP<NFXTA_TARGET_END>
<- CHAIN_DUMP<NFXTA_RULE_END>
<- NLMSG_DONE

This requires more back-and-forth between userspace and the kernel:
19 messages instead of 9. Using multiple nlmsg_type values seems a good
thing to exploit.
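
For illustration, a minimal sketch of how a receiver could consume the first (mixed nlmsg_type) reply stream: it dispatches on nlmsg_type and uses RULE_START/RULE_END as brackets until NLMSG_DONE arrives. The NFXTM_* names are taken from the dump above, but their numeric values here are assumptions, and demo_put_msg/demo_walk_dump are hypothetical helpers, not part of libnetfilter_xtables.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Names follow the dump above; the numeric values are assumed. */
enum {
        NFXTM_RULE_START = NLMSG_MIN_TYPE,
        NFXTM_RULE_END,
        NFXTM_EMATCH,
        NFXTM_ETARGET,
};

struct demo_dump_stats {
        unsigned int messages, rules, matches, targets;
};

/* Append a payload-less message of the given type; returns new offset. */
static size_t demo_put_msg(unsigned char *buf, size_t off, unsigned short type)
{
        struct nlmsghdr nlh = {
                .nlmsg_len   = NLMSG_HDRLEN,
                .nlmsg_type  = type,
                .nlmsg_flags = NLM_F_MULTI,
                .nlmsg_seq   = 1,
        };

        memcpy(buf + off, &nlh, sizeof(nlh));
        return off + NLMSG_ALIGN(nlh.nlmsg_len);
}

/*
 * Walk back-to-back Netlink messages until NLMSG_DONE, counting what was
 * seen. Returns 0 if the dump is complete and properly bracketed,
 * -1 on truncation, missing NLMSG_DONE or bad nesting.
 */
static int demo_walk_dump(const void *buf, size_t len,
                          struct demo_dump_stats *st)
{
        const unsigned char *p = buf;
        int in_rule = 0;

        while (len >= (size_t)NLMSG_HDRLEN) {
                const struct nlmsghdr *nlh = (const void *)p;
                size_t step;

                if (nlh->nlmsg_len < NLMSG_HDRLEN || nlh->nlmsg_len > len)
                        return -1;
                if (nlh->nlmsg_type == NLMSG_DONE)
                        return in_rule ? -1 : 0;

                ++st->messages;
                switch (nlh->nlmsg_type) {
                case NFXTM_RULE_START:
                        if (in_rule)
                                return -1;
                        in_rule = 1;
                        ++st->rules;
                        break;
                case NFXTM_RULE_END:
                        if (!in_rule)
                                return -1;
                        in_rule = 0;
                        break;
                case NFXTM_EMATCH:
                case NFXTM_ETARGET:
                        if (!in_rule)
                                return -1;
                        if (nlh->nlmsg_type == NFXTM_EMATCH)
                                ++st->matches;
                        else
                                ++st->targets;
                        break;
                default:
                        break;  /* ignore types this parser does not know */
                }

                step = NLMSG_ALIGN(nlh->nlmsg_len);
                if (step > len)
                        step = len;
                p += step;
                len -= step;
        }
        return -1;      /* ran out of data before NLMSG_DONE */
}
```

Feeding it the nine-message sequence from the first dump, followed by NLMSG_DONE, yields two rules, two matches and three targets.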


Thread overview: 55+ messages
2010-11-24 22:29 Xtables2 Netlink spec Jan Engelhardt
2010-11-25 11:42 ` Pablo Neira Ayuso
2010-11-25 13:35   ` Jan Engelhardt
2010-11-25 14:21     ` Pablo Neira Ayuso
2010-11-25 21:46       ` Jan Engelhardt
2010-11-26  8:25         ` Pablo Neira Ayuso
2010-11-26 13:59           ` Jan Engelhardt
2010-11-26 19:48             ` Jozsef Kadlecsik
2010-11-26 19:55               ` Jan Engelhardt
2010-11-26 20:05                 ` Jozsef Kadlecsik
2010-11-26 21:33                   ` Jan Engelhardt
     [not found]                     ` <alpine.DEB.2.00.1011270951330.20431@blackhole.kfki.hu>
2010-11-27 13:39                       ` Jan Engelhardt
2010-11-27 17:04                         ` Jozsef Kadlecsik
2010-11-27 17:35                           ` Jan Engelhardt
2010-11-27 20:42                             ` Jozsef Kadlecsik
2010-11-29 12:30                               ` Pablo Neira Ayuso
2010-11-29 12:39                                 ` Jozsef Kadlecsik
2010-11-29 12:55                                   ` Pablo Neira Ayuso
2010-11-29 13:26                                     ` Jan Engelhardt
2010-11-29 13:49                                   ` Pablo Neira Ayuso
2010-11-29 12:23                 ` Pablo Neira Ayuso
2010-11-27 11:10             ` Pablo Neira Ayuso
2010-11-26 15:27           ` Jan Engelhardt
2010-11-27 12:25             ` Pablo Neira Ayuso
2010-12-03 21:03           ` Jan Engelhardt
2010-12-07  7:49             ` Pablo Neira Ayuso
2010-12-07 13:30               ` Jan Engelhardt
2010-12-08 11:36                 ` Pablo Neira Ayuso
2010-11-26 19:01 ` Jozsef Kadlecsik
2010-12-09 12:08   ` Pablo Neira Ayuso
2010-12-14  2:01     ` Jan Engelhardt
2010-12-14  2:16       ` James Nurmi
2010-12-14  3:46         ` Jan Engelhardt
2010-12-15 13:54       ` Pablo Neira Ayuso
2010-12-16 14:05         ` Thomas Graf
2010-12-16 14:22           ` Jan Engelhardt
2010-12-17  7:25             ` Thomas Graf
2010-12-17  9:35               ` Jan Engelhardt
2010-12-17  9:50                 ` Pablo Neira Ayuso
2010-12-17  9:55           ` Pablo Neira Ayuso
2010-12-17 14:56             ` Jan Engelhardt
2010-12-15  4:55   ` Jan Engelhardt
2010-12-15  8:51     ` Jozsef Kadlecsik
2010-12-16  9:57       ` Jesper Dangaard Brouer
2010-12-16 12:51         ` Error reporting in Netlink (Re: Xtables2 Netlink spec) Jan Engelhardt
2010-12-16 13:43           ` Thomas Graf
2010-12-16 13:51             ` Jan Engelhardt
2010-12-16 14:19               ` Thomas Graf
2010-12-17 10:00                 ` Pablo Neira Ayuso
2010-12-16 14:47             ` Jozsef Kadlecsik
2010-12-16 15:09               ` Jan Engelhardt
2010-12-16 23:31             ` Patrick McHardy
2010-12-17  6:58               ` Thomas Graf
2010-12-16 23:23           ` Patrick McHardy
2010-12-17 10:02             ` Pablo Neira Ayuso
