Netdev List
* Re: [PATCH] net/usb: Add Samsung Kalmia driver for Samsung GT-B3730
From: David Miller @ 2011-06-11 23:29 UTC
  To: marius.kotsbak; +Cc: netdev, linux-usb, marius
In-Reply-To: <20110611.162711.907394751613071878.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sat, 11 Jun 2011 16:27:11 -0700 (PDT)

> From: "Marius B. Kotsbak" <marius.kotsbak@gmail.com>
> Date: Sat, 11 Jun 2011 23:55:18 +0200
> 
>> Introducing driver for the network port of Samsung Kalmia based USB LTE modems.
>> It has also an ACM interface that previous patches associates with the "option"
>> module. To access those interfaces, the modem must first be switched from modem
>> mode using a tool like usb_modeswitch.
>> 
>> As the proprietary protocol has been discovered by watching the MS Windows driver
>> behavior, there might be errors in the protocol handling, but stable and fast
>> connection has been established for hours with Norwegian operator NetCom that
>> distributes this modem with their LTE/4G subscription.
>> 
>> More and updated information about how to use this driver is available here:
>> 
>> http://www.draisberghof.de/usb_modeswitch/bb/viewtopic.php?t=465
>> https://github.com/mkotsbak/Samsung-GT-B3730-linux-driver
>> 
>> Signed-off-by: Marius B. Kotsbak <marius@kotsbak.com>
> 
> Applied, thanks.

Actually, reverted.

There's a typo in your Makefile patch, and because of this it
won't even build the new driver.

People are so damn anxious to get this backported into stable
and various distributions, yet this patch wasn't even tested
properly.

That really drives me crazy.


* Re: [patch] netpoll: call dev_put() on error in netpoll_setup()
From: David Miller @ 2011-06-12  1:55 UTC
  To: error27; +Cc: amwang, herbert, nhorman, eric.dumazet, netdev, kernel-janitors
In-Reply-To: <20110611155047.GA3583@shale.localdomain>

From: Dan Carpenter <error27@gmail.com>
Date: Sat, 11 Jun 2011 18:50:47 +0300

> There is a dev_put(ndev) missing on an error path.  This was
> introduced in 0c1ad04aecb "netpoll: prevent netpoll setup on slave
> devices".
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> ---
> This is a static checker bug, and it's possible I've misunderstood
> something.

Definitely looks correct to me, applied, thanks!
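
The rule the patch restores is that every exit taken after a reference has been acquired must release it. A minimal userspace sketch of that pattern follows. It is not the netpoll code: struct fake_dev, fake_dev_hold(), fake_dev_put() and fake_setup() are hypothetical stand-ins for the kernel's device refcounting, and the "slave" check only models a condition that forces an error return.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted device; not struct net_device. */
struct fake_dev {
	int refcnt;	/* number of references currently held */
	int is_slave;	/* models a device that setup must refuse */
};

/* Models taking a reference on the device. */
static void fake_dev_hold(struct fake_dev *dev)
{
	dev->refcnt++;
}

/* Models dropping a reference; must pair with exactly one hold. */
static void fake_dev_put(struct fake_dev *dev)
{
	assert(dev->refcnt > 0);
	dev->refcnt--;
}

/* Returns 0 on success (reference kept), -1 on error (reference dropped). */
static int fake_setup(struct fake_dev *dev)
{
	int err;

	fake_dev_hold(dev);	/* reference acquired here */

	if (dev->is_slave) {
		err = -1;
		goto put;	/* error exit: must not return directly */
	}

	return 0;		/* success: caller now owns the reference */

put:
	fake_dev_put(dev);	/* single release point for all error exits */
	return err;
}
```

Routing every error exit through one label keeps the hold/put pairing easy to audit, which is also what lets a static checker flag a path that returns without passing through it.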


* Re: [PATCH] ISDN, hfcsusb: Don't leak in hfcsusb_ph_info()
From: David Miller @ 2011-06-12  1:59 UTC
  To: jj; +Cc: isdn, netdev, linux-kernel, sprenger, info
In-Reply-To: <alpine.LNX.2.00.1106111832270.23835@swampdragon.chaosbits.net>

From: Jesper Juhl <jj@chaosbits.net>
Date: Sat, 11 Jun 2011 18:36:42 +0200 (CEST)

> We leak the memory allocated to 'phi' when the variable goes out of scope 
> in hfcsusb_ph_info().
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Applied, thanks Jesper.
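
The class of bug described above, a heap buffer whose only pointer goes out of scope without being freed, can be shown with a small userspace sketch. It is an illustration only, not the hfcsusb code: counted_alloc(), counted_free(), live_allocs and report_info() are hypothetical names, and the counter exists purely so the leak (or its absence) can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of buffers allocated and not yet freed (illustration only). */
static int live_allocs;

static void *counted_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void counted_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/*
 * Builds a temporary info buffer, consumes it, and releases it before
 * the local pointer goes out of scope. Returns the sum 0 + 1 + ... +
 * (channels - 1), or -1 if the allocation fails.
 */
static int report_info(int channels)
{
	int *info;
	int i, sum = 0;

	assert(channels > 0);
	info = counted_alloc(sizeof(*info) * (size_t)channels);
	if (!info)
		return -1;

	for (i = 0; i < channels; i++)
		info[i] = i;
	for (i = 0; i < channels; i++)
		sum += info[i];

	counted_free(info);	/* without this, the buffer would leak */
	return sum;
}
```

Dropping the counted_free() call leaves live_allocs at 1 after each call, which is the userspace analogue of the leak the patch closes.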


* Re: [net-next 13/13] ixgbe: use per NUMA node lock for FCoE DDP
From: David Miller @ 2011-06-12  2:00 UTC
  To: eric.dumazet; +Cc: jeffrey.t.kirsher, vasu.dev, netdev, gospo
In-Reply-To: <1307770931.2872.70.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 11 Jun 2011 07:42:11 +0200

> This patch seems overkill to me, have you tried the more simple way I
> did in commit 79640a4ca6955e3ebdb7038508fa7a0cd7fa5527
> (net: add additional lock to qdisc to increase throughput )
> 
> (remember you must place ->busylock in a separate cache line, to not
> slow down the two cpus that have access to ->lock)
> 
> struct ixgbe_fcoe could probably be more carefully reordered to lower
> false sharing
> 
> I kindly ask you guys provide actual perf numbers between
> 
> 1) before any patch
> 2) After your multilevel per numanode locks
> 3) A more simple way (my suggestion of adding a single 'busylock')

Jeff, please sort out these issues with Eric and resend your pull
request once things are resolved.

Thanks!
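
Eric's point about keeping ->busylock on its own cache line can be sketched in portable C11. This is a layout illustration under stated assumptions, not the ixgbe or qdisc code: struct hot_locks, CACHE_LINE and same_line() are hypothetical, the 64-byte line size is assumed, and plain ints stand in for spinlocks. Kernel code would normally use its own annotations (for example ____cacheline_aligned_in_smp) rather than _Alignas.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed line size; the real value is CPU-specific */

/* Hypothetical structure: two locks contended by different sets of CPUs. */
struct hot_locks {
	int lock;				/* stands in for the main ->lock */
	_Alignas(CACHE_LINE) int busylock;	/* forced onto its own line */
};

/* Compile-time check that busylock starts a fresh cache line. */
static_assert(offsetof(struct hot_locks, busylock) % CACHE_LINE == 0,
	      "busylock must start on a cache-line boundary");

/* Returns 1 if two byte offsets fall within the same cache line. */
static int same_line(size_t a, size_t b)
{
	return a / CACHE_LINE == b / CACHE_LINE;
}
```

With the alignment in place, CPUs spinning on busylock no longer invalidate the line that holds lock, which is the false sharing the mail asks to avoid. Whether that is enough for the FCoE DDP path is exactly what the requested perf numbers would have to show.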


* Re: [PATCH] net/usb: Add Samsung Kalmia driver for Samsung GT-B3730
From: Ben Hutchings @ 2011-06-12  3:46 UTC
  To: David Miller
  Cc: marius.kotsbak, netdev, linux-usb, marius
In-Reply-To: <20110611.162942.1706711069327005315.davem@davemloft.net>

On Sat, 2011-06-11 at 16:29 -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 11 Jun 2011 16:27:11 -0700 (PDT)
> 
> > From: "Marius B. Kotsbak" <marius.kotsbak@gmail.com>
> > Date: Sat, 11 Jun 2011 23:55:18 +0200
> > 
> >> Introducing driver for the network port of Samsung Kalmia based USB LTE modems.
> >> It has also an ACM interface that previous patches associates with the "option"
> >> module. To access those interfaces, the modem must first be switched from modem
> >> mode using a tool like usb_modeswitch.
> >> 
> >> As the proprietary protocol has been discovered by watching the MS Windows driver
> >> behavior, there might be errors in the protocol handling, but stable and fast
> >> connection has been established for hours with Norwegian operator NetCom that
> >> distributes this modem with their LTE/4G subscription.
> >> 
> >> More and updated information about how to use this driver is available here:
> >> 
> >> http://www.draisberghof.de/usb_modeswitch/bb/viewtopic.php?t=465
> >> https://github.com/mkotsbak/Samsung-GT-B3730-linux-driver
> >> 
> >> Signed-off-by: Marius B. Kotsbak <marius@kotsbak.com>
> > 
> > Applied, thanks.
> 
> Actually, reverted.
> 
> There's a typo in your Makefile patch, and because of this it
> won't even build the new driver.
[...]

Surely the error is in the Kconfig - the 'CONFIG_' prefix should not be
there.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [GIT] Networking
From: David Miller @ 2011-06-12  4:01 UTC
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Memory leak in ISDN hfcsusb driver, from Jesper Juhl.
2) Network device leak in netpoll setup, from Dan Carpenter
3) Fix error handling in get_net_ns_by_fd() and sanitize error
   return values of l2tp_dfs_seq_open(), from Al Viro.
4) Bridge netfilter fake rtable ops needs cow_metrics handler, otherwise
   we OOPS, from Alexander Holler.
5) Endianness fix in dl2k EEPROM code from Daniel Hellstrom.
6) Signedness fix in ip{,6}_queue from Dave Jones.
7) Mark lockdep classes in IRDA properly.
8) Prevent kernel stack data leak in af_packet, from Eric Dumazet.
9) ep93xx_eth DMA et al. fixes from Mika Westerberg.
10) Bonding screws up TX queue selection on the way down to the physical
    device; save and restore it properly, from Neil Horman.
11) Fix conntrack ct leak in l4proto->error(), from Pablo Neira Ayuso.
12) am79c961 fixes from Russell King
13) Channel switch locking fixes in iwlagn from Stanislaw Gruszka.
14) IPSEC replay handling has off-by-one error, from Steffen Klassert.
15) gianfar filter table needs to be per-device, from Wu Jiajun-B06378.
16) Turn off ath5k fast channel switching by default, causes problems
    for some people.  From Nick Kossifidis.
17) CPU offlining can stall packet processing, fix from Heiko Carstens.

Please pull, thanks a lot!

The following changes since commit b99ca60c83a631adaba9c2fff8f2dd14d3517a61:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 (2011-06-11 19:56:25 -0700)

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Al Viro (2):
      get_net_ns_by_fd() oopses if proc_ns_fget() returns an error
      fix return values of l2tp_dfs_seq_open()

Alexander Holler (1):
      bridge: provide a cow_metrics method for fake_ops

Anirban Chakraborty (1):
      qlcnic: Fix bug in FW queue dump

Dan Carpenter (1):
      netpoll: call dev_put() on error in netpoll_setup()

Daniel Drake (1):
      libertas_sdio: handle spurious interrupts

Daniel Hellstrom (1):
      dl2k: EEPROM CRC calculation wrong endianess on bigendian machine

Dave Jones (1):
      netfilter: use unsigned variables for packet lengths in ip[6]_queue.

David S. Miller (4):
      Merge branch 'pablo/nf-2.6-updates' of git://1984.lsi.us.es/net-2.6
      Merge branch 'for-davem' of git://git.kernel.org/.../linville/wireless-2.6
      net: Rework netdev_drivername() to avoid warning.
      irda: iriap: Use seperate lockdep class for irias_objects->hb_spinlock

Eric Dumazet (3):
      netfilter: add more values to enum ip_conntrack_info
      af_packet: prevent information leak
      net: pmtu_expires fixes

Grant Likely (1):
      net: fix smc91x.c device tree support

H Hartley Sweeten (1):
      ep93xx_eth: Update MAINTAINERS

Heiko Carstens (1):
      net: cpu offline cause napi stall

Jesper Juhl (1):
      ISDN, hfcsusb: Don't leak in hfcsusb_ph_info()

Jiri Pirko (1):
      vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check

Johannes Berg (1):
      mac80211: fix IBSS teardown race

John W. Linville (4):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6 into for-davem
      Revert "mac80211: Skip tailroom reservation for full HW-crypto devices"
      Revert "mac80211: stop queues before rate control updation"
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6 into for-davem

Jozsef Kadlecsik (2):
      netfilter: ipset: Fix return code for destroy when sets are in use
      netfilter: ipset: Use the stored first cidr value instead of '1'

Julian Anastasov (2):
      ipvs: restore support for iptables SNAT
      netfilter: nf_nat: fix crash in nf_nat_csum

Luciano Coelho (1):
      nl80211: fix overflow in ssid_len

Marcus Meissner (1):
      net/ipv6: check for mistakenly passed in non-AF_INET6 sockaddrs

Mika Westerberg (5):
      ep93xx: set DMA masks for the ep93xx_eth
      net: ep93xx_eth: pass struct device to DMA API functions
      net: ep93xx_eth: allocate buffers using kmalloc()
      net: ep93xx_eth: drop GFP_DMA from call to dma_alloc_coherent()
      net: ep93xx_eth: fix DMA API violations

Mike McCormack (2):
      rtlwifi: Fix logic in rx_interrupt
      rtlwifi: Avoid modifying skbs that are resubmitted

Neil Horman (1):
      bonding: reset queue mapping prior to transmission to physical device (v5)

Nick Kossifidis (1):
      ath5k: Disable fast channel switching by default

Pablo Neira Ayuso (1):
      netfilter: nf_conntrack: fix ct refcount leak in l4proto->error()

Rafał Miłecki (1):
      ssb: fix PCI(e) driver regression causing oops on PCI cards

Russell King - ARM Linux (3):
      NET: am79c961: ensure asm() statements are marked volatile
      NET: am79c961: ensure multicast filter is correctly set at open
      NET: am79c961: fix assembler warnings

Stanislaw Gruszka (5):
      iwlagn: fix channel switch locking
      iwlagn: use cts-to-self protection on 5000 adapters series
      rt2x00: fix rmmod crash
      iwl4965: set tx power after rxon_assoc
      iwlegacy: fix channel switch locking

Steffen Klassert (2):
      xfrm: Fix off by one in the replay advance functions
      ipv4: Fix packet size calculation for raw IPsec packets in __ip_append_data

Sucheta Chakraborty (1):
      qlcnic: Avoid double free of skb in tx path

Thadeu Lima de Souza Cascardo (1):
      mac80211: call dev_alloc_name before copying name to sdata

WANG Cong (1):
      netpoll: prevent netpoll setup on slave devices

Wey-Yi Guy (1):
      iwlagn: send tx power command if defer cause by RXON not match

Williams, Mitch A (1):
      igb: fix i350 SR-IOV failture

Wu Jiajun-B06378 (1):
      gianfar:localized filer table

Yegor Yefremov (1):
      ethtool.h: fix typos

 MAINTAINERS                                    |    2 +-
 arch/arm/mach-ep93xx/core.c                    |    6 +-
 drivers/isdn/hardware/mISDN/hfcsusb.c          |    1 +
 drivers/net/arm/am79c961a.c                    |  126 ++++++++++++------------
 drivers/net/arm/ep93xx_eth.c                   |   82 ++++++++--------
 drivers/net/bonding/bond_main.c                |   11 ++
 drivers/net/dl2k.c                             |    2 +-
 drivers/net/gianfar.c                          |   29 +++---
 drivers/net/gianfar.h                          |    8 +-
 drivers/net/gianfar_ethtool.c                  |   64 ++++++------
 drivers/net/igb/igb_main.c                     |    3 +
 drivers/net/qlcnic/qlcnic_hw.c                 |    1 +
 drivers/net/qlcnic/qlcnic_main.c               |    1 +
 drivers/net/smc91x.c                           |    6 +-
 drivers/net/wireless/ath/ath5k/base.c          |   11 ++-
 drivers/net/wireless/ath/ath5k/reset.c         |    5 +-
 drivers/net/wireless/iwlegacy/iwl-4965.c       |   12 +--
 drivers/net/wireless/iwlegacy/iwl-core.c       |   30 +++---
 drivers/net/wireless/iwlegacy/iwl-core.h       |    2 +-
 drivers/net/wireless/iwlegacy/iwl-dev.h        |   13 +---
 drivers/net/wireless/iwlegacy/iwl4965-base.c   |   20 ++--
 drivers/net/wireless/iwlwifi/iwl-2000.c        |   74 --------------
 drivers/net/wireless/iwlwifi/iwl-5000.c        |    3 -
 drivers/net/wireless/iwlwifi/iwl-6000.c        |    2 -
 drivers/net/wireless/iwlwifi/iwl-agn-hcmd.c    |   12 +--
 drivers/net/wireless/iwlwifi/iwl-agn-rxon.c    |   19 +++-
 drivers/net/wireless/iwlwifi/iwl-agn.c         |   19 ++--
 drivers/net/wireless/iwlwifi/iwl-core.c        |    6 +-
 drivers/net/wireless/iwlwifi/iwl-core.h        |    1 +
 drivers/net/wireless/iwlwifi/iwl-dev.h         |   13 +---
 drivers/net/wireless/iwlwifi/iwl-rx.c          |   24 +++---
 drivers/net/wireless/libertas/if_sdio.c        |   21 +++-
 drivers/net/wireless/rt2x00/rt2x00config.c     |    3 +-
 drivers/net/wireless/rt2x00/rt2x00dev.c        |    4 +
 drivers/net/wireless/rtlwifi/pci.c             |   30 +++---
 drivers/ssb/driver_pcicore.c                   |   10 +-
 include/linux/ethtool.h                        |    6 +-
 include/linux/if_packet.h                      |    2 +
 include/linux/if_vlan.h                        |   25 ++++-
 include/linux/netdevice.h                      |    2 +-
 include/linux/netfilter/nf_conntrack_common.h  |    3 +
 include/linux/skbuff.h                         |    5 +
 net/8021q/vlan_core.c                          |   60 ++++++-----
 net/bridge/br_netfilter.c                      |    6 +
 net/core/dev.c                                 |   23 ++---
 net/core/net_namespace.c                       |   16 ++--
 net/core/netpoll.c                             |    7 ++
 net/ipv4/ip_output.c                           |    6 +-
 net/ipv4/netfilter/ip_queue.c                  |    3 +-
 net/ipv4/netfilter/ipt_CLUSTERIP.c             |    6 +-
 net/ipv4/netfilter/ipt_MASQUERADE.c            |    2 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |    2 +-
 net/ipv4/netfilter/nf_nat_core.c               |    2 +-
 net/ipv4/netfilter/nf_nat_helper.c             |    2 +-
 net/ipv4/netfilter/nf_nat_rule.c               |    2 +-
 net/ipv4/netfilter/nf_nat_standalone.c         |    4 +-
 net/ipv4/route.c                               |   78 ++++++++-------
 net/ipv6/af_inet6.c                            |    4 +
 net/ipv6/netfilter/ip6_queue.c                 |    3 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |    2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    2 +-
 net/irda/iriap.c                               |    5 +
 net/l2tp/l2tp_debugfs.c                        |    2 +-
 net/mac80211/ibss.c                            |    6 +-
 net/mac80211/ieee80211_i.h                     |    3 -
 net/mac80211/iface.c                           |    4 +
 net/mac80211/key.c                             |   21 +----
 net/mac80211/mlme.c                            |    6 -
 net/mac80211/tx.c                              |    7 +-
 net/netfilter/ipset/ip_set_core.c              |    2 +-
 net/netfilter/ipset/ip_set_hash_ipportnet.c    |   10 +-
 net/netfilter/ipset/ip_set_hash_net.c          |    8 +-
 net/netfilter/ipset/ip_set_hash_netport.c      |    6 +-
 net/netfilter/ipvs/ip_vs_core.c                |   16 ++--
 net/netfilter/nf_conntrack_core.c              |    7 +-
 net/netfilter/nf_conntrack_ftp.c               |    2 +-
 net/netfilter/nf_conntrack_h323_main.c         |   10 +-
 net/netfilter/nf_conntrack_irc.c               |    3 +-
 net/netfilter/nf_conntrack_pptp.c              |    3 +-
 net/netfilter/nf_conntrack_sane.c              |    2 +-
 net/netfilter/nf_conntrack_sip.c               |    2 +-
 net/netfilter/xt_socket.c                      |    4 +-
 net/packet/af_packet.c                         |    2 +
 net/sched/sch_generic.c                        |    3 +-
 net/wireless/nl80211.c                         |    9 +-
 net/xfrm/xfrm_replay.c                         |    4 +-
 87 files changed, 553 insertions(+), 545 deletions(-)


* [PATCH iproute2] xfrm: Update documentation
From: David Ward @ 2011-06-12  2:13 UTC
  To: netdev; +Cc: David Ward

The ip(8) man page and the "ip xfrm [ XFRM-OBJECT ] help" command output
are updated to include missing options, fix errors, and improve grammar.
There are no functional changes made.

The documentation for the ip command has many different meanings for the
same formatting symbols (which really needs to be fixed). This patch makes
consistent use of brackets [ ] to indicate optional parameters, pipes | to
mean "OR", braces { } to group things together, and dashes - instead of
underscores _ inside of parameter names. The parameters are listed in the
order in which they are parsed in the source code.

There are several parameters and options that are still not mentioned or
need to be described more thoroughly in the "COMMAND SYNTAX" section of
the ip(8) man page. I would appreciate help from the developers with this.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
---
 ip/ipxfrm.c       |    8 +-
 ip/xfrm_monitor.c |    2 +-
 ip/xfrm_policy.c  |   72 +++---
 ip/xfrm_state.c   |  120 ++++-----
 man/man8/ip.8     |  751 +++++++++++++++++++++++++++++------------------------
 5 files changed, 515 insertions(+), 438 deletions(-)

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 48a732f..f55bff9 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -59,8 +59,8 @@ static void usage(void) __attribute__((noreturn));
 static void usage(void)
 {
 	fprintf(stderr,
-		"Usage: ip xfrm XFRM_OBJECT { COMMAND | help }\n"
-		"where  XFRM_OBJECT := { state | policy | monitor }\n");
+		"Usage: ip xfrm XFRM-OBJECT { COMMAND | help }\n"
+		"where  XFRM-OBJECT := state | policy | monitor\n");
 	exit(-1);
 }
 
@@ -1040,7 +1040,7 @@ int xfrm_id_parse(xfrm_address_t *saddr, struct xfrm_id *id, __u16 *family,
 
 			ret = xfrm_xfrmproto_getbyname(*argv);
 			if (ret < 0)
-				invarg("\"XFRM_PROTO\" is invalid", *argv);
+				invarg("\"XFRM-PROTO\" is invalid", *argv);
 
 			id->proto = (__u8)ret;
 
@@ -1072,7 +1072,7 @@ int xfrm_id_parse(xfrm_address_t *saddr, struct xfrm_id *id, __u16 *family,
 		invarg("the same address family is required between \"src\" and \"dst\"", *argv);
 
 	if (loose == 0 && id->proto == 0)
-		missarg("XFRM_PROTO");
+		missarg("XFRM-PROTO");
 	if (argc == *argcp)
 		missarg("ID");
 
diff --git a/ip/xfrm_monitor.c b/ip/xfrm_monitor.c
index dc12fca..6a5b331 100644
--- a/ip/xfrm_monitor.c
+++ b/ip/xfrm_monitor.c
@@ -37,7 +37,7 @@ static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: ip xfrm monitor [ all | LISTofOBJECTS ]\n");
+	fprintf(stderr, "Usage: ip xfrm monitor [ all | LISTofXFRM-OBJECTS ]\n");
 	exit(-1);
 }
 
diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index 7827f91..2a14903 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -54,50 +54,50 @@ static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: ip xfrm policy { add | update } dir DIR SELECTOR [ ctx SEC_CTX ][ index INDEX ] [ ptype PTYPE ]\n");
-	fprintf(stderr, "        [ action ACTION ] [ priority PRIORITY ] [ flag FLAG-LIST ] [ LIMIT-LIST ] [ TMPL-LIST ] [mark MARK [mask MASK]]\n");
-	fprintf(stderr, "Usage: ip xfrm policy { delete | get } dir DIR [ SELECTOR | index INDEX ] [ ctx SEC_CTX ][ ptype PTYPE ] [mark MARK [mask MASK]]\n");
-	fprintf(stderr, "Usage: ip xfrm policy { deleteall | list } [ dir DIR ] [ SELECTOR ]\n");
-	fprintf(stderr, "        [ index INDEX ] [ action ACTION ] [ priority PRIORITY ]  [ flag FLAG-LIST ]\n");
+	fprintf(stderr, "Usage: ip xfrm policy { add | update } SELECTOR dir DIR [ ctx CTX ]\n");
+	fprintf(stderr, "        [ mark MARK [ mask MASK ] ] [ index INDEX ] [ ptype PTYPE ]\n");
+	fprintf(stderr, "        [ action ACTION ] [ priority PRIORITY ] [ flag FLAG-LIST ]\n");
+	fprintf(stderr, "        [ LIMIT-LIST ] [ TMPL-LIST ]\n");
+	fprintf(stderr, "Usage: ip xfrm policy { delete | get } { SELECTOR | index INDEX } dir DIR\n");
+	fprintf(stderr, "        [ ctx CTX ] [ mark MARK [ mask MASK ] ] [ ptype PTYPE ]\n");
+	fprintf(stderr, "Usage: ip xfrm policy { deleteall | list } [ SELECTOR ] [ dir DIR ]\n");
+	fprintf(stderr, "        [ index INDEX ] [ ptype PTYPE ] [ action ACTION ] [ priority PRIORITY ]\n");
+	fprintf(stderr, "        [ flag FLAG-LIST ]\n");
 	fprintf(stderr, "Usage: ip xfrm policy flush [ ptype PTYPE ]\n");
 	fprintf(stderr, "Usage: ip xfrm count\n");
-	fprintf(stderr, "PTYPE := [ main | sub ](default=main)\n");
-	fprintf(stderr, "DIR := [ in | out | fwd ]\n");
-
-	fprintf(stderr, "SELECTOR := src ADDR[/PLEN] dst ADDR[/PLEN] [ UPSPEC ] [ dev DEV ]\n");
-
-	fprintf(stderr, "UPSPEC := proto PROTO [ [ sport PORT ] [ dport PORT ] |\n");
-	fprintf(stderr, "                        [ type NUMBER ] [ code NUMBER ] |\n");
-	fprintf(stderr, "                        [ key { DOTTED_QUAD | NUMBER } ] ]\n");
-
-	//fprintf(stderr, "DEV - device name(default=none)\n");
-
-	fprintf(stderr, "ACTION := [ allow | block ](default=allow)\n");
-
-	//fprintf(stderr, "PRIORITY - priority value(default=0)\n");
-
+	fprintf(stderr, "SELECTOR := [ src ADDR[/PLEN] ] [ dst ADDR[/PLEN] ] [ dev DEV ] [ UPSPEC ]\n");
+	fprintf(stderr, "UPSPEC := proto { { ");
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_TCP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_UDP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_SCTP));
+	fprintf(stderr, "%s", strxf_proto(IPPROTO_DCCP));
+	fprintf(stderr, " } [ sport PORT ] [ dport PORT ] |\n");
+	fprintf(stderr, "                  { ");
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_ICMP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_ICMPV6));
+	fprintf(stderr, "%s", strxf_proto(IPPROTO_MH));
+	fprintf(stderr, " } [ type NUMBER ] [ code NUMBER ] |\n");
+	fprintf(stderr, "                  %s", strxf_proto(IPPROTO_GRE));
+	fprintf(stderr, " [ key { DOTTED-QUAD | NUMBER } ] | PROTO }\n");
+	fprintf(stderr, "DIR := in | out | fwd\n");
+	fprintf(stderr, "PTYPE := main | sub\n");
+	fprintf(stderr, "ACTION := allow | block\n");
 	fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n");
-	fprintf(stderr, "FLAG := [ localok | icmp ]\n");
-
-	fprintf(stderr, "LIMIT-LIST := [ LIMIT-LIST ] | [ limit LIMIT ]\n");
-	fprintf(stderr, "LIMIT := [ [time-soft|time-hard|time-use-soft|time-use-hard] SECONDS ] |\n");
-	fprintf(stderr, "         [ [byte-soft|byte-hard] SIZE ] | [ [packet-soft|packet-hard] NUMBER ]\n");
-
-	fprintf(stderr, "TMPL-LIST := [ TMPL-LIST ] | [ tmpl TMPL ]\n");
+	fprintf(stderr, "FLAG := localok | icmp\n");
+	fprintf(stderr, "LIMIT-LIST := [ LIMIT-LIST ] limit LIMIT\n");
+	fprintf(stderr, "LIMIT := { time-soft | time-hard | time-use-soft | time-use-hard } SECONDS |\n");
+	fprintf(stderr, "         { byte-soft | byte-hard } SIZE | { packet-soft | packet-hard } COUNT\n");
+	fprintf(stderr, "TMPL-LIST := [ TMPL-LIST ] tmpl TMPL\n");
 	fprintf(stderr, "TMPL := ID [ mode MODE ] [ reqid REQID ] [ level LEVEL ]\n");
-	fprintf(stderr, "ID := [ src ADDR ] [ dst ADDR ] [ proto XFRM_PROTO ] [ spi SPI ]\n");
-
-	fprintf(stderr, "XFRM_PROTO := [ ");
+	fprintf(stderr, "ID := [ src ADDR ] [ dst ADDR ] [ proto XFRM-PROTO ] [ spi SPI ]\n");
+	fprintf(stderr, "XFRM-PROTO := ");
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_ESP));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_AH));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_COMP));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_ROUTING));
-	fprintf(stderr, "%s ", strxf_xfrmproto(IPPROTO_DSTOPTS));
-	fprintf(stderr, "]\n");
-
- 	fprintf(stderr, "MODE := [ transport | tunnel | beet ](default=transport)\n");
- 	//fprintf(stderr, "REQID - number(default=0)\n");
-	fprintf(stderr, "LEVEL := [ required | use ](default=required)\n");
+	fprintf(stderr, "%s\n", strxf_xfrmproto(IPPROTO_DSTOPTS));
+ 	fprintf(stderr, "MODE := transport | tunnel | ro | in_trigger | beet\n");
+	fprintf(stderr, "LEVEL := required | use\n");
 
 	exit(-1);
 }
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 8ac3437..a76be47 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -56,63 +56,57 @@ static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: ip xfrm state { add | update } ID [ XFRM_OPT ] [ ctx SEC_CTX ] [ mode MODE ]\n");
-	fprintf(stderr, "        [ reqid REQID ] [ seq SEQ ] [ replay-window SIZE ] [ flag FLAG-LIST ]\n");
-	fprintf(stderr, "        [ encap ENCAP ] [ sel SELECTOR ] [ replay-seq SEQ ]\n");
-	fprintf(stderr, "        [ replay-oseq SEQ ] [ LIMIT-LIST ]\n");
-	fprintf(stderr, "Usage: ip xfrm state allocspi ID [ mode MODE ] [ reqid REQID ] [ seq SEQ ]\n");
-	fprintf(stderr, "        [ min SPI max SPI ]\n");
-	fprintf(stderr, "Usage: ip xfrm state { delete | get } ID\n");
+	fprintf(stderr, "Usage: ip xfrm state { add | update } ID [ ALGO-LIST ] [ mode MODE ]\n");
+	fprintf(stderr, "        [ mark MARK [ mask MASK ] ] [ reqid REQID ] [ seq SEQ ]\n");
+	fprintf(stderr, "        [ replay-window SIZE ] [ replay-seq SEQ ] [ replay-oseq SEQ ]\n");
+	fprintf(stderr, "        [ flag FLAG-LIST ] [ sel SELECTOR ] [ LIMIT-LIST ] [ encap ENCAP ]\n");
+	fprintf(stderr, "        [ coa ADDR[/PLEN] ] [ ctx CTX ]\n");
+	fprintf(stderr, "Usage: ip xfrm state allocspi ID [ mode MODE ] [ mark MARK [ mask MASK ] ]\n");
+	fprintf(stderr, "        [ reqid REQID ] [ seq SEQ ] [ min SPI max SPI ]\n");
+	fprintf(stderr, "Usage: ip xfrm state { delete | get } ID [ mark MARK [ mask MASK ] ]\n");
 	fprintf(stderr, "Usage: ip xfrm state { deleteall | list } [ ID ] [ mode MODE ] [ reqid REQID ]\n");
 	fprintf(stderr, "        [ flag FLAG-LIST ]\n");
-	fprintf(stderr, "Usage: ip xfrm state flush [ proto XFRM_PROTO ]\n");
-	fprintf(stderr, "Usage: ip xfrm state count \n");
-
-	fprintf(stderr, "ID := [ src ADDR ] [ dst ADDR ] [ proto XFRM_PROTO ] [ spi SPI ] [mark MARK [mask MASK]]\n");
-	//fprintf(stderr, "XFRM_PROTO := [ esp | ah | comp ]\n");
-	fprintf(stderr, "XFRM_PROTO := [ ");
+	fprintf(stderr, "Usage: ip xfrm state flush [ proto XFRM-PROTO ]\n");
+	fprintf(stderr, "Usage: ip xfrm state count\n");
+	fprintf(stderr, "ID := [ src ADDR ] [ dst ADDR ] [ proto XFRM-PROTO ] [ spi SPI ]\n");
+	fprintf(stderr, "XFRM-PROTO := ");
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_ESP));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_AH));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_COMP));
 	fprintf(stderr, "%s | ", strxf_xfrmproto(IPPROTO_ROUTING));
-	fprintf(stderr, "%s ", strxf_xfrmproto(IPPROTO_DSTOPTS));
-	fprintf(stderr, "]\n");
-
-	//fprintf(stderr, "SPI - security parameter index(default=0)\n");
-
- 	fprintf(stderr, "MODE := [ transport | tunnel | ro | beet ](default=transport)\n");
- 	//fprintf(stderr, "REQID - number(default=0)\n");
-
-	fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n");
-	fprintf(stderr, "FLAG := [ noecn | decap-dscp | nopmtudisc | wildrecv | icmp | af-unspec | align4 ]\n");
-
-        fprintf(stderr, "ENCAP := ENCAP-TYPE SPORT DPORT OADDR\n");
-        fprintf(stderr, "ENCAP-TYPE := espinudp | espinudp-nonike\n");
-
-	fprintf(stderr, "ALGO-LIST := [ ALGO-LIST ] | [ ALGO ]\n");
-	fprintf(stderr, "ALGO := ALGO_TYPE ALGO_NAME ALGO_KEY "
-			"[ ALGO_ICV_LEN | ALGO_TRUNC_LEN ]\n");
-	fprintf(stderr, "ALGO_TYPE := [ ");
-	fprintf(stderr, "%s | ", strxf_algotype(XFRMA_ALG_AEAD));
+	fprintf(stderr, "%s\n", strxf_xfrmproto(IPPROTO_DSTOPTS));
+	fprintf(stderr, "ALGO-LIST := [ ALGO-LIST ] ALGO\n");
+	fprintf(stderr, "ALGO := { ");
 	fprintf(stderr, "%s | ", strxf_algotype(XFRMA_ALG_CRYPT));
 	fprintf(stderr, "%s | ", strxf_algotype(XFRMA_ALG_AUTH));
-	fprintf(stderr, "%s | ", strxf_algotype(XFRMA_ALG_AUTH_TRUNC));
-	fprintf(stderr, "%s ", strxf_algotype(XFRMA_ALG_COMP));
-	fprintf(stderr, "]\n");
-
-	//fprintf(stderr, "ALGO_NAME - algorithm name\n");
-	//fprintf(stderr, "ALGO_KEY - algorithm key\n");
-
-	fprintf(stderr, "SELECTOR := src ADDR[/PLEN] dst ADDR[/PLEN] [ UPSPEC ] [ dev DEV ]\n");
-
-	fprintf(stderr, "UPSPEC := proto PROTO [ [ sport PORT ] [ dport PORT ] |\n");
-	fprintf(stderr, "                        [ type NUMBER ] [ code NUMBER ] ]\n");
-
+	fprintf(stderr, "%s", strxf_algotype(XFRMA_ALG_COMP));
+	fprintf(stderr, " } ALGO-NAME ALGO-KEY |\n");
+	fprintf(stderr, "        %s", strxf_algotype(XFRMA_ALG_AEAD));
+	fprintf(stderr, " ALGO-NAME ALGO-KEY ALGO-ICV-LEN |\n");
+	fprintf(stderr, "        %s", strxf_algotype(XFRMA_ALG_AUTH_TRUNC));
+	fprintf(stderr, " ALGO-NAME ALGO-KEY ALGO-TRUNC-LEN\n");
+ 	fprintf(stderr, "MODE := transport | tunnel | ro | in_trigger | beet\n");
+	fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n");
+	fprintf(stderr, "FLAG := noecn | decap-dscp | nopmtudisc | wildrecv | icmp | af-unspec | align4\n");
+	fprintf(stderr, "SELECTOR := [ src ADDR[/PLEN] ] [ dst ADDR[/PLEN] ] [ dev DEV ] [ UPSPEC ]\n");
+	fprintf(stderr, "UPSPEC := proto { { ");
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_TCP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_UDP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_SCTP));
+	fprintf(stderr, "%s", strxf_proto(IPPROTO_DCCP));
+	fprintf(stderr, " } [ sport PORT ] [ dport PORT ] |\n");
+	fprintf(stderr, "                  { ");
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_ICMP));
+	fprintf(stderr, "%s | ", strxf_proto(IPPROTO_ICMPV6));
+	fprintf(stderr, "%s", strxf_proto(IPPROTO_MH));
+	fprintf(stderr, " } [ type NUMBER ] [ code NUMBER ] |\n");
+	fprintf(stderr, "                  %s", strxf_proto(IPPROTO_GRE));
+	fprintf(stderr, " [ key { DOTTED-QUAD | NUMBER } ] | PROTO }\n");
+	fprintf(stderr, "LIMIT-LIST := [ LIMIT-LIST ] limit LIMIT\n");
+	fprintf(stderr, "LIMIT := { time-soft | time-hard | time-use-soft | time-use-hard } SECONDS |\n");
+	fprintf(stderr, "         { byte-soft | byte-hard } SIZE | { packet-soft | packet-hard } COUNT\n");
+        fprintf(stderr, "ENCAP := { espinudp | espinudp-nonike } SPORT DPORT OADDR\n");
 
-	//fprintf(stderr, "DEV - device name(default=none)\n");
-	fprintf(stderr, "LIMIT-LIST := [ LIMIT-LIST ] | [ limit LIMIT ]\n");
-	fprintf(stderr, "LIMIT := [ [time-soft|time-hard|time-use-soft|time-use-hard] SECONDS ] |\n");
-	fprintf(stderr, "         [ [byte-soft|byte-hard] SIZE ] | [ [packet-soft|packet-hard] COUNT ]\n");
 	exit(-1);
 }
 
@@ -124,7 +118,7 @@ static int xfrm_algo_parse(struct xfrm_algo *alg, enum xfrm_attr_type_t type,
 
 #if 0
 	/* XXX: verifying both name and key is required! */
-	fprintf(stderr, "warning: ALGONAME/ALGOKEY will send to kernel promiscuously!(verifying them isn't implemented yet)\n");
+	fprintf(stderr, "warning: ALGO-NAME/ALGO-KEY will send to kernel promiscuously! (verifying them isn't implemented yet)\n");
 #endif
 
 	strncpy(alg->alg_name, name, sizeof(alg->alg_name));
@@ -144,7 +138,7 @@ static int xfrm_algo_parse(struct xfrm_algo *alg, enum xfrm_attr_type_t type,
 		/* calculate length of the converted values(real key) */
 		len = (plen + 1) / 2;
 		if (len > max)
-			invarg("\"ALGOKEY\" makes buffer overflow\n", key);
+			invarg("\"ALGO-KEY\" makes buffer overflow\n", key);
 
 		for (i = - (plen % 2), j = 0; j < len; i += 2, j++) {
 			char vbuf[3];
@@ -155,7 +149,7 @@ static int xfrm_algo_parse(struct xfrm_algo *alg, enum xfrm_attr_type_t type,
 			vbuf[2] = '\0';
 
 			if (get_u8(&val, vbuf, 16))
-				invarg("\"ALGOKEY\" is invalid", key);
+				invarg("\"ALGO-KEY\" is invalid", key);
 
 			buf[j] = val;
 		}
@@ -163,7 +157,7 @@ static int xfrm_algo_parse(struct xfrm_algo *alg, enum xfrm_attr_type_t type,
 		len = slen;
 		if (len > 0) {
 			if (len > max)
-				invarg("\"ALGOKEY\" makes buffer overflow\n", key);
+				invarg("\"ALGO-KEY\" makes buffer overflow\n", key);
 
 			strncpy(buf, key, len);
 		}
@@ -384,37 +378,37 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 				switch (type) {
 				case XFRMA_ALG_AEAD:
 					if (aeadop)
-						duparg("ALGOTYPE", *argv);
+						duparg("ALGO-TYPE", *argv);
 					aeadop = *argv;
 					break;
 				case XFRMA_ALG_CRYPT:
 					if (ealgop)
-						duparg("ALGOTYPE", *argv);
+						duparg("ALGO-TYPE", *argv);
 					ealgop = *argv;
 					break;
 				case XFRMA_ALG_AUTH:
 				case XFRMA_ALG_AUTH_TRUNC:
 					if (aalgop)
-						duparg("ALGOTYPE", *argv);
+						duparg("ALGO-TYPE", *argv);
 					aalgop = *argv;
 					break;
 				case XFRMA_ALG_COMP:
 					if (calgop)
-						duparg("ALGOTYPE", *argv);
+						duparg("ALGO-TYPE", *argv);
 					calgop = *argv;
 					break;
 				default:
 					/* not reached */
-					invarg("\"ALGOTYPE\" is invalid\n", *argv);
+					invarg("\"ALGO-TYPE\" is invalid\n", *argv);
 				}
 
 				if (!NEXT_ARG_OK())
-					missarg("ALGONAME");
+					missarg("ALGO-NAME");
 				NEXT_ARG();
 				name = *argv;
 
 				if (!NEXT_ARG_OK())
-					missarg("ALGOKEY");
+					missarg("ALGO-KEY");
 				NEXT_ARG();
 				key = *argv;
 
@@ -424,7 +418,7 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 				switch (type) {
 				case XFRMA_ALG_AEAD:
 					if (!NEXT_ARG_OK())
-						missarg("ALGOICVLEN");
+						missarg("ALGO-ICV-LEN");
 					NEXT_ARG();
 					if (get_u32(&icvlen, *argv, 0))
 						invarg("\"aead\" ICV length is invalid",
@@ -436,7 +430,7 @@ static int xfrm_state_modify(int cmd, unsigned flags, int argc, char **argv)
 					break;
 				case XFRMA_ALG_AUTH_TRUNC:
 					if (!NEXT_ARG_OK())
-						missarg("ALGOTRUNCLEN");
+						missarg("ALGO-TRUNC-LEN");
 					NEXT_ARG();
 					if (get_u32(&trunclen, *argv, 0))
 						invarg("\"auth\" trunc length is invalid",
@@ -649,7 +643,7 @@ static int xfrm_state_allocspi(int argc, char **argv)
 			exit(1);
 		}
 		if (req.xspi.min > req.xspi.max) {
-			fprintf(stderr, "\"min\" valie is larger than \"max\" one\n");
+			fprintf(stderr, "\"min\" value is larger than \"max\" value\n");
 			exit(1);
 		}
 	} else {
@@ -1164,7 +1158,7 @@ static int xfrm_state_flush(int argc, char **argv)
 
 			ret = xfrm_xfrmproto_getbyname(*argv);
 			if (ret < 0)
-				invarg("\"XFRM_PROTO\" is invalid", *argv);
+				invarg("\"XFRM-PROTO\" is invalid", *argv);
 
 			req.xsf.proto = (__u8)ret;
 		} else
diff --git a/man/man8/ip.8 b/man/man8/ip.8
index c5248ef..4ddc78c 100644
--- a/man/man8/ip.8
+++ b/man/man8/ip.8
@@ -421,318 +421,348 @@ throw " | " unreachable " | " prohibit " | " blackhole " | " nat " ]"
 .ti -8
 .BR "ip monitor" " [ " all " |"
 .IR LISTofOBJECTS " ]"
+.sp
 
 .ti -8
-.BR "ip xfrm"
-.IR XFRM_OBJECT " { " COMMAND " }"
+.B "ip xfrm"
+.IR XFRM-OBJECT " { " COMMAND " | "
+.BR help " }"
+.sp
 
 .ti -8
-.IR XFRM_OBJECT " := { " state " | " policy " | " monitor " } "
+.IR XFRM-OBJECT " :="
+.BR state " | " policy " | " monitor
+.sp
 
 .ti -8
 .BR "ip xfrm state " { " add " | " update " } "
-.IR ID " [ "
-.IR XFRM_OPT " ] "
-.RB " [ " mode
-.IR MODE " ] "
-.br
-.RB " [ " reqid
-.IR REQID " ] "
-.RB " [ " seq
-.IR SEQ " ] "
-.RB " [ " replay-window
-.IR SIZE " ] "
-.br
-.RB " [ " flag
-.IR FLAG-LIST " ] "
-.RB " [ " encap
-.IR ENCAP " ] "
-.RB " [ " sel
-.IR SELECTOR " ] "
-.br
-.RB " [ "
-.IR LIMIT-LIST " ] "
-
-.ti -8
-.BR "ip xfrm state allocspi "
-.IR ID
-.RB " [ " mode
-.IR MODE " ] "
-.RB " [ " reqid
-.IR REQID " ] "
-.RB " [ " seq
-.IR SEQ " ] "
-.RB " [ " min
-.IR SPI
+.IR ID " [ " ALGO-LIST " ]"
+.RB "[ " mode
+.IR MODE " ]"
+.RB "[ " mark
+.I MARK
+.RB "[ " mask
+.IR MASK " ] ]"
+.RB "[ " reqid
+.IR REQID " ]"
+.RB "[ " seq
+.IR SEQ " ]"
+.RB "[ " replay-window
+.IR SIZE " ]"
+.RB "[ " replay-seq
+.IR SEQ " ]"
+.RB "[ " replay-oseq
+.IR SEQ " ]"
+.RB "[ " flag
+.IR FLAG-LIST " ]"
+.RB "[ " sel
+.IR SELECTOR " ] [ " LIMIT-LIST " ]"
+.RB "[ " encap
+.IR ENCAP " ]"
+.RB "[ " coa
+.IR ADDR "[/" PLEN "] ]"
+.RB "[ " ctx
+.IR CTX " ]"
+
+.ti -8
+.B "ip xfrm state allocspi"
+.I ID
+.RB "[ " mode
+.IR MODE " ]"
+.RB "[ " mark
+.I MARK
+.RB "[ " mask
+.IR MASK " ] ]"
+.RB "[ " reqid
+.IR REQID " ]"
+.RB "[ " seq
+.IR SEQ " ]"
+.RB "[ " min
+.I SPI
 .B max
-.IR SPI " ] "
+.IR SPI " ]"
 
 .ti -8
 .BR "ip xfrm state" " { " delete " | " get " } "
-.IR ID
+.I ID
+.RB "[ " mark
+.I MARK
+.RB "[ " mask
+.IR MASK " ] ]"
 
 .ti -8
-.BR "ip xfrm state" " { " deleteall " | " list " } [ "
-.IR ID " ] "
-.RB " [ " mode
-.IR MODE " ] "
-.br
-.RB " [ " reqid
-.IR REQID " ] "
-.RB " [ " flag
-.IR FLAG_LIST " ] "
+.BR "ip xfrm state" " { " deleteall " | " list " } ["
+.IR ID " ]"
+.RB "[ " mode
+.IR MODE " ]"
+.RB "[ " reqid
+.IR REQID " ]"
+.RB "[ " flag
+.IR FLAG-LIST " ]"
 
 .ti -8
 .BR "ip xfrm state flush" " [ " proto
-.IR XFRM_PROTO " ] "
+.IR XFRM-PROTO " ]"
 
 .ti -8
 .BR "ip xfrm state count"
 
 .ti -8
-.IR ID " := "
-.RB " [ " src
-.IR ADDR " ] "
-.RB " [ " dst
-.IR ADDR " ] "
-.RB " [ " proto
-.IR XFRM_PROTO " ] "
-.RB " [ " spi
-.IR SPI " ] "
-
-.ti -8
-.IR XFRM_PROTO " := "
-.RB " [ " esp " | " ah " | " comp " | " route2 " | " hao " ] "
-
-.ti -8
-.IR MODE " := "
-.RB " [ " transport " | " tunnel " | " ro " | " beet " ] "
-.B (default=transport)
+.IR ID " :="
+.RB "[ " src
+.IR ADDR " ]"
+.RB "[ " dst
+.IR ADDR " ]"
+.RB "[ " proto
+.IR XFRM-PROTO " ]"
+.RB "[ " spi
+.IR SPI " ]"
 
 .ti -8
-.IR FLAG-LIST " := "
-.RI " [ " FLAG-LIST " ] " FLAG
+.IR XFRM-PROTO " :="
+.BR esp " | " ah " | " comp " | " route2 " | " hao
 
 .ti -8
-.IR FLAG " := "
-.RB " [ " noecn " | " decap-dscp " | " wildrecv " ] "
+.IR ALGO-LIST " := [ " ALGO-LIST " ] " ALGO
 
 .ti -8
-.IR ENCAP " := " ENCAP-TYPE " " SPORT " " DPORT " " OADDR
-
-.ti -8
-.IR ENCAP-TYPE " := "
-.B espinudp
-.RB " | "
-.B espinudp-nonike
+.IR ALGO " :="
+.RB "{ " enc " | " auth " | " comp " } " 
+.IR ALGO-NAME " " ALGO-KEY
+.R "|"
+.br
+.B aead
+.IR ALGO-NAME " " ALGO-KEY " " ALGO-ICV-LEN
+.R "|"
+.br
+.B auth-trunc
+.IR ALGO-NAME " " ALGO-KEY " " ALGO-TRUNC-LEN
 
 .ti -8
-.IR ALGO-LIST " := [ "
-.IR ALGO-LIST " ] | [ "
-.IR ALGO " ] "
+.IR MODE " := "
+.BR transport " | " tunnel " | " ro " | " in_trigger " | " beet
 
 .ti -8
-.IR ALGO " := "
-.IR ALGO_TYPE
-.IR ALGO_NAME
-.IR ALGO_KEY
+.IR FLAG-LIST " := [ " FLAG-LIST " ] " FLAG
 
 .ti -8
-.IR ALGO_TYPE " := "
-.RB " [ " enc " | " auth " | " comp " ] "
+.IR FLAG " :="
+.BR noecn " | " decap-dscp " | " nopmtudisc " | " wildrecv " | " icmp " | " af-unspec " | " align4
 
 .ti -8
-.IR SELECTOR " := "
-.B src
-.IR ADDR "[/" PLEN "]"
-.B dst
-.IR ADDR "[/" PLEN "]"
-.RI " [ " UPSPEC " ] "
-.RB " [ " dev
-.IR DEV " ] "
+.IR SELECTOR " :="
+.RB "[ " src
+.IR ADDR "[/" PLEN "] ]"
+.RB "[ " dst
+.IR ADDR "[/" PLEN "] ]"
+.RB "[ " dev
+.IR DEV " ]"
+.br
+.RI "[ " UPSPEC " ]"
 
 .ti -8
 .IR UPSPEC " := "
-.B proto
-.IR PROTO " [[ "
-.B sport
-.IR PORT " ] "
-.RB " [ " dport
-.IR PORT " ] | "
+.BR proto " {"
+.IR PROTO " |"
+.br
+.RB "{ " tcp " | " udp " | " sctp " | " dccp " } [ " sport
+.IR PORT " ]"
+.RB "[ " dport
+.IR PORT " ] |"
 .br
-.RB " [ " type
-.IR NUMBER " ] "
-.RB " [ " code
-.IR NUMBER " ] | "
+.RB "{ " icmp " | " ipv6-icmp " | " mobility-header " } [ " type
+.IR NUMBER " ]"
+.RB "[ " code
+.IR NUMBER " ] |"
 .br
-.RB " [ " key
-.IR KEY " ]] "
+.BR gre " [ " key
+.RI "{ " DOTTED-QUAD " | " NUMBER " } ] }"
 
 .ti -8
-.IR LIMIT-LIST " := [ " LIMIT-LIST " ] |"
-.RB " [ "limit
-.IR LIMIT " ] "
+.IR LIMIT-LIST " := [ " LIMIT-LIST " ]"
+.B limit
+.I LIMIT
 
 .ti -8
-.IR LIMIT " := "
-.RB " [ [" time-soft "|" time-hard "|" time-use-soft "|" time-use-hard "]"
-.IR SECONDS " ] | "
-.RB "[ ["byte-soft "|" byte-hard "]"
-.IR SIZE " ] | "
+.IR LIMIT " :="
+.RB "{ " time-soft " | " time-hard " | " time-use-soft " | " time-use-hard " }"
+.IR "SECONDS" " |"
 .br
-.RB " [ ["packet-soft "|" packet-hard "]"
-.IR COUNT " ] "
-
-.ti -8
-.BR "ip xfrm policy" " { " add " | " update " } " " dir "
-.IR DIR
-.IR SELECTOR " [ "
-.BR index
-.IR INDEX " ] "
+.RB "{ " byte-soft " | " byte-hard " }"
+.IR SIZE " |"
 .br
-.RB " [ " ptype
-.IR PTYPE " ] "
-.RB " [ " action
-.IR ACTION " ] "
-.RB " [ " priority
-.IR PRIORITY " ] "
-.br
-.RI " [ " LIMIT-LIST " ] [ "
-.IR TMPL-LIST " ] "
+.RB "{ " packet-soft " | " packet-hard " }"
+.I COUNT
 
 .ti -8
-.BR "ip xfrm policy" " { " delete " | " get " } " " dir "
-.IR DIR " [ " SELECTOR " | "
-.BR index
-.IR INDEX
-.RB " ] "
-.br
-.RB " [ " ptype
-.IR PTYPE " ] "
+.IR ENCAP " :="
+.RB "{ " espinudp " | " espinudp-nonike " }"
+.IR SPORT " " DPORT " " OADDR
 
 .ti -8
-.BR "ip xfrm policy" " { " deleteall " | " list " } "
-.RB " [ " dir
-.IR DIR " ] [ "
-.IR SELECTOR " ] "
-.br
-.RB " [ " index
-.IR INDEX " ] "
-.RB " [ " action
-.IR ACTION " ] "
-.RB " [ " priority
-.IR PRIORITY " ] "
+.BR "ip xfrm policy" " { " add " | " update " }"
+.I SELECTOR
+.B dir
+.I DIR
+.RB "[ " ctx
+.IR CTX " ]"
+.RB "[ " mark
+.I MARK
+.RB "[ " mask
+.IR MASK " ] ]"
+.RB "[ " index
+.IR INDEX " ]"
+.RB "[ " ptype
+.IR PTYPE " ]"
+.RB "[ " action
+.IR ACTION " ]"
+.RB "[ " priority
+.IR PRIORITY " ]"
+.RB "[ " flag
+.IR FLAG-LIST " ]"
+.RI "[ " LIMIT-LIST " ] [ " TMPL-LIST " ]"
+
+.ti -8
+.BR "ip xfrm policy" " { " delete " | " get " }"
+.RI "{ " SELECTOR " | "
+.B index
+.IR INDEX " }"
+.B dir
+.I DIR
+.RB "[ " ctx
+.IR CTX " ]"
+.RB "[ " mark
+.I MARK
+.RB "[ " mask
+.IR MASK " ] ]"
+.RB "[ " ptype
+.IR PTYPE " ]"
+
+.ti -8
+.BR "ip xfrm policy" " { " deleteall " | " list " }"
+.RI "[ " SELECTOR " ]"
+.RB "[ " dir
+.IR DIR " ]"
+.RB "[ " index
+.IR INDEX " ]"
+.RB "[ " ptype
+.IR PTYPE " ]"
+.RB "[ " action
+.IR ACTION " ]"
+.RB "[ " priority
+.IR PRIORITY " ]"
 
 .ti -8
 .B "ip xfrm policy flush"
-.RB " [ " ptype
-.IR PTYPE " ] "
+.RB "[ " ptype
+.IR PTYPE " ]"
 
 .ti -8
-.B "ip xfrm count"
+.B "ip xfrm policy count"
 
 .ti -8
-.IR PTYPE " := "
-.RB " [ " main " | " sub " ] "
-.B (default=main)
+.IR SELECTOR " :="
+.RB "[ " src
+.IR ADDR "[/" PLEN "] ]"
+.RB "[ " dst
+.IR ADDR "[/" PLEN "] ]"
+.RB "[ " dev
+.IR DEV " ]"
+.RI "[ " UPSPEC " ]"
 
 .ti -8
-.IR DIR " := "
-.RB " [ " in " | " out " | " fwd " ] "
+.IR UPSPEC " := "
+.BR proto " {"
+.IR PROTO " |"
+.br
+.RB "{ " tcp " | " udp " | " sctp " | " dccp " } [ " sport
+.IR PORT " ]"
+.RB "[ " dport
+.IR PORT " ] |"
+.br
+.RB "{ " icmp " | " ipv6-icmp " | " mobility-header " } [ " type
+.IR NUMBER " ]"
+.RB "[ " code
+.IR NUMBER " ] |"
+.br
+.BR gre " [ " key
+.RI "{ " DOTTED-QUAD " | " NUMBER " } ] }"
 
 .ti -8
-.IR SELECTOR " := "
-.B src
-.IR ADDR "[/" PLEN "]"
-.B dst
-.IR ADDR "[/" PLEN] " [ " UPSPEC
-.RB " ] [ " dev
-.IR DEV " ] "
+.IR DIR " := "
+.BR in " | " out " | " fwd
 
 .ti -8
-.IR UPSPEC " := "
-.B proto
-.IR PROTO " [ "
-.RB " [ " sport
-.IR PORT " ] "
-.RB " [ " dport
-.IR PORT " ] | "
-.br
-.RB " [ " type
-.IR NUMBER " ] "
-.RB " [ " code
-.IR NUMBER " ] | "
-.br
-.RB " [ " key
-.IR KEY " ] ] "
+.IR PTYPE " := "
+.BR main " | " sub
 
 .ti -8
 .IR ACTION " := "
-.RB " [ " allow " | " block " ]"
-.B (default=allow)
+.BR allow " | " block
 
 .ti -8
-.IR LIMIT-LIST " := "
-.RB " [ "
-.IR LIMIT-LIST " ] | "
-.RB " [ " limit
-.IR LIMIT " ] "
+.IR FLAG-LIST " := [ " FLAG-LIST " ] " FLAG
 
 .ti -8
-.IR LIMIT " := "
-.RB " [ [" time-soft "|" time-hard "|" time-use-soft "|" time-use-hard "]"
-.IR SECONDS " ] | "
-.RB " [ [" byte-soft "|" byte-hard "]"
-.IR SIZE " ] | "
-.br [ "
-.RB "[" packet-soft "|" packet-hard "]"
-.IR NUMBER " ] "
+.IR FLAG " :="
+.BR localok " | " icmp
 
 .ti -8
-.IR TMPL-LIST " := "
-.B " [ "
-.IR TMPL-LIST " ] | "
-.RB " [ " tmpl
-.IR TMPL " ] "
+.IR LIMIT-LIST " := [ " LIMIT-LIST " ]"
+.B limit
+.I LIMIT
 
 .ti -8
-.IR TMPL " := "
-.IR ID " [ "
-.B mode
-.IR MODE " ] "
-.RB " [ " reqid
-.IR REQID " ] "
-.RB " [ " level
-.IR LEVEL " ] "
+.IR LIMIT " :="
+.RB "{ " time-soft " | " time-hard " | " time-use-soft " | " time-use-hard " }"
+.IR "SECONDS" " |"
+.br
+.RB "{ " byte-soft " | " byte-hard " }"
+.IR SIZE " |"
+.br
+.RB "{ " packet-soft " | " packet-hard " }"
+.I COUNT
 
 .ti -8
-.IR ID " := "
-.RB " [ " src
-.IR ADDR " ] "
-.RB " [ " dst
-.IR ADDR " ] "
-.RB " [ " proto
-.IR XFRM_PROTO " ] "
-.RB " [ " spi
-.IR SPI " ] "
+.IR TMPL-LIST " := [ " TMPL-LIST " ]"
+.B tmpl
+.I TMPL
 
 .ti -8
-.IR XFRM_PROTO " := "
-.RB " [ " esp " | " ah " | " comp " | " route2 " | " hao " ] "
+.IR TMPL " := " ID
+.RB "[ " mode
+.IR MODE " ]"
+.RB "[ " reqid
+.IR REQID " ]"
+.RB "[ " level
+.IR LEVEL " ]"
+
+.ti -8
+.IR ID " :="
+.RB "[ " src
+.IR ADDR " ]"
+.RB "[ " dst
+.IR ADDR " ]"
+.RB "[ " proto
+.IR XFRM-PROTO " ]"
+.RB "[ " spi
+.IR SPI " ]"
+
+.ti -8
+.IR XFRM-PROTO " :="
+.BR esp " | " ah " | " comp " | " route2 " | " hao
 
 .ti -8
 .IR MODE " := "
-.RB " [ " transport " | " tunnel " | " beet " ] "
-.B (default=transport)
+.BR transport " | " tunnel " | " ro " | " in_trigger " | " beet
 
 .ti -8
-.IR LEVEL " := "
-.RB " [ " required " | " use " ] "
-.B (default=required)
+.IR LEVEL " :="
+.BR required " | " use
 
 .ti -8
-.BR "ip xfrm monitor" " [ " all " | "
-.IR LISTofOBJECTS " ] "
+.BR "ip xfrm monitor" " [ " all " |"
+.IR LISTofXFRM-OBJECTS " ]"
 
 .in -8
 .ad b
@@ -849,10 +879,6 @@ host addresses.
 .B tunnel
 - tunnel over IP.
 
-.TP
-.B xfrm
-- framework for IPsec protocol.
-
 .PP
 The names of all objects may be written in full or
 abbreviated form, f.e.
@@ -2470,169 +2496,226 @@ at any time.
 It prepends the history with the state snapshot dumped at the moment
 of starting.
 
-.SH ip xfrm - setting xfrm
-xfrm is an IP framework, which can transform format of the datagrams,
-.br
-i.e. encrypt the packets with some algorithm. xfrm policy and xfrm state
-are associated through templates
-.IR TMPL_LIST "."
-This framework is used as a part of IPsec protocol.
+.SH ip xfrm - transform configuration
+xfrm is an IP framework for transforming packets (such as encrypting
+their payloads). This framework is used to implement the IPsec protocol
+suite (with the
+.B state
+object operating on the Security Association Database, and the
+.B policy
+object operating on the Security Policy Database). It is also used for
+the IP Payload Compression Protocol and features of Mobile IPv6.
 
 .SS ip xfrm state add - add new state into xfrm
 
-.SS ip xfrm state update - update existing xfrm state
+.SS ip xfrm state update - update existing state in xfrm
+
+.SS ip xfrm state allocspi - allocate an SPI value
+
+.SS ip xfrm state delete - delete existing state in xfrm
+
+.SS ip xfrm state get - get existing state in xfrm
 
-.SS ip xfrm state allocspi - allocate SPI value
+.SS ip xfrm state deleteall - delete all existing state in xfrm
+
+.SS ip xfrm state list - print out the list of existing state in xfrm
+
+.SS ip xfrm state flush - flush all state in xfrm
+
+.SS ip xfrm state count - count all existing state in xfrm
+
+.TP
+.IR ID
+is specified by a source address, destination address,
+.RI "transform protocol " XFRM-PROTO ","
+and/or Security Parameter Index
+.IR SPI "."
+
+.TP
+.I XFRM-PROTO
+specifies a transform protocol:
+.RB "IPsec Encapsulating Security Payload (" esp "),"
+.RB "IPsec Authentication Header (" ah "),"
+.RB "IP Payload Compression (" comp "),"
+.RB "Mobile IPv6 Type 2 Routing Header (" route2 "), or"
+.RB "Mobile IPv6 Home Address Option (" hao ")."
+
+.TP
+.I ALGO-LIST
+specifies one or more algorithms
+.IR ALGO
+to use. Algorithm types include
+.RB "encryption (" enc "),"
+.RB "authentication (" auth "),"
+.RB "authentication with a specified truncation length (" auth-trunc "),"
+.RB "authenticated encryption with associated data (" aead "), and"
+.RB "compression (" comp ")."
+For each algorithm used, the algorithm type, the algorithm name
+.IR ALGO-NAME ","
+and the key
+.I ALGO-KEY
+must be specified. For
+.BR aead ","
+the Integrity Check Value length
+.I ALGO-ICV-LEN
+must additionally be specified.
+For
+.BR auth-trunc ","
+the signature truncation length
+.I ALGO-TRUNC-LEN
+must additionally be specified.
 
 .TP
 .I MODE
-is set as default to
-.BR transport ","
-but it could be set to
-.BR tunnel "," ro " or " beet "."
+specifies a mode of operation:
+.RB "IPsec transport mode (" transport "), "
+.RB "IPsec tunnel mode (" tunnel "), "
+.RB "Mobile IPv6 route optimization mode (" ro "), "
+.RB "Mobile IPv6 inbound trigger mode (" in_trigger "), or "
+.RB "IPsec ESP Bound End-to-End Tunnel Mode (" beet ")."
 
 .TP
 .I FLAG-LIST
-contains one or more flags.
+contains one or more of the following optional flags:
+.BR noecn ", " decap-dscp ", " nopmtudisc ", " wildrecv ", " icmp ", "
+.BR af-unspec ", or " align4 "."
 
 .TP
-.I FLAG
-could be set to
-.BR noecn ", " decap-dscp " or " wildrecv "."
+.IR SELECTOR
+selects the traffic that will be controlled by the policy, based on the source
+address, the destination address, the network device, and/or
+.IR UPSPEC "."
 
 .TP
-.I ENCAP
-encapsulation is set to encapsulation type
-.IR ENCAP-TYPE ", source port " SPORT ", destination port "  DPORT " and " OADDR "."
+.IR UPSPEC
+selects traffic by protocol. For the
+.BR tcp ", " udp ", " sctp ", or " dccp
+protocols, the source and destination port can optionally be specified.
+For the
+.BR icmp ", " ipv6-icmp ", or " mobility-header
+protocols, the type and code numbers can optionally be specified.
+For the
+.B gre
+protocol, the key can optionally be specified as a dotted-quad or number.
+Other protocols can be selected by name or number
+.IR PROTO "."
 
 .TP
-.I ENCAP-TYPE
-could be set to
-.BR espinudp " or " espinudp-nonike "."
+.I LIMIT-LIST
+sets limits in seconds, bytes, or numbers of packets.
 
 .TP
-.I ALGO-LIST
-contains one or more algorithms
-.I ALGO
-which depend on the type of algorithm set by
-.IR ALGO_TYPE "."
-Valid algorithms are:
-.BR enc ", " auth " or " comp "."
+.I ENCAP
+encapsulates packets with protocol
+.BR espinudp " or " espinudp-nonike ","
+.RI "using source port " SPORT ", destination port "  DPORT
+.RI ", and original address " OADDR "."
 
 .SS ip xfrm policy add - add a new policy
 
 .SS ip xfrm policy update - update an existing policy
 
-.SS ip xfrm policy delete - delete existing policy
+.SS ip xfrm policy delete - delete an existing policy
 
-.SS ip xfrm policy get - get existing policy
+.SS ip xfrm policy get - get an existing policy
 
-.SS ip xfrm policy deleteall - delete all existing xfrm policy
+.SS ip xfrm policy deleteall - delete all existing xfrm policies
 
-.SS ip xfrm policy list - print out the list of xfrm policy
+.SS ip xfrm policy list - print out the list of xfrm policies
 
 .SS ip xfrm policy flush - flush policies
-It can be flush
-.BR all
-policies or only those specified with
-.BR ptype "."
 
-.TP
-.BI dir " DIR "
-directory could be one of these:
-.BR "inp", " out " or " fwd".
+.SS ip xfrm policy count - count existing policies
 
 .TP
 .IR SELECTOR
-selects for which addresses will be set up the policy. The selector
-is defined by source and destination address.
+selects the traffic that will be controlled by the policy, based on the source
+address, the destination address, the network device, and/or
+.IR UPSPEC "."
 
 .TP
 .IR UPSPEC
-is defined by source port
-.BR sport ", "
-destination port
-.BR dport ", " type
-as number,
-.B code
-also number and
-.BR key
-as dotted-quad or number.
+selects traffic by protocol. For the
+.BR tcp ", " udp ", " sctp ", or " dccp
+protocols, the source and destination port can optionally be specified.
+For the
+.BR icmp ", " ipv6-icmp ", or " mobility-header
+protocols, the type and code numbers can optionally be specified.
+For the
+.B gre
+protocol, the key can optionally be specified as a dotted-quad or number.
+Other protocols can be selected by name or number
+.IR PROTO "."
 
 .TP
-.BI dev " DEV "
-specify network device.
+.I DIR
+selects the policy direction as
+.BR in ", " out ", or " fwd "."
 
 .TP
-.BI index " INDEX "
-the number of indexed policy.
+.I CTX
+sets the security context.
 
 .TP
-.BI ptype " PTYPE "
-type is set as default on
-.BR "main" ,
-could be switch on
-.BR "sub" .
+.I PTYPE
+can be
+.BR main " (default) or " sub "."
 
 .TP
-.BI action " ACTION "
-is set as default on
-.BR "allow".
-It could be switch on
-.BR "block".
+.I ACTION
+can be
+.BR allow " (default) or " block "."
 
 .TP
-.BI priority " PRIORITY "
-priority is a number. Default priority is set on zero.
+.I PRIORITY
+is a number that defaults to zero.
 
 .TP
-.IR LIMIT-LIST
-limits are set in seconds, bytes or numbers of packets.
+.I FLAG-LIST
+contains one or both of the following optional flags:
+.BR localok " or " icmp "."
 
 .TP
-.IR TMPL-LIST
-template list is based on
-.IR ID ","
-.BR mode ", " reqid " and " level ". "
+.I LIMIT-LIST
+sets limits in seconds, bytes, or numbers of packets.
 
 .TP
-.IR ID
-is specified by source address, destination address,
-.I proto
-and value of
-.IR spi "."
+.I TMPL-LIST
+is a template list specified using
+.IR ID ", " MODE ", " REQID ", and/or " LEVEL ". "
 
 .TP
-.IR XFRM_PROTO
-values:
-.BR esp ", " ah ", " comp ", " route2 " or " hao "."
+.IR ID
+is specified by a source address, destination address,
+.RI "transform protocol " XFRM-PROTO ","
+and/or Security Parameter Index
+.IR SPI "."
 
 .TP
-.IR MODE
-is set as default on
-.BR transport ","
-but it could be set on
-.BR tunnel " or " beet "."
+.I XFRM-PROTO
+specifies a transform protocol:
+.RB "IPsec Encapsulating Security Payload (" esp "),"
+.RB "IPsec Authentication Header (" ah "),"
+.RB "IP Payload Compression (" comp "),"
+.RB "Mobile IPv6 Type 2 Routing Header (" route2 "), or"
+.RB "Mobile IPv6 Home Address Option (" hao ")."
 
 .TP
-.IR LEVEL
-is set as default on
-.BR required
-and the other choice is
-.BR use "."
+.I MODE
+specifies a mode of operation:
+.RB "IPsec transport mode (" transport "), "
+.RB "IPsec tunnel mode (" tunnel "), "
+.RB "Mobile IPv6 route optimization mode (" ro "), "
+.RB "Mobile IPv6 inbound trigger mode (" in_trigger "), or "
+.RB "IPsec ESP Bound End-to-End Tunnel Mode (" beet ")."
 
 .TP
-.IR UPSPEC
-is specified by
-.BR sport " and " dport " (for UDP/TCP), "
-.BR type " and " code " (for ICMP; as number) or "
-.BR key " (for GRE; as dotted-quad or number)."
-.
+.I LEVEL
+can be
+.BR required " (default) or " use "."
 
-.SS ip xfrm monitor - is used for listing all objects or defined group of them.
-The
-.B xfrm monitor
-can monitor the policies for all objects or defined group of them.
+.SS ip xfrm monitor - state monitoring for xfrm objects
+The xfrm objects to monitor can be optionally specified.
 
 .SH HISTORY
 .B ip
-- 
1.7.1

^ permalink raw reply related

* Re: Question about LRO/GRO and TCP acknowledgements
From: Ben Hutchings @ 2011-06-12  3:43 UTC (permalink / raw)
  To: Joris van Rantwijk; +Cc: netdev
In-Reply-To: <20110611215919.5fc29c27@konijn>

On Sat, 2011-06-11 at 21:59 +0200, Joris van Rantwijk wrote:
> Hi,
> 
> I'm trying to understand how Linux produces TCP acknowledgements
> for segments received via LRO/GRO.
> 
> As far as I can see, the network driver uses GRO to collect several
> received packets into one big super skb, which is then handled
> during just one call to tcp_v4_rcv(). This will eventually result
> in the sending of at most one ACK packet for the entire GRO packet.
> 
> Conventional wisdom (RFC 5681) says that a receiver should send at
> least one ACK for every two data segments received. The sending TCP
> needs these ACKs to update its congestion window (e.g. slow start).
> 
> It seems to me that the current implementation in Linux may send
> just one ACK for a large number of received segments. This would
> be a deviation from the standard. As a result the congestion
> window of the sender would grow much slower than intended.

This was a problem in older versions of Linux (and still is on other
network stacks that aren't aware of LRO).

> Maybe I misunderstand something in the network code (likely).
> Could someone please explain me how this ACK issue is handled?

LRO implementations (and GRO) are expected to put the actual segment
size in skb_shared_info(skb)->gso_size on the aggregated skb.  TCP will
then use that rather than the aggregated payload size when deciding
whether to defer an ACK.
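
A rough userspace model of that decision, in case it helps. This is only a
sketch: struct rx_packet, effective_segment_size() and should_ack_now() are
hypothetical names, not kernel symbols, and the "more than one full segment
outstanding" rule is a simplification of what the stack really checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of an skb that matter here. */
struct rx_packet {
	size_t payload_len;	/* TCP payload carried by the (possibly merged) packet */
	size_t gso_size;	/* per-segment size recorded by LRO/GRO, 0 if not merged */
};

/* Size of one on-the-wire segment, as the receiver should estimate it. */
static size_t effective_segment_size(const struct rx_packet *pkt)
{
	return pkt->gso_size ? pkt->gso_size : pkt->payload_len;
}

/*
 * Simplified rule: stop deferring the ACK once more than one full
 * segment's worth of data is waiting to be acknowledged.
 */
static bool should_ack_now(size_t unacked_bytes, size_t rcv_mss)
{
	assert(rcv_mss > 0);
	return unacked_bytes > rcv_mss;
}
```

With ten 1448-byte segments merged into one 14480-byte packet, the model
sees a segment size of 1448 and acknowledges at once. If gso_size were left
at 0, the whole 14480 bytes would look like a single segment and the ACK
could be deferred.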

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [patch] netpoll: call dev_put() on error in netpoll_setup()
From: Américo Wang @ 2011-06-12  5:48 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: WANG Cong, Herbert Xu, Neil Horman, David S. Miller, Eric Dumazet,
	open list:NETWORKING [GENERAL], kernel-janitors
In-Reply-To: <20110611155047.GA3583@shale.localdomain>

On Sat, Jun 11, 2011 at 11:50 PM, Dan Carpenter <error27@gmail.com> wrote:
> There is a dev_put(ndev) missing on an error path.  This was
> introduced in 0c1ad04aecb "netpoll: prevent netpoll setup on slave
> devices".
>
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> ---
> This is a static checker bug, and it's possible I've misunderstood
> something.

Oops! My bad... you didn't miss anything.

Thanks a lot for fixing it!

^ permalink raw reply

* Re: [PATCH 3/3] vlan: Simplify the code now that VLAN_FLAG_REORDER_HDR is always set
From: Eric W. Biederman @ 2011-06-12  6:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Nicolas de Pesloüan, Changli Gao, netdev,
	shemminger, kaber, fubar, eric.dumazet, andy, Jesse Gross
In-Reply-To: <20110609105900.GA11005@minipsycho.brq.redhat.com>

Jiri Pirko <jpirko@redhat.com> writes:

> Sun, May 22, 2011 at 09:42:20PM CEST, ebiederm@xmission.com wrote:
>>
>>Now that we no longer support clearing VLAN_FLAG_REORDER_HDR remove the
>>code that was needed to cope with the case when it was cleared.
>>
>>Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>>---
>> net/8021q/vlan_dev.c |   45 +++++----------------------------------------
>> 1 files changed, 5 insertions(+), 40 deletions(-)
>>
>>diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
>>index 20629fe..2b3ca1e 100644
>>--- a/net/8021q/vlan_dev.c
>>+++ b/net/8021q/vlan_dev.c
>>@@ -96,63 +96,28 @@ static int vlan_dev_hard_header(struct sk_buff *skb, struct net_device *dev,
>> 				const void *daddr, const void *saddr,
>> 				unsigned int len)
>> {
>>-	struct vlan_hdr *vhdr;
>>-	unsigned int vhdrlen = 0;
>>-	u16 vlan_tci = 0;
>> 	int rc;
>> 
>>-	if (!(vlan_dev_info(dev)->flags & VLAN_FLAG_REORDER_HDR)) {
>>-		vhdr = (struct vlan_hdr *) skb_push(skb, VLAN_HLEN);
>>-
>>-		vlan_tci = vlan_dev_info(dev)->vlan_id;
>>-		vlan_tci |= vlan_dev_get_egress_qos_mask(dev, skb);
>>-		vhdr->h_vlan_TCI = htons(vlan_tci);
>>-
>>-		/*
>>-		 *  Set the protocol type. For a packet of type ETH_P_802_3/2 we
>>-		 *  put the length in here instead.
>>-		 */
>>-		if (type != ETH_P_802_3 && type != ETH_P_802_2)
>>-			vhdr->h_vlan_encapsulated_proto = htons(type);
>>-		else
>>-			vhdr->h_vlan_encapsulated_proto = htons(len);
>>-
>>-		skb->protocol = htons(ETH_P_8021Q);
>>-		type = ETH_P_8021Q;
>>-		vhdrlen = VLAN_HLEN;
>>-	}
>>-
>> 	/* Before delegating work to the lower layer, enter our MAC-address */
>> 	if (saddr == NULL)
>> 		saddr = dev->dev_addr;
>> 
>> 	/* Now make the underlying real hard header */
>> 	dev = vlan_dev_info(dev)->real_dev;
>>-	rc = dev_hard_header(skb, dev, type, daddr, saddr, len + vhdrlen);
>>-	if (rc > 0)
>>-		rc += vhdrlen;
>>+	rc = dev_hard_header(skb, dev, type, daddr, saddr, len);
>> 	return rc;
>> }
>> 
>> static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
>> 					    struct net_device *dev)
>> {
>>-	struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
>> 	unsigned int len;
>>+	u16 vlan_tci;
>> 	int ret;
>> 
>>-	/* Handle non-VLAN frames if they are sent to us, for example by DHCP.
>>-	 *
>>-	 * NOTE: THIS ASSUMES DIX ETHERNET, SPECIFICALLY NOT SUPPORTING
>>-	 * OTHER THINGS LIKE FDDI/TokenRing/802.3 SNAPs...
>>-	 */
>>-	if (veth->h_vlan_proto != htons(ETH_P_8021Q) ||
> 	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	    		this should stay here.
>

I admit this is a change in behavior from today so we need to be careful
here.

At least the comment needs to change.

When this code was written the assumption was that everything would come
in tagged and this code was a special exception for pf_packet sockets.

Now that everything comes in untagged, this code becomes a special
exception for pf_packet sockets of not putting on a vlan header.

To me this special exception for pf_packet sockets looks like a bug that
should have been conditional on VLAN_FLAG_REORDER_HDR but was
overlooked.

Now maybe we want to be bug compatible, or do this as a separate patch.

I would expect sending tagged packets out a vlan device would result
in double tagged packets but apparently not always.


Eric

^ permalink raw reply

* Re: [RFC] breakage in sysfs_readdir() and s_instances abuse in sysfs
From: Eric W. Biederman @ 2011-06-12  7:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds, netdev, Linux Containers
In-Reply-To: <20110609012646.GV11521@ZenIV.linux.org.uk>

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Jun 07, 2011 at 11:59:02PM +0100, Al Viro wrote:
>
>> Completely untested patch follows:
>
> FWIW, modulo a couple of brainos it seems to work here without blowing
> everything up.  Actual freeing of struct net is delayed until after
> umount of sysfs instances refering it, shutdown is *not* delayed, sysfs
> entries are removed nicely as they ought to, no objects from other net_ns
> are picked by readdir or lookup...
>
> If somebody has objections - yell.  If I don't hear anything, the following
> goes to -next:

The change seems reasonable, and much more likely to keep working after
small changes to sysfs.  Sigh.

I honestly hate the pattern that is being used here.  Holding a
reference count because we can't be bothered to free things reliably
when we actually stop using them.  It does look less error prone than
what I am doing today, and 2.5KiB for struct net isn't that much memory
to pin.

Will pinning an extra 2.5KiB be a problem when we get to the point where
unprivileged mounts are safe?  I expect there are easier ways to pin more
memory so I doubt this is worth worrying about.

The naming of things after your patch stinks.

It isn't clear what is taking or putting what kind of refcount from the
names.  If we don't correct the bad naming your patch will be worse for
maintenance than what we already have.

> commit 72d4e98002b45598bb88af74eeac20874f2789be
> Author: Al Viro <viro@zeniv.linux.org.uk>
> Date:   Wed Jun 8 21:13:01 2011 -0400
>
>     Delay struct net freeing while there's a sysfs instance refering to it
>     
>     	* new refcount in struct net, controlling actual freeing of the memory
>     	* new method in kobj_ns_type_operations (->put_ns())
>     	* ->current_ns() semantics change - it's supposed to be followed by
>     corresponding ->put_ns().  For struct net in case of CONFIG_NET_NS it bumps
>     the new refcount; net_put_ns() decrements it and calls net_free() if the
>     last reference has been dropped.

We need to rename kobj_ns_current so it is clear we get a ref count.
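
To illustrate the naming concern, here is a minimal userspace sketch of a
get/put pair whose names say that a reference is taken and dropped. The
struct and the function names (ns_model, ns_grab_current, ns_drop) are made
up for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an object whose memory is kept alive by a refcount. */
struct ns_model {
	int passive;	/* references that only pin the memory */
	bool freed;	/* stands in for the actual free */
};

/* The name makes it explicit that the caller now owns a reference. */
static struct ns_model *ns_grab_current(struct ns_model *ns)
{
	ns->passive++;
	return ns;
}

/* Drops one reference; the last drop releases the memory. */
static void ns_drop(struct ns_model *ns)
{
	assert(ns->passive > 0);
	if (--ns->passive == 0)
		ns->freed = true;
}
```

A caller that sees "grab" knows a matching "drop" is owed, which is harder
to infer from a name that merely says "current".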

>     	* old net_free() callers call net_put_ns() instead.
>     	* sysfs_exit_ns() is gone, along with a large part of callchain
>     leading to it; now that the references stored in ->ns[...] stay valid we
>     do not need to hunt them down and replace them with NULL.  That fixes
>     problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
>     of sb->s_instances abuse.

Honestly I can fix at least the lookup problems by assigning 
((void *)-1UL) instead of NULL, in sysfs_exit_ns.
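
A rough user-space sketch of that idea follows.  It assumes, purely for
illustration, a lookup filter that skips the comparison when either tag
is NULL; entry_matches(), ns_tag_retire() and NS_TAG_DEAD are hypothetical
names, and the real sysfs lookup logic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical poison tag: not NULL and not the address of any object. */
#define NS_TAG_DEAD ((const void *)-1UL)

/*
 * Assumed filter: an entry is skipped only when both tags are set and
 * they differ, so a NULL superblock tag behaves like a wildcard.
 */
static bool entry_matches(const void *sb_tag, const void *entry_tag)
{
	if (sb_tag && entry_tag && sb_tag != entry_tag)
		return false;
	return true;
}

/* Invalidate a superblock's tag, either with NULL or with the poison. */
static void ns_tag_retire(const void **slot, bool poison)
{
	assert(slot != NULL);
	*slot = poison ? NS_TAG_DEAD : NULL;
}
```

Under that assumption, a tag retired to NULL starts matching entries of
every namespace, while a tag retired to the poison value matches no
tagged entry at all.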

>     	Note that struct net *shutdown* logic has not changed - net_cleanup()
>     is called exactly when it used to be called.  The only thing postponed by
>     having a sysfs instance referring to that struct net is actual freeing of
>     memory occupied by struct net.
>     
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
> index 2668957..1e7a2ee 100644
> --- a/fs/sysfs/mount.c
> +++ b/fs/sysfs/mount.c
> @@ -111,8 +111,11 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
>  		info->ns[type] = kobj_ns_current(type);
>  
>  	sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
> -	if (IS_ERR(sb) || sb->s_fs_info != info)
> +	if (IS_ERR(sb) || sb->s_fs_info != info) {
> +		for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
> +			kobj_ns_put(type, info->ns[type]);
>  		kfree(info);
> +	}
>  	if (IS_ERR(sb))
>  		return ERR_CAST(sb);
>  	if (!sb->s_root) {
> @@ -131,11 +134,14 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
>  static void sysfs_kill_sb(struct super_block *sb)
>  {
>  	struct sysfs_super_info *info = sysfs_info(sb);
> +	int type;
>  
>  	/* Remove the superblock from fs_supers/s_instances
>  	 * so we can't find it, before freeing sysfs_super_info.
>  	 */
>  	kill_anon_super(sb);
> +	for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
> +		kobj_ns_put(type, info->ns[type]);

This loop and the kfree probably deserve a small function of their own.

>  	kfree(info);
>  }
>  
> @@ -145,28 +151,6 @@ static struct file_system_type sysfs_fs_type = {
>  	.kill_sb	= sysfs_kill_sb,
>  };
>  
> -void sysfs_exit_ns(enum kobj_ns_type type, const void *ns)
> -{
> -	struct super_block *sb;
> -
> -	mutex_lock(&sysfs_mutex);
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
> -		struct sysfs_super_info *info = sysfs_info(sb);
> -		/*
> -		 * If we see a superblock on the fs_supers/s_instances
> -		 * list the unmount has not completed and sb->s_fs_info
> -		 * points to a valid struct sysfs_super_info.
> -		 */
> -		/* Ignore superblocks with the wrong ns */
> -		if (info->ns[type] != ns)
> -			continue;
> -		info->ns[type] = NULL;
> -	}
> -	spin_unlock(&sb_lock);
> -	mutex_unlock(&sysfs_mutex);
> -}
> -
>  int __init sysfs_init(void)
>  {
>  	int err = -ENOMEM;
> diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
> index 3d28af3..2ed2404 100644
> --- a/fs/sysfs/sysfs.h
> +++ b/fs/sysfs/sysfs.h
> @@ -136,7 +136,7 @@ struct sysfs_addrm_cxt {
>   * instance).
>   */
>  struct sysfs_super_info {
> -	const void *ns[KOBJ_NS_TYPES];
> +	void *ns[KOBJ_NS_TYPES];
>  };
>  #define sysfs_info(SB) ((struct sysfs_super_info *)(SB->s_fs_info))
>  extern struct sysfs_dirent sysfs_root;
> diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
> index 82cb5bf..5fa481c 100644
> --- a/include/linux/kobject_ns.h
> +++ b/include/linux/kobject_ns.h
> @@ -38,9 +38,10 @@ enum kobj_ns_type {
>   */
>  struct kobj_ns_type_operations {
>  	enum kobj_ns_type type;
> -	const void *(*current_ns)(void);
> +	void *(*current_ns)(void);

I really don't like removing const here.  It made it very clear that
what we are messing with is a token and not something that we will ever
dereference.

>  	const void *(*netlink_ns)(struct sock *sk);
>  	const void *(*initial_ns)(void);
> +	void (*put_ns)(void *);
>  };
>  
>  int kobj_ns_type_register(const struct kobj_ns_type_operations *ops);
> @@ -48,9 +49,9 @@ int kobj_ns_type_registered(enum kobj_ns_type type);
>  const struct kobj_ns_type_operations *kobj_child_ns_ops(struct kobject *parent);
>  const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
>  
> -const void *kobj_ns_current(enum kobj_ns_type type);
> +void *kobj_ns_current(enum kobj_ns_type type);
>  const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
>  const void *kobj_ns_initial(enum kobj_ns_type type);
> -void kobj_ns_exit(enum kobj_ns_type type, const void *ns);
> +void kobj_ns_put(enum kobj_ns_type type, void *ns);
>  
>  #endif /* _LINUX_KOBJECT_NS_H */
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index c3acda6..e2696d7 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -177,9 +177,6 @@ struct sysfs_dirent *sysfs_get_dirent(struct sysfs_dirent *parent_sd,
>  struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd);
>  void sysfs_put(struct sysfs_dirent *sd);
>  
> -/* Called to clear a ns tag when it is no longer valid */
> -void sysfs_exit_ns(enum kobj_ns_type type, const void *tag);
> -
>  int __must_check sysfs_init(void);
>  
>  #else /* CONFIG_SYSFS */
> @@ -338,10 +335,6 @@ static inline void sysfs_put(struct sysfs_dirent *sd)
>  {
>  }
>  
> -static inline void sysfs_exit_ns(int type, const void *tag)
> -{
> -}
> -
>  static inline int __must_check sysfs_init(void)
>  {
>  	return 0;
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 2bf9ed9..415af64 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -35,8 +35,11 @@ struct netns_ipvs;
>  #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS)
>  
>  struct net {
> +	atomic_t		passive;	/* To decided when the network
> +						 * namespace should be freed.
> +						 */
>  	atomic_t		count;		/* To decided when the network
> -						 *  namespace should be freed.
> +						 *  namespace should be shut down.
>  						 */
>  #ifdef NETNS_REFCNT_DEBUG
>  	atomic_t		use_count;	/* To track references we
> @@ -154,6 +157,9 @@ int net_eq(const struct net *net1, const struct net *net2)
>  {
>  	return net1 == net2;
>  }
> +
> +extern void net_put_ns(void *);
> +
>  #else
>  
>  static inline struct net *get_net(struct net *net)
> @@ -175,6 +181,8 @@ int net_eq(const struct net *net1, const struct net *net2)
>  {
>  	return 1;
>  }
> +
> +#define net_put_ns NULL
>  #endif
>  
>  
> diff --git a/lib/kobject.c b/lib/kobject.c
> index 82dc34c..43116b2 100644
> --- a/lib/kobject.c
> +++ b/lib/kobject.c
> @@ -948,9 +948,9 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj)
>  }
>  
>  
> -const void *kobj_ns_current(enum kobj_ns_type type)
> +void *kobj_ns_current(enum kobj_ns_type type)
>  {
> -	const void *ns = NULL;
> +	void *ns = NULL;
>  
>  	spin_lock(&kobj_ns_type_lock);
>  	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
> @@ -987,23 +987,15 @@ const void *kobj_ns_initial(enum kobj_ns_type type)
>  	return ns;
>  }
>  
> -/*
> - * kobj_ns_exit - invalidate a namespace tag
> - *
> - * @type: the namespace type (i.e. KOBJ_NS_TYPE_NET)
> - * @ns: the actual namespace being invalidated
> - *
> - * This is called when a tag is no longer valid.  For instance,
> - * when a network namespace exits, it uses this helper to
> - * make sure no sb's sysfs_info points to the now-invalidated
> - * netns.
> - */
> -void kobj_ns_exit(enum kobj_ns_type type, const void *ns)
> +void kobj_ns_put(enum kobj_ns_type type, void *ns)
>  {
> -	sysfs_exit_ns(type, ns);
> +	spin_lock(&kobj_ns_type_lock);
> +	if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
> +	    kobj_ns_ops_tbl[type] && kobj_ns_ops_tbl[type]->put_ns)
> +		kobj_ns_ops_tbl[type]->put_ns(ns);
> +	spin_unlock(&kobj_ns_type_lock);
>  }
>  
> -
>  EXPORT_SYMBOL(kobject_get);
>  EXPORT_SYMBOL(kobject_put);
>  EXPORT_SYMBOL(kobject_del);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 11b98bc..0c799f1 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1179,9 +1179,14 @@ static void remove_queue_kobjects(struct net_device *net)
>  #endif
>  }
>  
> -static const void *net_current_ns(void)
> +static void *net_current_ns(void)
>  {
> -	return current->nsproxy->net_ns;
> +	struct net *ns = current->nsproxy->net_ns;
> +#ifdef CONFIG_NET_NS
> +	if (ns)
> +		atomic_inc(&ns->passive);
> +#endif

This code doesn't need to be #ifdef'd.

> +	return ns;
>  }
>  
>  static const void *net_initial_ns(void)
> @@ -1199,19 +1204,10 @@ struct kobj_ns_type_operations net_ns_type_operations = {
>  	.current_ns = net_current_ns,
>  	.netlink_ns = net_netlink_ns,
>  	.initial_ns = net_initial_ns,
> +	.put_ns = net_put_ns,
>  };
>  EXPORT_SYMBOL_GPL(net_ns_type_operations);
>  
> -static void net_kobj_ns_exit(struct net *net)
> -{
> -	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
> -}
> -
> -static struct pernet_operations kobj_net_ops = {
> -	.exit = net_kobj_ns_exit,
> -};
> -
> -
>  #ifdef CONFIG_HOTPLUG
>  static int netdev_uevent(struct device *d, struct kobj_uevent_env *env)
>  {
> @@ -1339,6 +1335,5 @@ EXPORT_SYMBOL(netdev_class_remove_file);
>  int netdev_kobject_init(void)
>  {
>  	kobj_ns_type_register(&net_ns_type_operations);
> -	register_pernet_subsys(&kobj_net_ops);
>  	return class_register(&net_class);
>  }
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 6c6b86d..19feb20 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
>  	LIST_HEAD(net_exit_list);
>  
>  	atomic_set(&net->count, 1);
> +	atomic_set(&net->passive, 1);
>  
>  #ifdef NETNS_REFCNT_DEBUG
>  	atomic_set(&net->use_count, 0);
> @@ -210,6 +211,13 @@ static void net_free(struct net *net)
>  	kmem_cache_free(net_cachep, net);
>  }
>  
> +void net_put_ns(void *p)
> +{

There has got to be a better name.  We already have put and get net
methods.

> +	struct net *ns = p;
> +	if (ns && atomic_dec_and_test(&ns->passive))
> +		net_free(ns);
> +}
> +
>  struct net *copy_net_ns(unsigned long flags, struct net *old_net)
>  {
>  	struct net *net;
> @@ -230,7 +238,7 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
>  	}
>  	mutex_unlock(&net_mutex);
>  	if (rv < 0) {
> -		net_free(net);
> +		net_put_ns(net);
>  		return ERR_PTR(rv);
>  	}
>  	return net;
> @@ -286,7 +294,7 @@ static void cleanup_net(struct work_struct *work)
>  	/* Finally it is safe to free my network namespace structure */
>  	list_for_each_entry_safe(net, tmp, &net_exit_list, exit_list) {
>  		list_del_init(&net->exit_list);
> -		net_free(net);
> +		net_put_ns(net);
>  	}
>  }
>  static DECLARE_WORK(net_cleanup_work, cleanup_net);

Eric

* [RFC] procfs: add hidepid and hidenet modes
From: Vasiliy Kulikov @ 2011-06-12  7:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Andrew Morton, Linus Torvalds,
	Nikanth Karthikesan, David Rientjes, Greg Kroah-Hartman, Al Viro,
	Eric Dumazet, netdev, kernel-hardening

This patch introduces support for procfs mount options and adds mount
options to restrict access to /proc/PID/ directories and /proc/PID/net/
contents.  The default backward-compatible behaviour is left untouched.

The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:

hidepid=0 (default) means the current behaviour - anybody may read all
world-readable /proc/PID/* files.

hidepid=1 means users may not access any /proc/<pid>/ directories but their
own.  Sensitive files like cmdline, io, sched*, status, wchan are now
protected against other users.  As permission checking is done in
proc_pid_permission() and the files' permissions are left untouched,
programs expecting specific file permissions are not confused.

hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to
other users.  This does not hide whether a process exists (that can be
learned by other means, e.g. by sending signals), but it hides the
process' euid and egid.  It greatly complicates an intruder's task of
gathering info about running processes, whether some daemon runs with
elevated privileges, whether another user runs some sensitive program,
whether other users run any program at all, etc.

hidenet means /proc/PID/net will be accessible only to processes with the
CAP_NET_ADMIN capability or to members of a special group.

gid=XXX defines a group that will be able to gather all processes' info
and network connections info.
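
As an illustration, the access decision described by hidepid= and gid=
can be sketched in user-space C.  The sketch mirrors the checks in the
patch below, but struct proc_viewer, hidepid_check() and hidepid_listed()
are hypothetical names; the two booleans stand in for the results of
ptrace_may_access() and in_group_p().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { HIDEPID_OFF = 0, HIDEPID_NO_ACCESS = 1, HIDEPID_INVISIBLE = 2 };

/* Hypothetical summary of what the kernel would know about the caller. */
struct proc_viewer {
	bool may_ptrace_target;	/* stand-in for ptrace_may_access() */
	bool in_proc_gid;	/* stand-in for in_group_p(gid= value) */
};

/* Returns 0 if access to /proc/PID/ may proceed, else a negative errno. */
static int hidepid_check(int hide_pid, const struct proc_viewer *v)
{
	assert(hide_pid >= HIDEPID_OFF && hide_pid <= HIDEPID_INVISIBLE);
	if (hide_pid == HIDEPID_OFF || v->may_ptrace_target || v->in_proc_gid)
		return 0;
	/* Mode 2 pretends the directory does not exist at all. */
	return hide_pid == HIDEPID_INVISIBLE ? -ENOENT : -EPERM;
}

/* Whether the PID directory shows up when listing /proc. */
static bool hidepid_listed(int hide_pid, const struct proc_viewer *v)
{
	return hide_pid < HIDEPID_INVISIBLE ||
	       v->may_ptrace_target || v->in_proc_gid;
}
```

A caller that is neither allowed to ptrace the target nor a member of the
gid= group gets -EPERM with hidepid=1 and -ENOENT with hidepid=2, and only
in the latter mode does the directory also vanish from listings.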

Similar features are implemented for old kernels in -ow patches (for
Linux 2.2 and 2.4) and for Linux 2.6 in -grsecurity (but both of them
are implemented as build-time configuration options, not configurable
at runtime).


In the current version hidenet works for CONFIG_NET_NS=y by creating a
"fake" net namespace and handing it to nonauthorized users, so that those
users observe blank net files (as if nobody were using the network).  If
CONFIG_NET_NS=n I don't see anything better than just fully denying
access to /proc/<pid>/net.  More elegant ideas are welcome.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
--
 Documentation/filesystems/proc.txt |   51 ++++++++++++++++++++++
 fs/proc/base.c                     |   62 ++++++++++++++++++++++++++-
 fs/proc/inode.c                    |   20 +++++++++
 fs/proc/internal.h                 |    1 +
 fs/proc/proc_net.c                 |   26 +++++++++++
 fs/proc/root.c                     |   83 +++++++++++++++++++++++++++++++++++-
 include/linux/pid_namespace.h      |    3 +
 include/net/net_namespace.h        |    2 +
 net/core/net_namespace.c           |    2 +-
 9 files changed, 246 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 23cae65..4fd35c4 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -41,6 +41,8 @@ Table of Contents
   3.5	/proc/<pid>/mountinfo - Information about mounts
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
 
+  4	Configuring procfs
+  4.1	Mount options
 
 ------------------------------------------------------------------------------
 Preface
@@ -1535,3 +1537,52 @@ a task to set its own or one of its thread siblings comm value. The comm value
 is limited in size compared to the cmdline value, so writing anything longer
 then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
 comm value.
+
+
+------------------------------------------------------------------------------
+Configuring procfs
+------------------------------------------------------------------------------
+
+4.1	Mount options
+---------------------
+
+The following mount options are supported:
+
+	hidepid=	Set /proc/<pid>/ access mode.
+	hidenet		Hide /proc/<pid>/net/ from nonauthorized users.
+	nohidenet	Don't hide /proc/<pid>/net/ from nonauthorized users.
+	gid=		Set the group authorized to learn processes and
+			networking information.
+
+hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
+(default).
+
+hidepid=1 means users may not access any /proc/<pid>/ directories but their
+own.  Sensitive files like cmdline, io, sched*, status, wchan are now protected
+against other users.  This makes it impossible to learn whether any user runs a
+specific program (given the program doesn't reveal itself by its behaviour).
+As an additional bonus, as /proc/<pid>/cmdline is inaccessible to other users,
+poorly written programs passing sensitive information via program arguments are
+now protected against local eavesdroppers.
+
+hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
+users.  This does not hide whether a process with a specific pid value exists
+(that can be learned by other means, e.g. by sending signals), but it hides the
+process' euid and egid, which could otherwise be learned by stat()'ing
+/proc/<pid>/.  It greatly complicates an intruder's task of gathering info
+about running processes, whether some daemon runs with elevated privileges,
+whether another user runs some sensitive program, whether other users run any
+program at all, etc.
+
+hidenet means /proc/<pid>/net/ will be accessible only to processes with the
+CAP_NET_ADMIN capability or to members of a special group.  It means
+nonauthorized users may not learn any networking connection information.  If
+network namespace support is enabled (CONFIG_NET_NS=y) then common users would
+still see the net directory, but all files would indicate no networking
+activity at all.  If network namespaces are disabled, the net directory is
+inaccessible to common users.
+
+gid= sets the group authorized to learn the process information prohibited by
+hidepid= and the networking information prohibited by hidenet.  If you use a
+daemon like identd which has to learn information about the network or
+processes, just add identd to this group.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9d096e8..ff2feee 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -568,8 +568,40 @@ static int proc_setattr(struct dentry *dentry, struct iattr *attr)
 	return 0;
 }
 
+static int proc_pid_permission(struct inode *inode, int mask,
+			       unsigned int flags)
+{
+	struct pid_namespace *pid = inode->i_sb->s_fs_info;
+	struct task_struct *task = get_proc_task(inode);
+
+	if (pid->hide_pid &&
+	    !ptrace_may_access(task, PTRACE_MODE_READ) &&
+	    !in_group_p(pid->pid_gid)) {
+		if (pid->hide_pid == 2)
+			return -ENOENT;
+		else
+			return -EPERM;
+	}
+	return generic_permission(inode, mask, flags, NULL);
+}
+
+/*
+ * May current process learn task's euid/egid?
+ */
+static bool proc_pid_may_getattr(struct pid_namespace *pid,
+				 struct task_struct *task)
+{
+	if (pid->hide_pid < 2)
+		return true;
+	if (ptrace_may_access(task, PTRACE_MODE_READ))
+		return true;
+	return in_group_p(pid->pid_gid);
+}
+
+
 static const struct inode_operations proc_def_inode_operations = {
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 static int mounts_open_common(struct inode *inode, struct file *file,
@@ -1662,6 +1694,7 @@ static const struct inode_operations proc_pid_link_inode_operations = {
 	.readlink	= proc_pid_readlink,
 	.follow_link	= proc_pid_follow_link,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 
@@ -1730,6 +1763,7 @@ static int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
 	struct inode *inode = dentry->d_inode;
 	struct task_struct *task;
 	const struct cred *cred;
+	struct pid_namespace *pid = dentry->d_sb->s_fs_info;
 
 	generic_fillattr(inode, stat);
 
@@ -1738,6 +1772,14 @@ static int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
 	stat->gid = 0;
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
+		if (!proc_pid_may_getattr(pid, task)) {
+			rcu_read_unlock();
+			/*
+			 * This doesn't prevent learning whether PID exists,
+			 * it only makes getattr() consistent with readdir().
+			 */
+			return -ENOENT;
+		}
 		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
 		    task_dumpable(task)) {
 			cred = __task_cred(task);
@@ -2184,6 +2226,7 @@ static const struct inode_operations proc_fd_inode_operations = {
 	.lookup		= proc_lookupfd,
 	.permission	= proc_fd_permission,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 static struct dentry *proc_fdinfo_instantiate(struct inode *dir,
@@ -2236,6 +2279,7 @@ static const struct file_operations proc_fdinfo_operations = {
 static const struct inode_operations proc_fdinfo_inode_operations = {
 	.lookup		= proc_lookupfdinfo,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 
@@ -2473,6 +2517,7 @@ static const struct inode_operations proc_attr_dir_inode_operations = {
 	.lookup		= proc_attr_dir_lookup,
 	.getattr	= pid_getattr,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 #endif
@@ -2890,6 +2935,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
 	.lookup		= proc_tgid_base_lookup,
 	.getattr	= pid_getattr,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
@@ -3093,6 +3139,12 @@ static int proc_pid_fill_cache(struct file *filp, void *dirent, filldir_t filldi
 				proc_pid_instantiate, iter.task, NULL);
 }
 
+static int fake_filldir(void *buf, const char *name, int namelen,
+			loff_t offset, u64 ino, unsigned d_type)
+{
+	return 0;
+}
+
 /* for the /proc/ directory itself, after non-process stuff has been done */
 int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 {
@@ -3100,6 +3152,7 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	struct task_struct *reaper = get_proc_task(filp->f_path.dentry->d_inode);
 	struct tgid_iter iter;
 	struct pid_namespace *ns;
+	filldir_t __filldir;
 
 	if (!reaper)
 		goto out_no_task;
@@ -3116,8 +3169,13 @@ int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	for (iter = next_tgid(ns, iter);
 	     iter.task;
 	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
+		if (proc_pid_may_getattr(ns, iter.task))
+			__filldir = filldir;
+		else
+			__filldir = fake_filldir;
+
 		filp->f_pos = iter.tgid + TGID_OFFSET;
-		if (proc_pid_fill_cache(filp, dirent, filldir, iter) < 0) {
+		if (proc_pid_fill_cache(filp, dirent, __filldir, iter) < 0) {
 			put_task_struct(iter.task);
 			goto out;
 		}
@@ -3223,6 +3281,7 @@ static const struct inode_operations proc_tid_base_inode_operations = {
 	.lookup		= proc_tid_base_lookup,
 	.getattr	= pid_getattr,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 static struct dentry *proc_task_instantiate(struct inode *dir,
@@ -3448,6 +3507,7 @@ static const struct inode_operations proc_task_inode_operations = {
 	.lookup		= proc_task_lookup,
 	.getattr	= proc_task_getattr,
 	.setattr	= proc_setattr,
+	.permission	= proc_pid_permission,
 };
 
 static const struct file_operations proc_task_operations = {
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 176ce4c..895e3b1 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -7,6 +7,7 @@
 #include <linux/time.h>
 #include <linux/proc_fs.h>
 #include <linux/kernel.h>
+#include <linux/pid_namespace.h>
 #include <linux/mm.h>
 #include <linux/string.h>
 #include <linux/stat.h>
@@ -17,7 +18,9 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/sysctl.h>
+#include <linux/seq_file.h>
 #include <linux/slab.h>
+#include <linux/mount.h>
 
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -93,12 +96,29 @@ void __init proc_init_inodecache(void)
 					     init_once);
 }
 
+static int proc_show_options(struct seq_file *seq, struct vfsmount *vfs)
+{
+	struct super_block *sb = vfs->mnt_sb;
+	struct pid_namespace *pid = sb->s_fs_info;
+
+	if (pid->pid_gid)
+		seq_printf(seq, ",gid=%lu", (unsigned long)pid->pid_gid);
+	if (pid->hide_pid != 0)
+		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+	if (pid->hide_net)
+		seq_printf(seq, ",hidenet");
+
+	return 0;
+}
+
 static const struct super_operations proc_sops = {
 	.alloc_inode	= proc_alloc_inode,
 	.destroy_inode	= proc_destroy_inode,
 	.drop_inode	= generic_delete_inode,
 	.evict_inode	= proc_evict_inode,
 	.statfs		= simple_statfs,
+	.remount_fs	= proc_remount,
+	.show_options	= proc_show_options,
 };
 
 static void __pde_users_dec(struct proc_dir_entry *pde)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9ad561d..1cacb6a 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -110,6 +110,7 @@ void pde_put(struct proc_dir_entry *pde);
 extern struct vfsmount *proc_mnt;
 int proc_fill_super(struct super_block *);
 struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
+int proc_remount(struct super_block *sb, int *flags, char *data);
 
 /*
  * These are generic /proc routines that use the internal
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 9020ac1..a2a1f08 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -22,10 +22,13 @@
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
 #include <net/net_namespace.h>
+#include <linux/pid_namespace.h>
 #include <linux/seq_file.h>
 
 #include "internal.h"
 
+static struct net *fake_net;
+
 
 static struct net *get_proc_net(const struct inode *inode)
 {
@@ -105,6 +108,15 @@ static struct net *get_proc_task_net(struct inode *dir)
 	struct task_struct *task;
 	struct nsproxy *ns;
 	struct net *net = NULL;
+	struct pid_namespace *pid = dir->i_sb->s_fs_info;
+
+	if (pid->hide_net &&
+	    !in_group_p(pid->pid_gid) &&
+	    !capable(CAP_NET_ADMIN)) {
+		if (fake_net)
+			get_net(fake_net);
+		return fake_net;
+	}
 
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -239,3 +251,17 @@ int __init proc_net_init(void)
 
 	return register_pernet_subsys(&proc_net_ns_ops);
 }
+
+#ifdef CONFIG_NET_NS
+int __init proc_net_initcall(void)
+{
+	fake_net = net_create();
+	if (fake_net == NULL)
+		return -ENOMEM;
+
+	get_net(fake_net);
+	return 0;
+}
+
+late_initcall(proc_net_initcall);
+#endif
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ef9fa8e..10cc071 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -18,6 +18,7 @@
 #include <linux/bitops.h>
 #include <linux/mount.h>
 #include <linux/pid_namespace.h>
+#include <linux/parser.h>
 
 #include "internal.h"
 
@@ -35,6 +36,76 @@ static int proc_set_super(struct super_block *sb, void *data)
 	return set_anon_super(sb, NULL);
 }
 
+enum {
+	Opt_gid, Opt_hidepid, Opt_hidenet, Opt_nohidenet, Opt_err,
+};
+
+static const match_table_t tokens = {
+	{Opt_hidepid, "hidepid=%u"},
+	{Opt_gid, "gid=%u"},
+	{Opt_hidenet, "hidenet"},
+	{Opt_nohidenet, "nohidenet"},
+	{Opt_err, NULL},
+};
+
+static int proc_parse_options(char *options, struct pid_namespace *pid)
+{
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int option;
+
+	pr_debug("proc: options = %s\n", options);
+
+	if (!options)
+		return 1;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		args[0].to = args[0].from = 0;
+		token = match_token(p, tokens, args);
+		switch (token) {
+		case Opt_gid:
+			if (match_int(&args[0], &option))
+				return 0;
+			pid->pid_gid = option;
+			break;
+		case Opt_hidepid:
+			if (match_int(&args[0], &option))
+				return 0;
+			if (option < 0 || option > 2) {
+				pr_err("proc: hidepid value must be between 0 and 2.\n");
+				return 0;
+			}
+			pid->hide_pid = option;
+			break;
+		case Opt_hidenet:
+			pid->hide_net = true;
+			break;
+		case Opt_nohidenet:
+			pid->hide_net = false;
+			break;
+		default:
+			pr_err("proc: unrecognized mount option \"%s\" "
+			       "or missing value", p);
+			return 0;
+		}
+	}
+
+	pr_debug("proc: gid = %u, hidepid = %o, hidenet = %d\n",
+		pid->pid_gid, pid->hide_pid, (int)pid->hide_net);
+
+	return 1;
+}
+
+int proc_remount(struct super_block *sb, int *flags, char *data)
+{
+	struct pid_namespace *pid = sb->s_fs_info;
+	return !proc_parse_options(data, pid);
+}
+
 static struct dentry *proc_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
@@ -42,6 +113,7 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 	struct super_block *sb;
 	struct pid_namespace *ns;
 	struct proc_inode *ei;
+	char *options;
 
 	if (proc_mnt) {
 		/* Seed the root directory with a pid so it doesn't need
@@ -54,10 +126,13 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 			ei->pid = find_get_pid(1);
 	}
 
-	if (flags & MS_KERNMOUNT)
+	if (flags & MS_KERNMOUNT) {
 		ns = (struct pid_namespace *)data;
-	else
+		options = NULL;
+	} else {
 		ns = current->nsproxy->pid_ns;
+		options = data;
+	}
 
 	sb = sget(fs_type, proc_test_super, proc_set_super, ns);
 	if (IS_ERR(sb))
@@ -65,6 +140,10 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
 
 	if (!sb->s_root) {
 		sb->s_flags = flags;
+		if (!proc_parse_options(options, ns)) {
+			deactivate_locked_super(sb);
+			return ERR_PTR(-EINVAL);
+		}
 		err = proc_fill_super(sb);
 		if (err) {
 			deactivate_locked_super(sb);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..1c33094 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -30,6 +30,9 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	gid_t pid_gid;
+	int hide_pid;
+	bool hide_net;
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 1bf812b..d40c61c 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -113,6 +113,8 @@ static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
 }
 #endif /* CONFIG_NET */
 
+extern struct net *net_create(void);
+
 
 extern struct list_head net_namespace_list;
 
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 3f86026..c7c7310 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -216,7 +216,7 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
-static struct net *net_create(void)
+struct net *net_create(void)
 {
 	struct net *net;
 	int rv;
--

* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12  7:51 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev
In-Reply-To: <1307850224.22348.626.camel@localhost>


On 2011-06-12, Ben Hutchings <bhutchings@solarflare.com> wrote:
> LRO implementations (and GRO) are expected to put the actual segment
> size in skb_shared_info(skb)->gso_size on the aggregated skb.  TCP
> will then use that rather than the aggregated payload size when
> deciding whether to defer an ACK.

Thanks. I see that indeed gso_size is being used for MSS calculations
instead of the total GRO size.

However, I'm not sure that this completely answers my question.
I am not so much concerned about quick ACK vs delayed ACK.
Instead, I'm looking at the total number of ACKs transmitted.
The sender depends on the _number_ of ACKs to update its congestion
window.

As far as I can see, current code will send just one ACK per coalesced
GRO bundle, while the sender expects one ACK per two segments.
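
To put numbers on this, here is a small arithmetic sketch in C.  Both
helpers are hypothetical and only encode the two models compared above:
one ACK per two segments versus one ACK per coalesced bundle.

```c
#include <assert.h>

/*
 * ACKs sent for nsegs in-order segments when the receiver acknowledges
 * every second segment; a trailing odd segment is eventually ACKed too.
 */
static unsigned int acks_per_segment_pairs(unsigned int nsegs)
{
	return (nsegs + 1) / 2;
}

/* ACKs sent if each bundle of `bundle` segments triggers a single ACK. */
static unsigned int acks_per_bundle(unsigned int nsegs, unsigned int bundle)
{
	assert(bundle > 0);
	return (nsegs + bundle - 1) / bundle;	/* round up for a short last bundle */
}
```

For 16 segments coalesced into one bundle the second model yields a single
ACK, where the first model would produce 8.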

Thanks,
Joris.

* [PATCH net-next-2.6] l2tp: fix l2tp_ip_sendmsg() route handling
From: Eric Dumazet @ 2011-06-12  8:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, James Chapman

l2tp_ip_sendmsg() in non-connected mode incorrectly calls
sk_setup_caps(). Subsequent send() calls then send data to the wrong
destination.

We can also avoid changing the dst refcount in connected mode by using
appropriate RCU locking. Once output route lookups can also be done
under RCU, sendto() calls won't change dst refcounts either.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: James Chapman <jchapman@katalix.com>
---
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index b6466e7..d21e7eb 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -480,18 +480,16 @@ static int l2tp_ip_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 	if (connected)
 		rt = (struct rtable *) __sk_dst_check(sk, 0);
 
+	rcu_read_lock();
 	if (rt == NULL) {
-		struct ip_options_rcu *inet_opt;
+		const struct ip_options_rcu *inet_opt;
 
-		rcu_read_lock();
 		inet_opt = rcu_dereference(inet->inet_opt);
 
 		/* Use correct destination address if we have options. */
 		if (inet_opt && inet_opt->opt.srr)
 			daddr = inet_opt->opt.faddr;
 
-		rcu_read_unlock();
-
 		/* If this fails, retransmit mechanism of transport layer will
 		 * keep trying until route appears or the connection times
 		 * itself out.
@@ -503,12 +501,20 @@ static int l2tp_ip_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *m
 					   sk->sk_bound_dev_if);
 		if (IS_ERR(rt))
 			goto no_route;
-		sk_setup_caps(sk, &rt->dst);
+		if (connected)
+			sk_setup_caps(sk, &rt->dst);
+		else
+			dst_release(&rt->dst); /* safe since we hold rcu_read_lock */
 	}
-	skb_dst_set(skb, dst_clone(&rt->dst));
+
+	/* We dont need to clone dst here, it is guaranteed to not disappear.
+	 *  __dev_xmit_skb() might force a refcount if needed.
+	 */
+	skb_dst_set_noref(skb, &rt->dst);
 
 	/* Queue the packet to IP for output */
 	rc = ip_queue_xmit(skb, &inet->cork.fl);
+	rcu_read_unlock();
 
 error:
 	/* Update stats */
@@ -525,6 +531,7 @@ out:
 	return rc;
 
 no_route:
+	rcu_read_unlock();
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	rc = -EHOSTUNREACH;



^ permalink raw reply related

* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12  9:07 UTC (permalink / raw)
  To: Joris van Rantwijk; +Cc: Ben Hutchings, netdev
In-Reply-To: <20110612095131.6d924082@konijn>

On Sunday, 12 June 2011 at 09:51 +0200, Joris van Rantwijk wrote:
> On 2011-06-12, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > LRO implementations (and GRO) are expected to put the actual segment
> > size in skb_shared_info(skb)->gso_size on the aggregated skb.  TCP
> > will then use that rather than the aggregated payload size when
> > deciding whether to defer an ACK.
> 
> Thanks. I see that indeed gso_size is being used for MSS calculations
> instead of the total GRO size.
> 
> However, I'm not sure that this completely answers my question.
> I am not so much concerned about quick ACK vs delayed ACK.
> Instead, I'm looking at the total number of ACKs transmitted.
> The sender depends on the _number_ of ACKs to update its congestion
> window.
> 


> As far as I can see, current code will send just one ACK per coalesced
> GRO bundle, while the sender expects one ACK per two segments.
> 

One ACK carries an implicit ack for _all_ previous segments. If sender
only 'counts' ACKs, it is a bit dumb...


10:05:02.755146 IP 192.168.20.110.57736 > 192.168.20.108.53563: SWE 96444459:96444459(0) win 14600 <mss 1460,sackOK,timestamp 12174491 0,nop,wscale 8>
10:05:02.755242 IP 192.168.20.108.53563 > 192.168.20.110.57736: SE 1849523184:1849523184(0) ack 96444460 win 14480 <mss 1460,sackOK,timestamp 15334585 12174491,nop,wscale 7>
10:05:02.755310 IP 192.168.20.110.57736 > 192.168.20.108.53563: . ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755369 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 1:1449(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755417 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 1449 win 136 <nop,nop,timestamp 15334585 12174491>
10:05:02.755428 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 1449:8689(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755476 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 8689 win 159 <nop,nop,timestamp 15334585 12174491>
10:05:02.755482 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 8689:13033(4344) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755529 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 13033 win 181 <nop,nop,timestamp 15334585 12174491>
10:05:02.755535 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 13033:14481(1448) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755582 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 14481 win 204 <nop,nop,timestamp 15334585 12174491>
10:05:02.755588 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 14481:16385(1904) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755635 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 16385 win 227 <nop,nop,timestamp 15334585 12174491>
10:05:02.755641 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 16385:23625(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755689 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 23625 win 249 <nop,nop,timestamp 15334585 12174491>
10:05:02.755695 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 23625:26521(2896) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755742 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 26521 win 272 <nop,nop,timestamp 15334585 12174491>
10:05:02.755750 IP 192.168.20.110.57736 > 192.168.20.108.53563: P 26521:33761(7240) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755796 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 33761 win 295 <nop,nop,timestamp 15334585 12174491>
10:05:02.755802 IP 192.168.20.110.57736 > 192.168.20.108.53563: . 33761:39553(5792) ack 1 win 58 <nop,nop,timestamp 12174491 15334585>
10:05:02.755849 IP 192.168.20.108.53563 > 192.168.20.110.57736: . ack 39553 win 259 <nop,nop,timestamp 15334585 12174491>
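
To read the trace in terms of segments per ACK: the first data packet
carries 1448 bytes, so, assuming that is the per-segment payload, the
7240-byte packet spans five segments and is covered by a single ACK.
A minimal sketch of that arithmetic (segments_covered() is a
hypothetical helper, not a kernel function):

```c
#include <stdint.h>

/* Hypothetical helper: number of MSS-sized segments needed to carry
 * `bytes` of payload, rounding up for a short final segment.
 */
uint32_t segments_covered(uint32_t bytes, uint32_t mss)
{
	if (mss == 0)
		return 0;
	return (bytes + mss - 1) / mss;
}
```

For the packets above, segments_covered(7240, 1448) is 5 and
segments_covered(4344, 1448) is 3, each answered by one ACK.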




^ permalink raw reply

* Re: [PATCH] drivers/ssb/driver_chipcommon_pmu.c: uninitilized warning
From: Borislav Petkov @ 2011-06-12  9:07 UTC (permalink / raw)
  To: Connor Hansen; +Cc: Michael Büsch, mb, netdev, linux-kernel
In-Reply-To: <20110611205057.4b9bf462@maggie>

On Sat, Jun 11, 2011 at 08:50:57PM +0200, Michael Büsch wrote:
> On Sat, 11 Jun 2011 11:14:45 -0700
> Connor Hansen <cmdkhh@gmail.com> wrote:
> 
> > warning message
> > drivers/ssb/driver_chipcommon_pmu.c: In function ssb_pmu_resources_init
> > drivers/ssb/driver_chipcommon_pmu.c:420:15: warning: updown_tab_size may
> > be used uninitialized in this function.
> > 
> > updown_tab_size and depend_tab_size may not be set in the bus->chip_id
> > switch statement, so set to 0 by default to avoid using uninitialized
> > stack space.
> 
> We wouldn't be using uninitialized stack space or uninitialized variables,
> without this patch. However, for the sake of shutting up the compiler, ACK.

Also, Connor, you have to be more specific in your commit messages.
Especially if this warning is triggered only with the extended build
checks aka make W=[123].

Please take a look at the code and check whether this is actually
fixing anything first because Michael is correct - there's no case
where {depend,updown}_tab_size will be used uninitialized. So having
unnecessary churn when the compiler bitches about something it can't
understand yet is not something we want.
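
The pattern can be sketched in user space roughly as follows. This is
a simplified stand-in, not the ssb code: sum_tab() and tab_4312 are
hypothetical names and the table contents are made up. The size is
assigned only together with the pointer and read only when the
pointer is non-NULL, so it is never read uninitialized, even if the
compiler cannot prove that.

```c
#include <stddef.h>

/* Hypothetical table, standing in for a per-chip resource table. */
static const unsigned int tab_4312[] = { 1, 2, 3 };

/* Sum the entries of the table selected by chip_id; chips without a
 * table yield 0.
 */
unsigned int sum_tab(unsigned int chip_id)
{
	const unsigned int *tab = NULL;
	size_t tab_size;	/* deliberately left uninitialized */
	unsigned int sum = 0;
	size_t i;

	switch (chip_id) {
	case 0x4312:
		tab = tab_4312;
		tab_size = sizeof(tab_4312) / sizeof(tab_4312[0]);
		break;
	default:
		/* No table for this chip: tab stays NULL. */
		break;
	}

	/* tab_size is read only on paths that also assigned it. */
	if (tab) {
		for (i = 0; i < tab_size; i++)
			sum += tab[i];
	}
	return sum;
}
```

Initializing tab_size to 0 would not change the result of either
path; it would only silence the warning.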

Thanks.

> 
> > Signed-off-by: Connor Hansen <cmdkhh@gmail.com>
> > ---
> >  drivers/ssb/driver_chipcommon_pmu.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/ssb/driver_chipcommon_pmu.c b/drivers/ssb/driver_chipcommon_pmu.c
> > index 305ade7..a7aef47 100644
> > --- a/drivers/ssb/driver_chipcommon_pmu.c
> > +++ b/drivers/ssb/driver_chipcommon_pmu.c
> > @@ -417,9 +417,9 @@ static void ssb_pmu_resources_init(struct ssb_chipcommon *cc)
> >  	u32 min_msk = 0, max_msk = 0;
> >  	unsigned int i;
> >  	const struct pmu_res_updown_tab_entry *updown_tab = NULL;
> > -	unsigned int updown_tab_size;
> > +	unsigned int updown_tab_size = 0;
> >  	const struct pmu_res_depend_tab_entry *depend_tab = NULL;
> > -	unsigned int depend_tab_size;
> > +	unsigned int depend_tab_size = 0;
> >  
> >  	switch (bus->chip_id) {
> >  	case 0x4312:
> 

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply

* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12  9:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1307869632.2872.106.camel@edumazet-laptop>

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > As far as I can see, current code will send just one ACK per
> > coalesced GRO bundle, while the sender expects one ACK per two
> > segments.

> One ACK carries an implicit ack for _all_ previous segments. If sender
> only 'counts' ACKs, it is a bit dumb...

It may be dumb, but it's what the RFCs recommend and it's what Linux
implements.

RFC 5681:
  "During slow start, a TCP increments cwnd by at most SMSS bytes for
   each ACK received that cumulatively acknowledges new data."

In Linux, each incoming ACK causes one call to tcp_cong_avoid(),
which causes one call to tcp_slow_start() - assuming the connection is
in slow start - which increases the congestion window by one MSS.
Am I mistaken?

Please note I'm talking about managing the congestion window.
Of course I agree that each ACK implicitly covers all previous segments
for the purpose of retransmission management. But congestion
management is a different story.
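
To make the concern concrete, here is a toy model, not the kernel
code. cwnd is counted in segments, and slow_start_per_ack() and
slow_start_byte_count() are hypothetical helpers; the per-ACK cap in
the byte-counting variant is a plain parameter here.

```c
#include <stdint.h>

/* Per-ACK counting: each ACK that acknowledges new data grows cwnd by
 * one segment, no matter how many segments that ACK covers.
 */
uint32_t slow_start_per_ack(uint32_t cwnd, uint32_t acks)
{
	return cwnd + acks;
}

/* Byte counting: each ACK grows cwnd by the number of segments it
 * covers, capped at `limit` segments per ACK.
 */
uint32_t slow_start_byte_count(uint32_t cwnd, uint32_t acks,
			       uint32_t segs_per_ack, uint32_t limit)
{
	uint32_t inc = segs_per_ack < limit ? segs_per_ack : limit;

	return cwnd + acks * inc;
}
```

With cwnd = 10 and ten segments delivered, one ACK per two segments
(5 ACKs) gives slow_start_per_ack(10, 5) = 15, while one ACK per
five-segment bundle (2 ACKs) gives slow_start_per_ack(10, 2) = 12.
Byte counting with a cap of five, slow_start_byte_count(10, 2, 5, 5),
gives 20 in the bundled case.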

Joris.

^ permalink raw reply

* Re: Question about LRO/GRO and TCP acknowledgements
From: Eric Dumazet @ 2011-06-12 10:48 UTC (permalink / raw)
  To: Joris van Rantwijk; +Cc: netdev
In-Reply-To: <20110612113004.79f48f40@konijn>

On Sunday, 12 June 2011 at 11:30 +0200, Joris van Rantwijk wrote:
> On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > As far as I can see, current code will send just one ACK per
> > > coalesced GRO bundle, while the sender expects one ACK per two
> > > segments.
> 
> > One ACK carries an implicit ack for _all_ previous segments. If sender
> > only 'counts' ACKs, it is a bit dumb...
> 
> It may be dumb, but it's what the RFCs recommend and it's what Linux
> implements.
> 
> RFC 5681:
>   "During slow start, a TCP increments cwnd by at most SMSS bytes for
>    each ACK received that cumulatively acknowledges new data."
> 

Note also RFC says:

The RECOMMENDED way to increase cwnd during congestion avoidance is
   to count the number of bytes that have been acknowledged by ACKs for
   new data. 

So your concern is more a Sender side implementation missing this
recommendation, not GRO per se...

GRO kicks in when the receiver receives a train of consecutive frames
in its NAPI run. In order to really reduce the number of ACKs, you need
to receive 3 frames in a very short time.

This leads to the RTT rule : "Note that during congestion avoidance,
cwnd MUST NOT be increased by more than SMSS bytes per RTT"

So GRO, lowering number of ACKS, can help sender to not waste its time
on extra ACKS.



^ permalink raw reply

* [PATCH] Revert "net: minor cleanup to net_namespace.c."
From: Alexey Dobriyan @ 2011-06-12 11:09 UTC (permalink / raw)
  To: davem; +Cc: netdev, rlandley, jpirko

git revert 911cb193f3eb0370f20fbba712211e55ffede4de

	commit 911cb193f3eb0370f20fbba712211e55ffede4de
	Author: Rob Landley <rlandley@parallels.com>
	Date:   Fri Apr 15 02:26:25 2011 +0000
	
	    net: minor cleanup to net_namespace.c.
	    
	    Inline a small static function that's only ever called from one place.
	    
	    Signed-off-by: Rob Landley <rlandley@parallels.com>
	    Reviewed-by: Jiri Pirko <jpirko@redhat.com>
	    Signed-off-by: David S. Miller <davem@davemloft.net>

Rationale for net_create() still holds:
* C/R and other out-of-tree code uses or will use it,
* permissions checks are separate thing (and different in C/R case),
* one function doesn't hurt anything.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 net/core/net_namespace.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -210,14 +210,11 @@ static void net_free(struct net *net)
 	kmem_cache_free(net_cachep, net);
 }
 
-struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+static struct net *net_create(void)
 {
 	struct net *net;
 	int rv;
 
-	if (!(flags & CLONE_NEWNET))
-		return get_net(old_net);
-
 	net = net_alloc();
 	if (!net)
 		return ERR_PTR(-ENOMEM);
@@ -236,6 +233,13 @@ struct net *copy_net_ns(unsigned long flags, struct net *old_net)
 	return net;
 }
 
+struct net *copy_net_ns(unsigned long flags, struct net *old_net)
+{
+	if (!(flags & CLONE_NEWNET))
+		return get_net(old_net);
+	return net_create();
+}
+
 static DEFINE_SPINLOCK(cleanup_list_lock);
 static LIST_HEAD(cleanup_list);  /* Must hold cleanup_list_lock to touch */
 

^ permalink raw reply

* Re: [RFC] procfs: add hidepid and hidenet modes
From: Alexey Dobriyan @ 2011-06-12 11:12 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: linux-kernel, David S. Miller, Andrew Morton, Linus Torvalds,
	Nikanth Karthikesan, David Rientjes, Greg Kroah-Hartman, Al Viro,
	Eric Dumazet, netdev, kernel-hardening
In-Reply-To: <20110612075100.GA4459@albatros>

On Sun, Jun 12, 2011 at 11:51:01AM +0400, Vasiliy Kulikov wrote:
> hidenet means /proc/PID/net will be accessible to processes with
> CAP_NET_ADMIN capability or to members of a special group.
> 
> gid=XXX defines a group that will be able to gather all processes' info
> and network connections info.
> 
> Similar features are implemented for old kernels in -ow patches (for
> Linux 2.2 and 2.4) and for Linux 2.6 in -grsecurity (but both of them
> are implemented as configure options, not configurable at runtime).
> 
> 
> In current version hidenet works for CONFIG_NET_NS=y via creating a
> "fake" net namespace and slipping it to nonauthorized users, resulting
> in users observing blank net files (as if nobody uses the network).  If
> CONFIG_NET_NS=n I don't see anything better than just fully denying
> access to /proc/<pid>/net.  More elegant ideas are welcome.

This fake netns concept is ugly.
If you want to deny something, why don't you return -E?

Regardless, this should be a separate patch from the PID stuff.

^ permalink raw reply

* Re: Question about LRO/GRO and TCP acknowledgements
From: Joris van Rantwijk @ 2011-06-12 11:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1307875698.2872.130.camel@edumazet-laptop>

On 2011-06-12, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sunday, 12 June 2011 at 11:30 +0200, Joris van Rantwijk wrote:
> > > > As far as I can see, current code will send just one ACK per
> > > > coalesced GRO bundle, while the sender expects one ACK per two
> > > > segments.

> Note also RFC says:
> The RECOMMENDED way to increase cwnd during congestion avoidance is
>    to count the number of bytes that have been acknowledged by ACKs
> for new data. 

This is during the congestion avoidance phase. I'm actually more
concerned about the slow start phase, but congestion avoidance may also
be an issue.

By the way, Linux does not implement the recommended (byte-counting)
method by default. It can be enabled through sysctl tcp_abc, which is
off by default.

Also:
  Byte counting during congestion avoidance is also recommended,
  while the method from [RFC2581] and other safe methods are still
  allowed.

> So your concern is more a Sender side implementation missing this
> recommendation, not GRO per se...

Not really. The same RFC says:
  Specifically, an ACK SHOULD be generated for at least every
  second full-sized segment, ...

Sender side behaviour is just my argument for the practical importance
of this issue. But sender side arguments are not an excuse for the
receiver to deviate from its own recommended behaviour.

> GRO kicks in when the receiver receives a train of consecutive frames
> in its NAPI run. In order to really reduce the number of ACKs, you
> need to receive 3 frames in a very short time.
> 
> This leads to the RTT rule : "Note that during congestion avoidance,
> cwnd MUST NOT be increased by more than SMSS bytes per RTT"

But this RTT rule is already taken into account in the code which
increases cwnd during congestion avoidance. This code _assumes_ that
the receiver sends one ACK per two segments. If the receiver sends
fewer ACKs, the congestion window will grow too slowly.
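
A toy model of such a per-ACK counter, in the spirit of the kernel's
congestion avoidance but not a copy of it (struct ca_state,
ca_on_ack() and ca_run() are hypothetical names, cwnd is in
segments): cwnd grows by one segment once cwnd ACKs have arrived.

```c
#include <stdint.h>

struct ca_state {
	uint32_t cwnd;		/* congestion window, in segments */
	uint32_t cwnd_cnt;	/* ACKs seen since the last increase */
};

/* Process one ACK for new data: one segment of growth per cwnd ACKs. */
void ca_on_ack(struct ca_state *s)
{
	if (++s->cwnd_cnt >= s->cwnd) {
		s->cwnd++;
		s->cwnd_cnt = 0;
	}
}

/* Feed `acks` ACKs into a fresh state and return the resulting cwnd. */
uint32_t ca_run(uint32_t cwnd, uint32_t acks)
{
	struct ca_state s = { .cwnd = cwnd, .cwnd_cnt = 0 };

	while (acks--)
		ca_on_ack(&s);
	return s.cwnd;
}
```

With cwnd = 10, ten ACKs are needed for one increment: ca_run(10, 10)
is 11, but ca_run(10, 5) is still 10. The fewer ACKs the receiver
sends per window, the more round trips each increment takes in this
model.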

> So GRO, lowering number of ACKS, can help sender to not waste its time
> on extra ACKS.

I can see how the world may have been a better place if every sender
implemented Appropriate Byte Counting and TCP receivers were allowed to
send fewer ACKs. However, current reality is that ABC is optional,
disabled by default in Linux, and receivers are recommended to send one
ACK per two segments.

I suspect that GRO currently hurts the throughput of isolated TCP
connections. This is based on a purely theoretical argument. I may be
wrong and I have absolutely no data to confirm my suspicion.

If you can point out the flaw in my reasoning, I would be greatly
relieved. Until then, I remain concerned that there may be something
wrong with GRO and TCP ACKs.

Joris.

^ permalink raw reply

* Re: [PATCH] Revert "net: minor cleanup to net_namespace.c."
From: David Miller @ 2011-06-12 11:29 UTC (permalink / raw)
  To: adobriyan; +Cc: netdev, rlandley, jpirko
In-Reply-To: <20110612110950.GA11321@p183.telecom.by>

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Sun, 12 Jun 2011 14:09:50 +0300

> * C/R and other out-of-tree code uses or will use it,

Submit this with the out-of-tree code that needs it,
we never cater to external bits like this.

^ permalink raw reply

* Re: [PATCH] Revert "net: minor cleanup to net_namespace.c."
From: Alexey Dobriyan @ 2011-06-12 11:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, rlandley, jpirko
In-Reply-To: <20110612.042911.1713846819610012420.davem@davemloft.net>

On Sun, Jun 12, 2011 at 04:29:11AM -0700, David Miller wrote:
> From: Alexey Dobriyan <adobriyan@gmail.com>
> Date: Sun, 12 Jun 2011 14:09:50 +0300
> 
> > * C/R and other out-of-tree code uses or will use it,
> 
> Submit this with the out-of-tree code that needs it,
> we never cater to external bits like this.

It is not like I'm asking for an ugly knob or something.

net_create() creates netns, and permission checks are done earlier in
some other place.

Why would anyone merge them back, I can't understand.

^ permalink raw reply

* Re: Question about LRO/GRO and TCP acknowledgements
From: Alexander Zimmermann @ 2011-06-12 12:01 UTC (permalink / raw)
  To: Joris van Rantwijk; +Cc: Eric Dumazet, netdev
In-Reply-To: <20110612132428.3e1a4593@konijn>

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

Hi Joris,

On 12.06.2011 at 13:24, Joris van Rantwijk wrote:

> 
> By the way, Linux does not implement the recommended (byte-counting)
> method by default. It can be enabled through sysctl tcp_abc, which is
> off by default.
> 
> 

See http://kerneltrap.org/mailarchive/linux-netdev/2010/3/3/6271114


//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Signed part of the message --]
[-- Type: application/pgp-signature, Size: 243 bytes --]

^ permalink raw reply

* Re: [PATCH 01/10] net: introduce time stamping wrapper for netif_rx.
From: Richard Cochran @ 2011-06-12 12:04 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20110611.161025.317157517912316585.davem@davemloft.net>

On Sat, Jun 11, 2011 at 04:10:25PM -0700, David Miller wrote:
> 
> Also, it makes no sense to add this for obsolete RX processing such
> as netif_rx().
> 
> If drivers want to add fancy features like this timestamping stuff,
> they better move on to NAPI, GRO, etc. first.  Putting support for
> new features into deprecating things like netif_rx() makes no
> sense at all.

Okay, I see your point. I won't bother trying to improve the "academy
of ancient drivers," and I'll repost without the netif_rx wrapper.

However, I do want to support the coldfire fec driver, since Freescale
is selling two coldfire development boards with the dp83640 phy. But I
don't think it makes sense to try and upgrade the fec driver to napi,
when a simple "if !skb_defer_rx_timestamp" will do there.

Thanks,
Richard

^ permalink raw reply

