Netdev List
 help / color / mirror / Atom feed
* [PATCH V2 wireless-next] iwlwifi: iwlagn_request_scan: Fix check for priv->scan_request
From: Tim Gardner @ 2012-12-07 13:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tim Gardner, Johannes Berg, Wey-Yi Guy, Intel Linux Wireless,
	John W. Linville, Emmanuel Grumbach, Don Fry, linux-wireless,
	netdev
In-Reply-To: <1354830114.18211.4.camel@jlt4.sipsolutions.net>

The WARN_ON_ONCE() check for scan_request will not correctly detect
a NULL pointer for scan_type == IWL_SCAN_NORMAL. Make it explicit
that the check only applies to normal scans.

Convert WARN_ON_ONCE to WARN_ON since priv->scan_request really _can't_
be NULL for normal scans. If it is then we should emit frequent warnings.

This smatch warning led to scrutiny of iwlagn_request_scan():

drivers/net/wireless/iwlwifi/dvm/scan.c:894 iwlagn_request_scan() error: we previously assumed 'priv->scan_request' could be null (see line 792)

Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Cc: Intel Linux Wireless <ilw@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Don Fry <donald.h.fry@intel.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---

This patch does apply to 3.6.y, but it doesn't fix an existing
bug so I don't think it qualifies. This patch simply makes
the driver more robust for future development.

V2 - corrected indentation more like the rest of the source
in this file.

 drivers/net/wireless/iwlwifi/dvm/scan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/dvm/scan.c b/drivers/net/wireless/iwlwifi/dvm/scan.c
index bb9f625..fe91c5a 100644
--- a/drivers/net/wireless/iwlwifi/dvm/scan.c
+++ b/drivers/net/wireless/iwlwifi/dvm/scan.c
@@ -673,8 +673,9 @@ static int iwlagn_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	const u8 *ssid = NULL;
 	u8 ssid_len = 0;
 
-	if (WARN_ON_ONCE(priv->scan_request &&
-			 priv->scan_request->n_channels > MAX_SCAN_CHANNEL))
+	if (WARN_ON(priv->scan_type == IWL_SCAN_NORMAL &&
+		    (!priv->scan_request ||
+		    priv->scan_request->n_channels > MAX_SCAN_CHANNEL)))
 		return -EINVAL;
 
 	lockdep_assert_held(&priv->mutex);
-- 
1.7.9.5

^ permalink raw reply related

* GPF in ip6_dst_lookup_tail
From: Dave Jones @ 2012-12-07 14:15 UTC (permalink / raw)
  To: netdev

I just hit this gpf in overnight testing.

general protection fault: 0000 [#1] PREEMPT SMP 
Modules linked in: sctp libcrc32c ipt_ULOG fuse binfmt_misc nfnetlink nfc caif_socket caif phonet bluetooth rfkill can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 nfsv3 nfs_acl nfs fscache lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables usb_debug microcode serio_raw pcspkr i2c_piix4 i2c_core r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
CPU 3 
Pid: 19371, comm: trinity-child3 Not tainted 3.7.0-rc8+ #6 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff815fe038>]  [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
RSP: 0018:ffff880046277968  EFLAGS: 00010206
RAX: 2000000000000011 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff880046277b00 RSI: ffff880070a6cd80 RDI: ffff8800bb22d000
RBP: ffff8800462779f8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880046277a10
R13: ffff880046277b00 R14: ffff8800bb22d000 R15: ffffffff81cb40c0
FS:  00007f0817b3f740(0000) GS:ffff88012b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 00000000a7dfb000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child3 (pid: 19371, threadinfo ffff880046276000, task ffff880044dca450)
Stack:
 ffff8800bb22d000 0000000000000002 0000000000000001 0000000000000000
 ffff880046277998 ffffffff81335b56 ffff8800462779b8 ffffffff81074cec
 0000000000000001 ffff880070a6cd80 ffff8800462779f8 00000000ae4317df
Call Trace:
 [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
 [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
 [<ffffffff81527945>] ? sk_dst_check+0x5/0x260
 [<ffffffff815fe3fb>] ip6_sk_dst_lookup_flow+0xcb/0x1b0
 [<ffffffff8161f45e>] udpv6_sendmsg+0x66e/0xb90
 [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
 [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
 [<ffffffff815bd4e4>] inet_sendmsg+0x114/0x230
 [<ffffffff815bd3d5>] ? inet_sendmsg+0x5/0x230
 [<ffffffff815261e9>] ? sock_update_classid+0xa9/0x2d0
 [<ffffffff81526270>] ? sock_update_classid+0x130/0x2d0
 [<ffffffff81520d1c>] sock_sendmsg+0xbc/0xf0
 [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810bd037>] ? lock_release_non_nested+0x2b7/0x2f0
 [<ffffffff815210fc>] __sys_sendmsg+0x3ac/0x3c0
 [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
 [<ffffffff8100a136>] ? native_sched_clock+0x26/0x90
 [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff81055e54>] ? do_setitimer+0x1c4/0x300
 [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
 [<ffffffff811d6bea>] ? fget_light+0x3ca/0x500
 [<ffffffff810bdf9d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff815237d9>] sys_sendmsg+0x49/0x90
 [<ffffffff81684b42>] system_call_fastpath+0x16/0x1b
Code: 00 00 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 00 49 8b 34 24 48 8b 86 98 00 00 00 48 85 c0 74 c2 <f6> 80 6d 01 00 00 de 75 b9 48 8b 56 18 49 8d 75 24 b9 01 00 00 
RIP  [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
 RSP <ffff880046277968>
---[ end trace f7dde22e5674fdd6 ]---



0000000000000000 <.text>:
   0:	00 00                	add    %al,(%rax)
   2:	48 8b 5d d8          	mov    -0x28(%rbp),%rbx
   6:	4c 8b 65 e0          	mov    -0x20(%rbp),%r12
   a:	4c 8b 6d e8          	mov    -0x18(%rbp),%r13
   e:	4c 8b 75 f0          	mov    -0x10(%rbp),%r14
  12:	4c 8b 7d f8          	mov    -0x8(%rbp),%r15
  16:	c9                   	leaveq 
  17:	c3                   	retq   
  18:	0f 1f 00             	nopl   (%rax)
  1b:	49 8b 34 24          	mov    (%r12),%rsi
  1f:	48 8b 86 98 00 00 00 	mov    0x98(%rsi),%rax
  26:	48 85 c0             	test   %rax,%rax
  29:	74 c2                	je     0xffffffffffffffed

/home/davej/tmp/tmp.YOesIDyrIr.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	f6 80 6d 01 00 00 de 	testb  $0xde,0x16d(%rax)
   7:	75 b9                	jne    0xffffffffffffffc2
   9:	48 8b 56 18          	mov    0x18(%rsi),%rdx
   d:	49 8d 75 24          	lea    0x24(%r13),%rsi
  11:	b9                   	.byte 0xb9
  12:	01 00                	add    %eax,(%rax)
	...


which looks like an inlined copy of ipv6_addr_any

         * marked as OPTIMISTIC, we release the found
         * dst entry and replace it instead with the
         * dst entry of the nexthop router
         */
        rt = (struct rt6_info *) *dst;
        n = rt->n;
      dc:       48 8b 86 98 00 00 00    mov    0x98(%rsi),%rax
        if (n && !(n->nud_state & NUD_VALID)) {
      e3:       48 85 c0                test   %rax,%rax
      e6:       74 c2                   je     aa <ip6_dst_lookup_tail+0xaa>
      e8:       f6 80 6d 01 00 00 de    testb  $0xde,0x16d(%rax)
      ef:       75 b9                   jne    aa <ip6_dst_lookup_tail+0xaa>


RAX here is 2000000000000011 , which clearly isn't a valid rt address to dereference.

Suddenly this starts to look familiar..  I started seeing this around 3.6-rc7
back in September.  http://www.spinics.net/lists/netdev/msg211894.html

	Dave

^ permalink raw reply

* [PATCHv6] virtio-spec: virtio network device multiqueue support
From: Michael S. Tsirkin @ 2012-12-07 14:18 UTC (permalink / raw)
  To: Jason Wang; +Cc: rusty, virtualization, netdev, kvm, bhutchings

Add multiqueue support to virtio network device.
Add a new feature flag VIRTIO_NET_F_MQ for this feature, a new
configuration field max_virtqueue_pairs to detect supported number of
virtqueues as well as a new command VIRTIO_NET_CTRL_MQ to program
packet steering for unidirectional protocols.

---

Changes in v6:
- rename RFS -> multiqueue to avoid confusion with RFS in linux
  mention automatic receive steering as Rusty suggested

Changes in v5:
- Address Rusty's comments.
  Changes are only in the text, not the ideas.
- Some minor formatting changes.

Changes in v4:
- address Jason's comments
- have configuration specify the number of VQ pairs and not pairs - 1

Changes in v3:
- rename multiqueue -> rfs this is what we support
- Be more explicit about what driver should do.
- Simplify layout making VQs functionality depend on feature.
- Remove unused commands, only leave in programming # of queues

Changes in v2:
Address Jason's comments on v2:
- Changed STEERING_HOST to STEERING_RX_FOLLOWS_TX:
   this is both clearer and easier to support.
   It does not look like we need a separate steering command
   since host can just watch tx packets as they go.
- Moved RX and TX steering sections near each other.
- Add motivation for other changes in v2

Changes in v1 (from Jason's rfc):
- reserved vq 3: this makes all rx vqs even and tx vqs odd, which
   looks nicer to me.
- documented packet steering, added a generalized steering programming
   command. Current modes are single queue and host driven multiqueue,
   but I envision support for guest driven multiqueue in the future.
- make default vqs unused when in mq mode - this wastes some memory
   but makes it more efficient to switch between modes as
   we can avoid this causing packet reordering.

Rusty, could you please take a look and comment soon?
If this looks OK to everyone, we can proceed with finalizing the
implementation. Would be nice to try and put it in 3.8.

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index 83f2771..c5b32c4 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -59,6 +59,7 @@
 \author -608949062 "Rusty Russell,,," 
 \author -385801441 "Cornelia Huck" cornelia.huck@de.ibm.com
 \author 1531152142 "Paolo Bonzini,,," 
+\author 1986246365 "Michael S. Tsirkin" 
 \end_header
 
 \begin_body
@@ -4170,9 +4171,46 @@ ID 1
 \end_layout
 
 \begin_layout Description
-Virtqueues 0:receiveq.
- 1:transmitq.
- 2:controlq
+Virtqueues 0:receiveq
+\change_inserted 1986246365 1352742829
+0
+\change_unchanged
+.
+ 1:transmitq
+\change_inserted 1986246365 1352742832
+0
+\change_deleted 1986246365 1352742947
+.
+ 
+\change_inserted 1986246365 1352742952
+.
+ ....
+ 2N
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1354531595
+N=0 if VIRTIO_NET_F_MQ is not negotiated, otherwise N is derived from 
+\emph on
+max_virtqueue_pairs
+\emph default
+ control
+\emph on
+ 
+\emph default
+field.
+ 
+\end_layout
+
+\end_inset
+
+: receivqN.
+ 2N+1: transmitqN.
+ 2N+
+\change_unchanged
+2:controlq
 \begin_inset Foot
 status open
 
@@ -4343,6 +4381,16 @@ VIRTIO_NET_F_CTRL_VLAN
 
 \begin_layout Description
 VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
+\change_inserted 1986246365 1352742767
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1986246365 1352742808
+VIRTIO_NET_F_MQ(22) Device supports multiqueue with automatic receive steering.
+\change_unchanged
+
 \end_layout
 
 \end_deeper
@@ -4355,11 +4403,45 @@ configuration
 \begin_inset space ~
 \end_inset
 
-layout Two configuration fields are currently defined.
+layout 
+\change_deleted 1986246365 1352743300
+Two
+\change_inserted 1986246365 1354531413
+Three
+\change_unchanged
+ configuration fields are currently defined.
  The mac address field always exists (though is only valid if VIRTIO_NET_F_MAC
  is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
  Two read-only bits are currently defined for the status field: VIRTIO_NET_S_LIN
 K_UP and VIRTIO_NET_S_ANNOUNCE.
+
+\change_inserted 1986246365 1354531470
+ The following read-only field, 
+\emph on
+max_virtqueue_pairs
+\emph default
+ only exists if VIRTIO_NET_F_MQ is set.
+ This field specifies the maximum number of each of transmit and receive
+ virtqueues (receiveq0..receiveq
+\emph on
+N
+\emph default
+ and transmitq0..transmitq
+\emph on
+N
+\emph default
+ respectively; 
+\emph on
+N
+\emph default
+=
+\emph on
+max_virtqueue_pairs - 1
+\emph default
+) that can be configured once VIRTIO_NET_F_MQ is negotiated.
+ Legal values for this field are 1 to 8000h.
+
+\change_unchanged
  
 \begin_inset listings
 inline false
@@ -4392,6 +4474,17 @@ struct virtio_net_config {
 \begin_layout Plain Layout
 
     u16 status;
+\change_inserted 1986246365 1354531427
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1354531437
+
+    u16 max_virtqueue_pairs;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -4410,7 +4503,24 @@ Device Initialization
 
 \begin_layout Enumerate
 The initialization routine should identify the receive and transmission
- virtqueues.
+ virtqueues
+\change_inserted 1986246365 1352744077
+, up to N+1 of each kind
+\change_unchanged
+.
+
+\change_inserted 1986246365 1352743942
+ If VIRTIO_NET_F_MQ feature bit is negotiated, 
+\emph on
+N=max_virtqueue_pairs-1
+\emph default
+, otherwise identify 
+\emph on
+N=0
+\emph default
+.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Enumerate
@@ -4452,10 +4562,33 @@ status
 
  config field.
  Otherwise, the link should be assumed active.
+\change_inserted 1986246365 1354529306
+
 \end_layout
 
 \begin_layout Enumerate
-The receive virtqueue should be filled with receive buffers.
+
+\change_inserted 1986246365 1354531717
+Only receiveq0, transmitq0 and controlq are used by default.
+ To use more queues driver must negotiate the VIRTIO_NET_F_MQ feature;
+ initialize up to 
+\emph on
+max_virtqueue_pairs
+\emph default
+ of each of transmit and receive queues; execute_VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SE
+T command specifying the number of the transmit and receive queues that
+ is going to be used and wait until the device consumes the controlq buffer
+ and acks this command.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Enumerate
+The receive virtqueue
+\change_inserted 1986246365 1352743953
+s
+\change_unchanged
+ should be filled with receive buffers.
  This is described in detail below in 
 \begin_inset Quotes eld
 \end_inset
@@ -4550,8 +4683,15 @@ Device Operation
 \end_layout
 
 \begin_layout Standard
-Packets are transmitted by placing them in the transmitq, and buffers for
- incoming packets are placed in the receiveq.
+Packets are transmitted by placing them in the transmitq
+\change_inserted 1986246365 1353593685
+0..transmitqN
+\change_unchanged
+, and buffers for incoming packets are placed in the receiveq
+\change_inserted 1986246365 1353593692
+0..receiveqN
+\change_unchanged
+.
  In each case, the packet itself is preceeded by a header:
 \end_layout
 
@@ -4861,6 +5001,17 @@ If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the
 struct virtio_net_hdr
 \family default
 .
+\change_inserted 1986246365 1353594518
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594638
+If VIRTIO_NET_F_MQ is negotiated, each of receiveq0...receiveqN that will
+ be used should be populated with receive buffers.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Subsection*
@@ -5293,8 +5444,151 @@ Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control vq.
  
 \end_layout
 
-\begin_layout Enumerate
+\begin_layout Subsection*
+
+\change_inserted 1986246365 1353593879
+Automatic receive steering in multiqueue mode
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354528882
+If the driver negotiates the VIRTIO_NET_F_MQ feature bit (depends on VIRTIO_NET
+_F_CTRL_VQ), it can transmit outgoing packets on one of the multiple transmitq0..t
+ransmitqN and ask the device to queue incoming packets into one the multiple
+ receiveq0..receiveqN depending on the packet flow.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594292
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594178
+
+struct virtio_net_ctrl_mq {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594212
+
+	u16 virtqueue_pairs;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594263
+
+#define VIRTIO_NET_CTRL_MQ    1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET        0 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN        1 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX        0x8000
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354531492
+Multiqueue is disabled by default.
+ Driver enables multiqueue by executing the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command,
+ specifying the number of the transmit and receive queues that will be used;
+ thus transmitq0..transmitqn and receiveq0..receiveqn where 
+\emph on
+n=virtqueue_pairs-1
+\emph default
+ will be used.
+ All these virtqueues must have been pre-configured in advance.
+ The range of legal values for the
+\emph on
+ virtqueue_pairs
+\emph off
+ field is between 1 and 
+\emph on
+max_virtqueue_pairs
+\emph off
+.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353595328
+When multiqueue is enabled, device uses automatic receive
+steering based on packet flow.
+Programming of the receive steering classificator is implicit.
+ Transmitting a packet of a specific flow on transmitqX will cause incoming
+ packets for this flow to be steered to receiveqX.
+ For uni-directional protocols, or where no packets have been transmitted
+ yet, device will steer a packet to a random queue out of the specified
+ receiveq0..receiveqn.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354528710
+Multiqueue is disabled by setting 
+\emph on
+virtqueue_pairs = 1
+\emph default
+ (this is the default).
+ After the command is consumed by the device, the device will not steer
+ new packets on virtqueues receveq1..receiveqN (i.e.
+ other than receiveq0) nor read from transmitq1..transmitqN (i.e.
+ other than transmitq0); accordingly, driver should not transmit new packets
+ on virtqueues other than transmitq0.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_deleted 1986246365 1353593873
 .
+
+\change_unchanged
  
 \end_layout
 

^ permalink raw reply related

* Re: [PATCH RFC 0/5] Containerize syslog
From: Serge Hallyn @ 2012-12-07 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rui Xiang, netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman
In-Reply-To: <20121207010355.c809b3f7.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

Quoting Andrew Morton (akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org):
> On Mon, 19 Nov 2012 01:51:09 -0800 ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) wrote:
> 
> > Are there any kernel print statements besides networking stack printks
> > that we want to move to show up in a new "kernel log" namespace?
> 
> That's a good question, and afaict it remains unanswered.

There are some other (not *terribly* compelling) cases.  For instance
selinux hooks, if you say mount an fs without xattr support or with
unsupported options, will printk a warning.  Things like stat.c and
capabilities and syslog print out warnings when userspace uses a
deprecated somethingorother - old stat syscall or sys_syslog without
CAP_SYSLOG.  That should go to the container.  Filesystems may give
warnings (bad mount options for tmpfs, bad uid owner for many of them,
etc) which belong in the container.  Obviously some belong on the host -
if they show a corrupt superblock which may indicate an attempt by the
container to crash the kernel.

> As so often happens, this patchset's changelogs forgot to describe the
> reason for the existence of this patchset.  Via a bit of lwn reading

Not as a separate justification admittedly, but the description was
meant to explain it:  right now /dev/kmsg and sys_syslog are not safe
and useful in a container;  syslog messages from host and containers
can be confusingly intermixed;  and helpful printks are not seen in
the container.

> and my awesome telepathic skills, I divine that something in networking
> is using syslog for kernel->userspace communications.
> 
> wtf?

Well, syslog is the kernel->userspace channel of last resort.

> Wouldn't it be better to just stop doing that, and to implement a
> respectable and reliable kernel->userspace messaging scheme?

Convenience functions on top of netlink?

> And leave syslog alone - it's a crude low-level thing for random
> unexpected things which operators might want to know about.

That sentence is a result of not calling a container admin an operator.
I can't argue it because I'm not sure whether to agree with that
classification.

-serge

^ permalink raw reply

* [PATCH net-next 00/10] tipc: more updates for the v3.8 content
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker

Here is another batch of TIPC changes.  The most interesting
thing is probably the non-blocking socket connect - I'm told
there were several users looking forward to seeing this.

Also there were some resource limitation changes that had
the right intent back in 2005, but were now apparently causing
needless limitations to people's real use cases; those have
been relaxed/removed.

There is a lockdep splat fix, but no need for a stable backport,
since it is virtually impossible to trigger in mainline; you
have to essentially modify code to force the probabilities
in your favour to see it.

The rest can largely be categorized as general cleanup of things
seen in the process of getting the above changes done.

Tested between 64 and 32 bit nodes with the test suite.  I've
also compile tested all the individual commits on the chain.

I'd originally figured on this queue not being ready for 3.8, but
the extended stabilization window of 3.7 has changed that.  On
the other hand, this can still be 3.9 material, if that simply
works better for folks - no problem for me to defer it to 2013.
If anyone spots any problems then I'll definitely defer it,
rather than rush a last minute respin.

Thanks,
Paul.
---

The following changes since commit b93196dc5af7729ff7cc50d3d322ab1a364aa14f:

  net: fix some compiler warning in net/core/neighbour.c (2012-12-05 21:50:37 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux.git tipc_net-next

for you to fetch changes up to c8c0055db5be549dbd7eef2f207c53bcd556de9e:

  tipc: refactor accept() code for improved readability (2012-12-06 17:29:53 -0500)

----------------------------------------------------------------
Erik Hugne (1):
      tipc: remove obsolete flush of stale reassembly buffer

Jon Maloy (1):
      tipc: change sk_receive_queue upper limit

Paul Gortmaker (2):
      tipc: standardize across connect/disconnect function naming
      tipc: refactor accept() code for improved readability

Ying Xue (6):
      tipc: eliminate aggregate sk_receive_queue limit
      tipc: sk_recv_queue size check only for connectionless sockets
      tipc: consolidate connection-oriented message reception in one function
      tipc: introduce non-blocking socket connect
      tipc: eliminate connection setup for implied connect in recv_msg()
      tipc: add lock nesting notation to quiet lockdep warning

 net/tipc/link.c   |  44 -------
 net/tipc/port.c   |  32 +++--
 net/tipc/port.h   |   6 +-
 net/tipc/socket.c | 360 +++++++++++++++++++++++++++++++-----------------------
 net/tipc/subscr.c |   2 +-
 5 files changed, 232 insertions(+), 212 deletions(-)

^ permalink raw reply

* [PATCH net-next 01/10] tipc: remove obsolete flush of stale reassembly buffer
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Erik Hugne, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Erik Hugne <erik.hugne@ericsson.com>

Each link instance has a periodic job checking if there is a stale
ongoing message reassembly associated to the link. If no new
fragment has been received during the last 4*[link_tolerance] period,
it is assumed the missing fragment will never arrive. As a consequence,
the reassembly buffer is discarded, and a gap in the message sequence
occurs.

This assumption is wrong. After we abandoned our ambition to develop
packet routing for multi-cluster networks, only single-hop packet
transfer remains as an option. For those, all packets are guaranteed
to be delivered in sequence to the defragmentation layer. Any failure
to achieve sequenced delivery will eventually lead to link reset, and
the reassembly buffer will be flushed anyway.

So we just remove this periodic check, which is now obsolete.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: also delete get/inc_timer count, since they are now unused]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/link.c | 44 --------------------------------------------
 1 file changed, 44 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 87bf5aa..daa6080 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -97,7 +97,6 @@ static int  link_send_sections_long(struct tipc_port *sender,
 				    struct iovec const *msg_sect,
 				    u32 num_sect, unsigned int total_len,
 				    u32 destnode);
-static void link_check_defragm_bufs(struct tipc_link *l_ptr);
 static void link_state_event(struct tipc_link *l_ptr, u32 event);
 static void link_reset_statistics(struct tipc_link *l_ptr);
 static void link_print(struct tipc_link *l_ptr, const char *str);
@@ -271,7 +270,6 @@ static void link_timeout(struct tipc_link *l_ptr)
 	}
 
 	/* do all other link processing performed on a periodic basis */
-	link_check_defragm_bufs(l_ptr);
 
 	link_state_event(l_ptr, TIMEOUT_EVT);
 
@@ -2497,16 +2495,6 @@ static void set_expected_frags(struct sk_buff *buf, u32 exp)
 	msg_set_bcast_ack(buf_msg(buf), exp);
 }
 
-static u32 get_timer_cnt(struct sk_buff *buf)
-{
-	return msg_reroute_cnt(buf_msg(buf));
-}
-
-static void incr_timer_cnt(struct sk_buff *buf)
-{
-	msg_incr_reroute_cnt(buf_msg(buf));
-}
-
 /*
  * tipc_link_recv_fragment(): Called with node lock on. Returns
  * the reassembled buffer if message is complete.
@@ -2585,38 +2573,6 @@ int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
 	return 0;
 }
 
-/**
- * link_check_defragm_bufs - flush stale incoming message fragments
- * @l_ptr: pointer to link
- */
-static void link_check_defragm_bufs(struct tipc_link *l_ptr)
-{
-	struct sk_buff *prev = NULL;
-	struct sk_buff *next = NULL;
-	struct sk_buff *buf = l_ptr->defragm_buf;
-
-	if (!buf)
-		return;
-	if (!link_working_working(l_ptr))
-		return;
-	while (buf) {
-		u32 cnt = get_timer_cnt(buf);
-
-		next = buf->next;
-		if (cnt < 4) {
-			incr_timer_cnt(buf);
-			prev = buf;
-		} else {
-			if (prev)
-				prev->next = buf->next;
-			else
-				l_ptr->defragm_buf = buf->next;
-			kfree_skb(buf);
-		}
-		buf = next;
-	}
-}
-
 static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tolerance)
 {
 	if ((tolerance < TIPC_MIN_LINK_TOL) || (tolerance > TIPC_MAX_LINK_TOL))
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

The sk_receive_queue limit control is currently performed for
all arriving messages, disregarding socket and message type.
But for connected sockets this check is redundant, since the protocol
flow control already makes queue overflow impossible.

We move the sk_receive_queue limit control so that it is only performed
for connectionless sockets, i.e. SOCK_RDM and SOCK_DGRAM type sockets.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index a059ed0..4d6a448 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1188,9 +1188,6 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
 	else
 		return 0;
 
-	if (msg_connected(msg))
-		threshold *= 4;
-
 	return queue_size >= threshold;
 }
 
@@ -1219,6 +1216,15 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	if (sock->state == SS_READY) {
 		if (msg_connected(msg))
 			return TIPC_ERR_NO_PORT;
+		/* Reject SOCK_DGRAM and SOCK_RDM messages if there isn't room
+		 * to queue it.
+		 */
+		recv_q_len = skb_queue_len(&sk->sk_receive_queue);
+		if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
+			if (rx_queue_full(msg, recv_q_len,
+			    OVERLOAD_LIMIT_BASE / 2))
+				return TIPC_ERR_OVERLOAD;
+		}
 	} else {
 		if (msg_mcast(msg))
 			return TIPC_ERR_NO_PORT;
@@ -1240,13 +1246,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 		}
 	}
 
-	/* Reject message if there isn't room to queue it */
-	recv_q_len = skb_queue_len(&sk->sk_receive_queue);
-	if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
-		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
-			return TIPC_ERR_OVERLOAD;
-	}
-
 	/* Enqueue message (finally!) */
 	TIPC_SKB_CB(buf)->handle = 0;
 	atomic_inc(&tipc_queue_size);
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 04/10] tipc: change sk_receive_queue upper limit
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Jon Maloy <jon.maloy@ericsson.com>

The sk_recv_queue upper limit for connectionless sockets has empirically
turned out to be too low. When we double the current limit we get much
fewer rejected messages and no noticable negative side-effects.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d6a448..65a6bfc 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/socket.c: TIPC socket API
  *
- * Copyright (c) 2001-2007, Ericsson AB
+ * Copyright (c) 2001-2007, 2012 Ericsson AB
  * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
  * All rights reserved.
  *
@@ -43,7 +43,7 @@
 #define SS_LISTENING	-1	/* socket is listening */
 #define SS_READY	-2	/* socket is connectionless */
 
-#define OVERLOAD_LIMIT_BASE	5000
+#define OVERLOAD_LIMIT_BASE	10000
 #define CONN_TIMEOUT_DEFAULT	8000	/* default connect timeout = 8s */
 
 struct tipc_sock {
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.

This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.

Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..a059ed0 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
  * net/tipc/socket.c: TIPC socket API
  *
  * Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	}
 
 	/* Reject message if there isn't room to queue it */
-	recv_q_len = (u32)atomic_read(&tipc_queue_size);
-	if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
-		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
-			return TIPC_ERR_OVERLOAD;
-	}
 	recv_q_len = skb_queue_len(&sk->sk_receive_queue);
 	if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
 		if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 05/10] tipc: standardize across connect/disconnect function naming
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

Currently we have tipc_disconnect and tipc_disconnect_port.  It is
not clear from the names alone, what they do or how they differ.
It turns out that tipc_disconnect just deals with the port locking
and then calls tipc_disconnect_port which does all the work.

If we rename as follows: tipc_disconnect_port --> __tipc_disconnect
then we will be following typical linux convention, where:

   __tipc_disconnect: "raw" function that does all the work.

   tipc_disconnect: wrapper that deals with locking and then calls
		    the real core __tipc_disconnect function

With this, the difference is immediately evident, and locking
violations are more apt to be spotted by chance while working on,
or even just while reading the code.

On the connect side of things, we currently only have the single
"tipc_connect2port" function.  It does both the locking at enter/exit,
and the core of the work.  Pending changes will make it desireable to
have the connect be a two part locking wrapper + worker function,
just like the disconnect is already.

Here, we make the connect look just like the updated disconnect case,
for the above reason, and for consistency.  In the process, we also
get rid of the "2port" suffix that was on the original name, since
it adds no descriptive value.

On close examination, one might notice that the above connect
changes implicitly move the call to tipc_link_get_max_pkt() to be
within the scope of tipc_port_lock() protected region; when it was
not previously.  We don't see any issues with this, and it is in
keeping with __tipc_connect doing the work and tipc_connect just
handling the locking.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/port.c   | 32 +++++++++++++++++++++++---------
 net/tipc/port.h   |  6 ++++--
 net/tipc/socket.c |  6 +++---
 net/tipc/subscr.c |  2 +-
 4 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/net/tipc/port.c b/net/tipc/port.c
index 07c42fb..18098ca 100644
--- a/net/tipc/port.c
+++ b/net/tipc/port.c
@@ -726,7 +726,7 @@ static void port_dispatcher_sigh(void *dummy)
 				if (unlikely(!cb))
 					goto reject;
 				if (unlikely(!connected)) {
-					if (tipc_connect2port(dref, &orig))
+					if (tipc_connect(dref, &orig))
 						goto reject;
 				} else if (peer_invalid)
 					goto reject;
@@ -1036,15 +1036,30 @@ int tipc_withdraw(u32 ref, unsigned int scope, struct tipc_name_seq const *seq)
 	return res;
 }
 
-int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
+int tipc_connect(u32 ref, struct tipc_portid const *peer)
 {
 	struct tipc_port *p_ptr;
-	struct tipc_msg *msg;
-	int res = -EINVAL;
+	int res;
 
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
+	res = __tipc_connect(ref, p_ptr, peer);
+	tipc_port_unlock(p_ptr);
+	return res;
+}
+
+/*
+ * __tipc_connect - connect to a remote peer
+ *
+ * Port must be locked.
+ */
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+			struct tipc_portid const *peer)
+{
+	struct tipc_msg *msg;
+	int res = -EINVAL;
+
 	if (p_ptr->published || p_ptr->connected)
 		goto exit;
 	if (!peer->ref)
@@ -1067,17 +1082,16 @@ int tipc_connect2port(u32 ref, struct tipc_portid const *peer)
 			  (net_ev_handler)port_handle_node_down);
 	res = 0;
 exit:
-	tipc_port_unlock(p_ptr);
 	p_ptr->max_pkt = tipc_link_get_max_pkt(peer->node, ref);
 	return res;
 }
 
-/**
- * tipc_disconnect_port - disconnect port from peer
+/*
+ * __tipc_disconnect - disconnect port from peer
  *
  * Port must be locked.
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr)
+int __tipc_disconnect(struct tipc_port *tp_ptr)
 {
 	int res;
 
@@ -1104,7 +1118,7 @@ int tipc_disconnect(u32 ref)
 	p_ptr = tipc_port_lock(ref);
 	if (!p_ptr)
 		return -EINVAL;
-	res = tipc_disconnect_port(p_ptr);
+	res = __tipc_disconnect(p_ptr);
 	tipc_port_unlock(p_ptr);
 	return res;
 }
diff --git a/net/tipc/port.h b/net/tipc/port.h
index 4660e30..fb66e2e 100644
--- a/net/tipc/port.h
+++ b/net/tipc/port.h
@@ -190,7 +190,7 @@ int tipc_publish(u32 portref, unsigned int scope,
 int tipc_withdraw(u32 portref, unsigned int scope,
 		struct tipc_name_seq const *name_seq);
 
-int tipc_connect2port(u32 portref, struct tipc_portid const *port);
+int tipc_connect(u32 portref, struct tipc_portid const *port);
 
 int tipc_disconnect(u32 portref);
 
@@ -200,7 +200,9 @@ int tipc_shutdown(u32 ref);
 /*
  * The following routines require that the port be locked on entry
  */
-int tipc_disconnect_port(struct tipc_port *tp_ptr);
+int __tipc_disconnect(struct tipc_port *tp_ptr);
+int __tipc_connect(u32 ref, struct tipc_port *p_ptr,
+		   struct tipc_portid const *peer);
 int tipc_port_peer_msg(struct tipc_port *p_ptr, struct tipc_msg *msg);
 
 /*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 65a6bfc..4d56eae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -791,7 +791,7 @@ static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect2port(tsock->p->ref, &tsock->peer_name);
+	tipc_connect(tsock->p->ref, &tsock->peer_name);
 	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
@@ -1254,7 +1254,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	/* Initiate connection termination for an incoming 'FIN' */
 	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
 		sock->state = SS_DISCONNECTING;
-		tipc_disconnect_port(tipc_sk_port(sk));
+		__tipc_disconnect(tipc_sk_port(sk));
 	}
 
 	sk->sk_data_ready(sk, 0);
@@ -1514,7 +1514,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 		/* Connect new socket to it's peer */
 		new_tsock->peer_name.ref = msg_origport(msg);
 		new_tsock->peer_name.node = msg_orignode(msg);
-		tipc_connect2port(new_ref, &new_tsock->peer_name);
+		tipc_connect(new_ref, &new_tsock->peer_name);
 		new_sock->state = SS_CONNECTED;
 
 		tipc_set_portimportance(new_ref, msg_importance(msg));
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 0f7d0d0..6b42d47 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -462,7 +462,7 @@ static void subscr_named_msg_event(void *usr_handle,
 		kfree(subscriber);
 		return;
 	}
-	tipc_connect2port(subscriber->port_ref, orig);
+	tipc_connect(subscriber->port_ref, orig);
 
 	/* Lock server port (& save lock address for future use) */
 	subscriber->lock = tipc_port_lock(subscriber->port_ref)->lock;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 06/10] tipc: consolidate connection-oriented message reception in one function
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

Handling of connection-related message reception is currently scattered
around at different places in the code. This makes it harder to verify
that things are handled correctly in all possible scenarios.
So we consolidate the existing processing of connection-oriented
message reception in a single routine.  In the process, we convert the
chain of if/else into a switch/case for improved readability.

A cast on the socket_state in the switch is needed to avoid compile
warnings on 32 bit, like "net/tipc/socket.c:1252:2: warning: case value
‘4294967295’ not in enumerated type".  This happens because existing
tipc code pseudo extends the default linux socket state values with:

	#define SS_LISTENING    -1      /* socket is listening */
	#define SS_READY        -2      /* socket is connectionless */

It may make sense to add these as _positive_ values to the existing
socket state enum list someday, vs. these already existing defines.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: add cast to fix warning; remove returns from middle of switch]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 75 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4d56eae..fc0aa4f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1192,6 +1192,53 @@ static int rx_queue_full(struct tipc_msg *msg, u32 queue_size, u32 base)
 }
 
 /**
+ * filter_connect - Handle all incoming messages for a connection-based socket
+ * @tsock: TIPC socket
+ * @msg: message
+ *
+ * Returns TIPC error status code and socket error status code
+ * once it encounters some errors
+ */
+static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
+{
+	struct socket *sock = tsock->sk.sk_socket;
+	struct tipc_msg *msg = buf_msg(*buf);
+	u32 retval = TIPC_ERR_NO_PORT;
+
+	if (msg_mcast(msg))
+		return retval;
+
+	switch ((int)sock->state) {
+	case SS_CONNECTED:
+		/* Accept only connection-based messages sent by peer */
+		if (msg_connected(msg) && tipc_port_peer_msg(tsock->p, msg)) {
+			if (unlikely(msg_errcode(msg))) {
+				sock->state = SS_DISCONNECTING;
+				__tipc_disconnect(tsock->p);
+			}
+			retval = TIPC_OK;
+		}
+		break;
+	case SS_CONNECTING:
+		/* Accept only ACK or NACK message */
+		if (msg_connected(msg) || (msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_LISTENING:
+	case SS_UNCONNECTED:
+		/* Accept only SYN message */
+		if (!msg_connected(msg) && !(msg_errcode(msg)))
+			retval = TIPC_OK;
+		break;
+	case SS_DISCONNECTING:
+		break;
+	default:
+		pr_err("Unknown socket state %u\n", sock->state);
+	}
+	return retval;
+}
+
+/**
  * filter_rcv - validate incoming message
  * @sk: socket
  * @buf: message
@@ -1208,6 +1255,7 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	struct socket *sock = sk->sk_socket;
 	struct tipc_msg *msg = buf_msg(buf);
 	u32 recv_q_len;
+	u32 res = TIPC_OK;
 
 	/* Reject message if it is wrong sort of message for socket */
 	if (msg_type(msg) > TIPC_DIRECT_MSG)
@@ -1226,24 +1274,9 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 				return TIPC_ERR_OVERLOAD;
 		}
 	} else {
-		if (msg_mcast(msg))
-			return TIPC_ERR_NO_PORT;
-		if (sock->state == SS_CONNECTED) {
-			if (!msg_connected(msg) ||
-			    !tipc_port_peer_msg(tipc_sk_port(sk), msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_CONNECTING) {
-			if (!msg_connected(msg) && (msg_errcode(msg) == 0))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_LISTENING) {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		} else if (sock->state == SS_DISCONNECTING) {
-			return TIPC_ERR_NO_PORT;
-		} else /* (sock->state == SS_UNCONNECTED) */ {
-			if (msg_connected(msg) || msg_errcode(msg))
-				return TIPC_ERR_NO_PORT;
-		}
+		res = filter_connect(tipc_sk(sk), &buf);
+		if (res != TIPC_OK || buf == NULL)
+			return res;
 	}
 
 	/* Enqueue message (finally!) */
@@ -1251,12 +1284,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 	atomic_inc(&tipc_queue_size);
 	__skb_queue_tail(&sk->sk_receive_queue, buf);
 
-	/* Initiate connection termination for an incoming 'FIN' */
-	if (unlikely(msg_errcode(msg) && (sock->state == SS_CONNECTED))) {
-		sock->state = SS_DISCONNECTING;
-		__tipc_disconnect(tipc_sk_port(sk));
-	}
-
 	sk->sk_data_ready(sk, 0);
 	return TIPC_OK;
 }
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 08/10] tipc: eliminate connection setup for implied connect in recv_msg()
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

As connection setup is now completed asynchronously in BH context,
in the function filter_connect(), the corresponding code in recv_msg()
becomes redundant.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1ba3b6f..0df42fa 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -954,13 +954,6 @@ restart:
 	sz = msg_data_sz(msg);
 	err = msg_errcode(msg);
 
-	/* Complete connection setup for an implied connect */
-	if (unlikely(sock->state == SS_CONNECTING)) {
-		res = auto_connect(sock, msg);
-		if (res)
-			goto exit;
-	}
-
 	/* Discard an empty non-errored message & try again */
 	if ((!sz) && (!err)) {
 		advance_rx_queue(sk);
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 07/10] tipc: introduce non-blocking socket connect
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

TIPC has so far only supported blocking connect(), meaning that a call
to connect() doesn't return until either the connection is fully
established, or an error occurs. This has proved insufficient for many
users, so we now introduce non-blocking connect(), analogous to how
this is done in TCP and other protocols.

With this feature, if a connection cannot be established instantly,
connect() will return the error code "-EINPROGRESS".
If the user later calls connect() again, he will either have the
return code "-EALREADY" or "-EISCONN", depending on whether the
connection has been established or not.

The user must have explicitly set the socket to be non-blocking
(SOCK_NONBLOCK or O_NONBLOCK, depending on method used), so unless
for some reason they had set this already (the socket would anyway
remain blocking in current TIPC) this change should be completely
backwards compatible.

It is also now possible to call select() or poll() to wait for the
completion of a connection.

An effect of the above is that the actual completion of a connection
may now be performed asynchronously, independent of the calls from
user space. Therefore, we now execute this code in BH context, in
the function filter_rcv(), which is executed upon reception of
messages in the socket.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
[PG: minor refactoring for improved connect/disconnect function names]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 158 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 93 insertions(+), 65 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index fc0aa4f..1ba3b6f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -783,16 +783,19 @@ exit:
 static int auto_connect(struct socket *sock, struct tipc_msg *msg)
 {
 	struct tipc_sock *tsock = tipc_sk(sock->sk);
-
-	if (msg_errcode(msg)) {
-		sock->state = SS_DISCONNECTING;
-		return -ECONNREFUSED;
-	}
+	struct tipc_port *p_ptr;
 
 	tsock->peer_name.ref = msg_origport(msg);
 	tsock->peer_name.node = msg_orignode(msg);
-	tipc_connect(tsock->p->ref, &tsock->peer_name);
-	tipc_set_portimportance(tsock->p->ref, msg_importance(msg));
+	p_ptr = tipc_port_deref(tsock->p->ref);
+	if (!p_ptr)
+		return -EINVAL;
+
+	__tipc_connect(tsock->p->ref, p_ptr, &tsock->peer_name);
+
+	if (msg_importance(msg) > TIPC_CRITICAL_IMPORTANCE)
+		return -EINVAL;
+	msg_set_importance(&p_ptr->phdr, (u32)msg_importance(msg));
 	sock->state = SS_CONNECTED;
 	return 0;
 }
@@ -1203,7 +1206,9 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 {
 	struct socket *sock = tsock->sk.sk_socket;
 	struct tipc_msg *msg = buf_msg(*buf);
+	struct sock *sk = &tsock->sk;
 	u32 retval = TIPC_ERR_NO_PORT;
+	int res;
 
 	if (msg_mcast(msg))
 		return retval;
@@ -1221,8 +1226,36 @@ static u32 filter_connect(struct tipc_sock *tsock, struct sk_buff **buf)
 		break;
 	case SS_CONNECTING:
 		/* Accept only ACK or NACK message */
-		if (msg_connected(msg) || (msg_errcode(msg)))
+		if (unlikely(msg_errcode(msg))) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = -ECONNREFUSED;
+			retval = TIPC_OK;
+			break;
+		}
+
+		if (unlikely(!msg_connected(msg)))
+			break;
+
+		res = auto_connect(sock, msg);
+		if (res) {
+			sock->state = SS_DISCONNECTING;
+			sk->sk_err = res;
 			retval = TIPC_OK;
+			break;
+		}
+
+		/* If an incoming message is an 'ACK-', it should be
+		 * discarded here because it doesn't contain useful
+		 * data. In addition, we should try to wake up
+		 * connect() routine if sleeping.
+		 */
+		if (msg_data_sz(msg) == 0) {
+			kfree_skb(*buf);
+			*buf = NULL;
+			if (waitqueue_active(sk_sleep(sk)))
+				wake_up_interruptible(sk_sleep(sk));
+		}
+		retval = TIPC_OK;
 		break;
 	case SS_LISTENING:
 	case SS_UNCONNECTED:
@@ -1369,8 +1402,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 	struct sock *sk = sock->sk;
 	struct sockaddr_tipc *dst = (struct sockaddr_tipc *)dest;
 	struct msghdr m = {NULL,};
-	struct sk_buff *buf;
-	struct tipc_msg *msg;
 	unsigned int timeout;
 	int res;
 
@@ -1382,26 +1413,6 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* For now, TIPC does not support the non-blocking form of connect() */
-	if (flags & O_NONBLOCK) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-
-	/* Issue Posix-compliant error code if socket is in the wrong state */
-	if (sock->state == SS_LISTENING) {
-		res = -EOPNOTSUPP;
-		goto exit;
-	}
-	if (sock->state == SS_CONNECTING) {
-		res = -EALREADY;
-		goto exit;
-	}
-	if (sock->state != SS_UNCONNECTED) {
-		res = -EISCONN;
-		goto exit;
-	}
-
 	/*
 	 * Reject connection attempt using multicast address
 	 *
@@ -1413,49 +1424,66 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 		goto exit;
 	}
 
-	/* Reject any messages already in receive queue (very unlikely) */
-	reject_rx_queue(sk);
+	timeout = (flags & O_NONBLOCK) ? 0 : tipc_sk(sk)->conn_timeout;
+
+	switch (sock->state) {
+	case SS_UNCONNECTED:
+		/* Send a 'SYN-' to destination */
+		m.msg_name = dest;
+		m.msg_namelen = destlen;
+
+		/* If connect is in non-blocking case, set MSG_DONTWAIT to
+		 * indicate send_msg() is never blocked.
+		 */
+		if (!timeout)
+			m.msg_flags = MSG_DONTWAIT;
+
+		res = send_msg(NULL, sock, &m, 0);
+		if ((res < 0) && (res != -EWOULDBLOCK))
+			goto exit;
 
-	/* Send a 'SYN-' to destination */
-	m.msg_name = dest;
-	m.msg_namelen = destlen;
-	res = send_msg(NULL, sock, &m, 0);
-	if (res < 0)
+		/* Just entered SS_CONNECTING state; the only
+		 * difference is that return value in non-blocking
+		 * case is EINPROGRESS, rather than EALREADY.
+		 */
+		res = -EINPROGRESS;
+		break;
+	case SS_CONNECTING:
+		res = -EALREADY;
+		break;
+	case SS_CONNECTED:
+		res = -EISCONN;
+		break;
+	default:
+		res = -EINVAL;
 		goto exit;
+	}
 
-	/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
-	timeout = tipc_sk(sk)->conn_timeout;
-	release_sock(sk);
-	res = wait_event_interruptible_timeout(*sk_sleep(sk),
-			(!skb_queue_empty(&sk->sk_receive_queue) ||
-			(sock->state != SS_CONNECTING)),
-			timeout ? (long)msecs_to_jiffies(timeout)
-				: MAX_SCHEDULE_TIMEOUT);
-	lock_sock(sk);
+	if (sock->state == SS_CONNECTING) {
+		if (!timeout)
+			goto exit;
 
-	if (res > 0) {
-		buf = skb_peek(&sk->sk_receive_queue);
-		if (buf != NULL) {
-			msg = buf_msg(buf);
-			res = auto_connect(sock, msg);
-			if (!res) {
-				if (!msg_data_sz(msg))
-					advance_rx_queue(sk);
-			}
-		} else {
-			if (sock->state == SS_CONNECTED)
-				res = -EISCONN;
+		/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
+		release_sock(sk);
+		res = wait_event_interruptible_timeout(*sk_sleep(sk),
+				sock->state != SS_CONNECTING,
+				timeout ? (long)msecs_to_jiffies(timeout)
+					: MAX_SCHEDULE_TIMEOUT);
+		lock_sock(sk);
+		if (res <= 0) {
+			if (res == 0)
+				res = -ETIMEDOUT;
 			else
-				res = -ECONNREFUSED;
+				; /* leave "res" unchanged */
+			goto exit;
 		}
-	} else {
-		if (res == 0)
-			res = -ETIMEDOUT;
-		else
-			; /* leave "res" unchanged */
-		sock->state = SS_DISCONNECTING;
 	}
 
+	if (unlikely(sock->state == SS_DISCONNECTING))
+		res = sock_error(sk);
+	else
+		res = 0;
+
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 09/10] tipc: add lock nesting notation to quiet lockdep warning
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Ying Xue, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

From: Ying Xue <ying.xue@windriver.com>

TIPC accept() call grabs the socket lock on a newly allocated
socket while holding the socket lock on an old socket. But lockdep
worries that this might be a recursive lock attempt:

  [ INFO: possible recursive locking detected ]
  ---------------------------------------------
  kworker/u:0/6 is trying to acquire lock:
  (sk_lock-AF_TIPC){+.+.+.}, at: [<c8c1226c>] accept+0x15c/0x310 [tipc]

  but task is already holding lock:
  (sk_lock-AF_TIPC){+.+.+.}, at: [<c8c12138>] accept+0x28/0x310 [tipc]

  other info that might help us debug this:
  Possible unsafe locking scenario:

          CPU0
          ----
          lock(sk_lock-AF_TIPC);
          lock(sk_lock-AF_TIPC);

          *** DEADLOCK ***

  May be due to missing lock nesting notation
  [...]

Tell lockdep that this locking is safe by using lock_sock_nested().
This is similar to what was done in commit 5131a184a3458d9 for
SCTP code ("SCTP: lock_sock_nested in sctp_sock_migrate").

Also note that this is isn't something that is seen normally,
as it was uncovered with some experimental work-in-progress
code not yet ready for mainline.  So no need for stable
backports or similar of this commit.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 0df42fa..38613cf 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1551,7 +1551,8 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 		u32 new_ref = new_tport->ref;
 		struct tipc_msg *msg = buf_msg(buf);
 
-		lock_sock(new_sk);
+		/* we lock on new_sk; but lockdep sees the lock on sk */
+		lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
 
 		/*
 		 * Reject any stray messages received by new socket
-- 
1.7.12.1

^ permalink raw reply related

* [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
From: Paul Gortmaker @ 2012-12-07 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jon Maloy, Paul Gortmaker
In-Reply-To: <1354890498-6448-1-git-send-email-paul.gortmaker@windriver.com>

In TIPC's accept() routine, there is a large block of code relating
to initialization of a new socket, all within an if condition checking
if the allocation succeeded.

Here, we simply factor out that init code within the accept() function
to its own separate function, which improves readability, and makes
it easier to track which lock_sock() calls are operating on existing
vs. new sockets.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 49 insertions(+), 44 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 38613cf..56661c8 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
 	return res;
 }
 
+static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
+			     struct sk_buff *buf)
+{
+
+	struct sock *new_sk = new_sock->sk;
+	struct tipc_sock *new_tsock = tipc_sk(new_sk);
+	struct tipc_port *new_tport = new_tsock->p;
+	u32 new_ref = new_tport->ref;
+	struct tipc_msg *msg = buf_msg(buf);
+
+	/* we lock on new_sk; but lockdep sees accept's lock on sk */
+	lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
+
+	/*
+	 * Reject any stray messages received by new socket
+	 * before the socket lock was taken (very, very unlikely)
+	 */
+	reject_rx_queue(new_sk);
+
+	/* Connect new socket to it's peer */
+	new_tsock->peer_name.ref = msg_origport(msg);
+	new_tsock->peer_name.node = msg_orignode(msg);
+	tipc_connect(new_ref, &new_tsock->peer_name);
+	new_sock->state = SS_CONNECTED;
+
+	tipc_set_portimportance(new_ref, msg_importance(msg));
+	if (msg_named(msg)) {
+		new_tport->conn_type = msg_nametype(msg);
+		new_tport->conn_instance = msg_nameinst(msg);
+	}
+
+	/*
+	 * Respond to 'SYN-' by discarding it & returning 'ACK'-.
+	 * Respond to 'SYN+' by queuing it on new socket.
+	 */
+	if (!msg_data_sz(msg)) {
+		struct msghdr m = {NULL,};
+
+		advance_rx_queue(sk);
+		send_packet(NULL, new_sock, &m, 0);
+	} else {
+		__skb_dequeue(&sk->sk_receive_queue);
+		__skb_queue_head(&new_sk->sk_receive_queue, buf);
+	}
+	release_sock(new_sk);
+}
+
 /**
  * accept - wait for connection request
  * @sock: listening socket
@@ -1542,51 +1589,9 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 	}
 
 	buf = skb_peek(&sk->sk_receive_queue);
-
 	res = tipc_create(sock_net(sock->sk), new_sock, 0, 0);
-	if (!res) {
-		struct sock *new_sk = new_sock->sk;
-		struct tipc_sock *new_tsock = tipc_sk(new_sk);
-		struct tipc_port *new_tport = new_tsock->p;
-		u32 new_ref = new_tport->ref;
-		struct tipc_msg *msg = buf_msg(buf);
-
-		/* we lock on new_sk; but lockdep sees the lock on sk */
-		lock_sock_nested(new_sk, SINGLE_DEPTH_NESTING);
-
-		/*
-		 * Reject any stray messages received by new socket
-		 * before the socket lock was taken (very, very unlikely)
-		 */
-		reject_rx_queue(new_sk);
-
-		/* Connect new socket to it's peer */
-		new_tsock->peer_name.ref = msg_origport(msg);
-		new_tsock->peer_name.node = msg_orignode(msg);
-		tipc_connect(new_ref, &new_tsock->peer_name);
-		new_sock->state = SS_CONNECTED;
-
-		tipc_set_portimportance(new_ref, msg_importance(msg));
-		if (msg_named(msg)) {
-			new_tport->conn_type = msg_nametype(msg);
-			new_tport->conn_instance = msg_nameinst(msg);
-		}
-
-		/*
-		 * Respond to 'SYN-' by discarding it & returning 'ACK'-.
-		 * Respond to 'SYN+' by queuing it on new socket.
-		 */
-		if (!msg_data_sz(msg)) {
-			struct msghdr m = {NULL,};
-
-			advance_rx_queue(sk);
-			send_packet(NULL, new_sock, &m, 0);
-		} else {
-			__skb_dequeue(&sk->sk_receive_queue);
-			__skb_queue_head(&new_sk->sk_receive_queue, buf);
-		}
-		release_sock(new_sk);
-	}
+	if (!res)
+		tipc_init_socket(sk, new_sock, buf);
 exit:
 	release_sock(sk);
 	return res;
-- 
1.7.12.1

^ permalink raw reply related

* Re: [PATCH RFC 0/5] Containerize syslog
From: Glauber Costa @ 2012-12-07 14:30 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Rui Xiang, netdev-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Eric W. Biederman
In-Reply-To: <20121207142331.GC4004@sergelap>

On 12/07/2012 06:23 PM, Serge Hallyn wrote:
> Quoting Andrew Morton (akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org):
>> On Mon, 19 Nov 2012 01:51:09 -0800 ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) wrote:
>>
>>> Are there any kernel print statements besides networking stack printks
>>> that we want to move to show up in a new "kernel log" namespace?
>>
>> That's a good question, and afaict it remains unanswered.
> 
> There are some other (not *terribly* compelling) cases.  For instance
> selinux hooks, if you say mount an fs without xattr support or with
> unsupported options, will printk a warning.  Things like stat.c and
> capabilities and syslog print out warnings when userspace uses a
> deprecated somethingorother - old stat syscall or sys_syslog without
> CAP_SYSLOG.  That should go to the container.  Filesystems may give
> warnings (bad mount options for tmpfs, bad uid owner for many of them,
> etc) which belong in the container.  Obviously some belong on the host -
> if they show a corrupt superblock which may indicate an attempt by the
> container to crash the kernel.
> 
>> As so often happens, this patchset's changelogs forgot to describe the
>> reason for the existence of this patchset.  Via a bit of lwn reading
> 
> Not as a separate justification admittedly, but the description was
> meant to explain it:  right now /dev/kmsg and sys_syslog are not safe
> and useful in a container;  syslog messages from host and containers
> can be confusingly intermixed;  and helpful printks are not seen in
> the container.
> 
>> and my awesome telepathic skills, I divine that something in networking
>> is using syslog for kernel->userspace communications.
>>
>> wtf?
> 
> Well, syslog is the kernel->userspace channel of last resort.
> 
>> Wouldn't it be better to just stop doing that, and to implement a
>> respectable and reliable kernel->userspace messaging scheme?
> 
> Convenience functions on top of netlink?
> 
>> And leave syslog alone - it's a crude low-level thing for random
>> unexpected things which operators might want to know about.
> 
> That sentence is a result of not calling a container admin an operator.
> I can't argue it because I'm not sure whether to agree with that
> classification.
> 

I keep asking myself if it isn't the case of forwarding to a container
all messages printed in process context. That will obviously exclude all
messages resulting from kthreads - that will always be in the initial
namespace anyway, interrupts, etc. There is no harm, for instance, in
delivering the same message twice: one to the container, and the other
to the host system.

Isn't it natural that if the kernel printed something on behalf of a
process, whoever is the admin of the machine that process lives on
should see what it is about?

^ permalink raw reply

* Re: [PATCH net-next] rps: overflow prevention for saturated cpus
From: Ben Hutchings @ 2012-12-07 14:51 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, edumazet, therbert
In-Reply-To: <1354826194-9289-1-git-send-email-willemb@google.com>

On Thu, 2012-12-06 at 15:36 -0500, Willem de Bruijn wrote:
> RPS and RFS balance load across cpus with flow affinity. This can
> cause local bottlenecks, where a small number or single large flow
> (DoS) can saturate one CPU while others are idle.
> 
> This patch maintains flow affinity in normal conditions, but
> trades it for throughput when a cpu becomes saturated. Then, packets
> destined to that cpu (only) are redirected to the lightest loaded cpu
> in the rxqueue's rps_map. This breaks flow affinity under high load
> for some flows, in favor of processing packets up to the capacity
> of the complete rps_map cpuset in all circumstances.
[...]
> --- a/Documentation/networking/scaling.txt
> +++ b/Documentation/networking/scaling.txt
> @@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
>  processing on the remote CPU, and any queued packets are then processed
>  up the networking stack.
>  
> +==== RPS Overflow Protection
> +
> +By selecting the same cpu from the cpuset for each packet in the same
> +flow, RPS will cause load imbalance when input flows are not uniformly
> +random. In the extreme case, a single flow, all packets are handled on a
> +single CPU, which limits the throughput of the machine to the throughput
> +of that CPU. RPS has optional overflow protection, which disables flow
> +affinity when an RPS CPU becomes saturated: during overload, its packets
> +will be sent to the least loaded other CPU in the RPS cpuset. To enable
> +this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
> +net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
[...]

This only seems to be suitable for specialised applications where a high
degree of reordering is tolerable.  This documentation should make that
very clear.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Joe Perches @ 2012-12-07 14:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Eric Dumazet, netdev

Given that the struct will always have limit at the start of
a cacheline, why not make  struct ___cacheline_aligned_on_smp
and make limit the first member?

It could make other structs that use struct dql a bit more
predictable or efficient to pack.

(netdev_queue is size reduced from 256 to 192 on x86-32)

---

 include/linux/dynamic_queue_limits.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5621547..2d20b22 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -38,6 +38,8 @@
 #ifdef __KERNEL__
 
 struct dql {
+	unsigned int	limit; /* Must be first field - Current limit */
+
 	/* Fields accessed in enqueue path (dql_queued) */
 	unsigned int	num_queued;		/* Total ever queued */
 	unsigned int	adj_limit;		/* limit + num_completed */
@@ -45,7 +47,6 @@ struct dql {
 
 	/* Fields accessed only by completion path (dql_completed) */
 
-	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
 	unsigned int	num_completed;		/* Total ever completed */
 
 	unsigned int	prev_ovlimit;		/* Previous over limit */
@@ -59,7 +60,7 @@ struct dql {
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
 	unsigned int	slack_hold_time;	/* Time to measure slack */
-};
+} ____cacheline_aligned_in_smp;
 
 /* Set some static maximums */
 #define DQL_MAX_OBJECT (UINT_MAX / 16)

^ permalink raw reply related

* Re: [patch net-next] net: call notifiers for mtu change even if iface is not up
From: Neil Horman @ 2012-12-07 15:07 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, edumazet, bhutchings, psimerda
In-Reply-To: <20121207122920.GB1631@minipsycho.zyxel.com>

On Fri, Dec 07, 2012 at 01:29:20PM +0100, Jiri Pirko wrote:
> Mon, Dec 03, 2012 at 04:22:50PM CET, nhorman@tuxdriver.com wrote:
> >On Mon, Dec 03, 2012 at 03:22:29PM +0100, Jiri Pirko wrote:
> >> Mon, Dec 03, 2012 at 03:18:23PM CET, nhorman@tuxdriver.com wrote:
> >> >On Mon, Dec 03, 2012 at 12:16:32PM +0100, Jiri Pirko wrote:
> >> >> Do the same thing as in set mac. Call notifiers every time.
> >> >> 
> >> >> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> >> >> ---
> >> >>  net/core/dev.c | 2 +-
> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> >> index 2f94df2..0685a72 100644
> >> >> --- a/net/core/dev.c
> >> >> +++ b/net/core/dev.c
> >> >> @@ -4971,7 +4971,7 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
> >> >>  	else
> >> >>  		dev->mtu = new_mtu;
> >> >>  
> >> >> -	if (!err && dev->flags & IFF_UP)
> >> >> +	if (!err)
> >> >>  		call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
> >> >>  	return err;
> >> >>  }
> >> >
> >> >I'm not opposed to this change, but is there something that it expressly fixes?
> >> 
> >> This is about a consistency. To have the same behaviour as set_mac
> >> for example.
> >> 
> >> >While it doesn't hurt to send around mtu change events, one would presume that
> >> >listeners would pick up mtu changes when the NETDEV_UP event went' around.
> >> >
> >> >Neil
> >> >
> >I figured, I just wanted to be sure that there wasn't a bug fix that needed
> >documenting in the changelog, or that listeners for this event didn't expect the
> >interface to already be up.  It looks pretty clean, with the exception of the
> >TIPC protocol.  It appears that it might require enable_bearer to be called
> >prior to any netdevice CHANGEMTU events, so that the eth_bearers array gets
> >filled out, and it looks like that has to get triggered either by user space, or
> >the receipt of a message over the network.  Let me know what you think
> 
> I'm looking at the code. What is the difference between CHANGEMTU and
> CHANGEADDR here? It looks to me that it should work for both in the same
> way.
> 
Hmm, they do work in the same way, which likely means that tipc is probably
begging to oops there regardless of having your patch or not :)

I'll get around to fixing that shortly.  Since your patch then isn't going to
break anything that isn't already broken
Acked-by: Neil Horman <nhorman@tuxdriver.com>

> 
> >
> >Neil
> >
> >> >--
> >> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> >the body of a message to majordomo@vger.kernel.org
> >> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> 

^ permalink raw reply

* Re: [PATCH 2/4] net: remove obsolete simple_strto<foo>
From: Neil Horman @ 2012-12-07 15:22 UTC (permalink / raw)
  To: Abhijit Pawar
  Cc: David S. Miller, Pablo Neira Ayuso, Patrick McHardy,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	John W. Linville, Johannes Berg, Cong Wang, Eric Dumazet,
	Joe Perches, netdev, linux-kernel, netfilter-devel, netfilter,
	coreteam, linux-wireless
In-Reply-To: <1354880998-23417-1-git-send-email-abhi.c.pawar@gmail.com>

On Fri, Dec 07, 2012 at 05:19:58PM +0530, Abhijit Pawar wrote:
> This patch replace the obsolete simple_strto<foo> with kstrto<foo>
> 
> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
> ---
>  net/core/netpoll.c                 |    9 +++++++--
>  net/ipv4/netfilter/ipt_CLUSTERIP.c |    9 +++++++--
>  net/mac80211/debugfs_sta.c         |    4 +++-
>  net/netfilter/nf_conntrack_core.c  |    6 ++++--
>  4 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 77a0388..596b127 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -668,13 +668,16 @@ EXPORT_SYMBOL(netpoll_print_options);
>  
>  int netpoll_parse_options(struct netpoll *np, char *opt)
>  {
> +	int rc;
>  	char *cur=opt, *delim;
>  
>  	if (*cur != '@') {
>  		if ((delim = strchr(cur, '@')) == NULL)
>  			goto parse_failed;
>  		*delim = 0;
> -		np->local_port = simple_strtol(cur, NULL, 10);
> +		rc = kstrtol(cur, 10, &np->local_port);
> +		if (rc)
> +			goto parse_failed;
Perhaps consolidate this to:
if (kstrtol(cur, 10, &np->local_port)
	goto parse_failed

Then you don't have to declare the new stack variable

>  		cur = delim;
>  	}
>  	cur++;
> @@ -705,7 +708,9 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
>  		*delim = 0;
>  		if (*cur == ' ' || *cur == '\t')
>  			np_info(np, "warning: whitespace is not allowed\n");
> -		np->remote_port = simple_strtol(cur, NULL, 10);
> +		rc = kstrtol(cur, 10, &np->remote_port);
> +		if (rc)
> +			goto parse_failed;
>  		cur = delim;
Ditto

>  	}
>  	cur++;
> diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
> index fe5daea..55e7b73 100644
> --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
> +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
> @@ -661,6 +661,7 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
>  #define PROC_WRITELEN	10
>  	char buffer[PROC_WRITELEN+1];
>  	unsigned long nodenum;
> +	int rc;
>  
>  	if (size > PROC_WRITELEN)
>  		return -EIO;
> @@ -669,11 +670,15 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
>  	buffer[size] = 0;
>  
>  	if (*buffer == '+') {
> -		nodenum = simple_strtoul(buffer+1, NULL, 10);
> +		rc = kstrtoul(buffer+1, 10, &nodenum);
> +		if (rc)
> +			return -EINVAL;
>  		if (clusterip_add_node(c, nodenum))
>  			return -ENOMEM;
>  	} else if (*buffer == '-') {
> -		nodenum = simple_strtoul(buffer+1, NULL,10);
> +		rc = kstrtoul(buffer+1, 10, &nodenum);
> +		if (rc)
> +			return -EINVAL;
>  		if (clusterip_del_node(c, nodenum))
>  			return -ENOENT;
Same deal with the rc variable, although in this case it might make sense to
return rc if kstrtoul fails, instead of just filtering it all down to -EINVAL.

>  	} else
> diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
> index 89281d2..18754fd 100644
> --- a/net/mac80211/debugfs_sta.c
> +++ b/net/mac80211/debugfs_sta.c
> @@ -219,7 +219,9 @@ static ssize_t sta_agg_status_write(struct file *file, const char __user *userbu
>  	} else
>  		return -EINVAL;
>  
> -	tid = simple_strtoul(buf, NULL, 0);
> +	ret = kstrtoul(buf, 0, &tid);
> +	if (ret)
> +		return -EINVAL;
>  
>  	if (tid >= IEEE80211_NUM_TIDS)
>  		return -EINVAL;
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index af17516..18ce24b 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1409,7 +1409,7 @@ EXPORT_SYMBOL_GPL(nf_ct_alloc_hashtable);
>  
>  int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
>  {
> -	int i, bucket;
> +	int i, bucket, rc;
>  	unsigned int hashsize, old_size;
>  	struct hlist_nulls_head *hash, *old_hash;
>  	struct nf_conntrack_tuple_hash *h;
> @@ -1422,7 +1422,9 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
>  	if (!nf_conntrack_htable_size)
>  		return param_set_uint(val, kp);
>  
> -	hashsize = simple_strtoul(val, NULL, 0);
> +	rc = kstrtouint(val, 0, &hashsize);
> +	if (rc)
> +		return -EINVAL;
>  	if (!hashsize)
>  		return -EINVAL;
>  
As above, these call points might benefit from returning rc rather than just
EINVAL.

Neil

> -- 
> 1.7.7.6
> 
> 

^ permalink raw reply

* Tx timestamp for packet mmap.
From: Paul Chavent @ 2012-12-07 15:28 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1814 bytes --]

Hi.

I would like to be able to get tx timestamps of packets sent by the 
packet mmap interface...

Actually, I try to get them with the sample code below.

The problem is that it doesn't work without the joined patch.

I wonder if my current implementation is good. And if not, how should i 
get the timestamps ?

Wouldn't be a good idea to put timestamps in the ring buffer frame 
before give back the frame to the user ?

Thanks for your help and advices.

Paul.

8<-------------------------------


   struct timespec ts = {0,0};
   struct sockaddr from_addr;
   static uint8_t tmp_data[256];
   struct iovec msg_iov = {tmp_data, sizeof(tmp_data)};
   static uint8_t cmsg_buff[256];
   struct msghdr msghdr = {&from_addr, sizeof(from_addr),
                           &msg_iov, 1,
                           cmsg_buff, sizeof(cmsg_buff),
                           0};
   ssize_t err = recvmsg(itf->sock_fd, &msghdr, MSG_ERRQUEUE);
   if(err < 0)
     {
       perror("recvmsg failed");
       return -1;
     }

   struct cmsghdr *cmsg;
   for(cmsg = CMSG_FIRSTHDR(&msghdr); cmsg != NULL; cmsg = 
CMSG_NXTHDR(&msghdr, cmsg))
     {
       if(cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == 
SCM_TIMESTAMPING)
         {
           ts = *(struct timespec *)CMSG_DATA(cmsg);
#if !defined(NDEBUG)
           if(itf->debug)
             {
               fprintf(stderr, "SCM_TIMESTAMPING available\n");
             }
#else
           break;
#endif
         }
       else if (cmsg->cmsg_level == SOL_PACKET && cmsg->cmsg_type == 
PACKET_TX_TIMESTAMP)
         {
           ts = *(struct timespec *)CMSG_DATA(cmsg);
#if !defined(NDEBUG)
           if(itf->debug)
             {
               fprintf(stderr, "PACKET_TX_TIMESTAMP available\n");
             }
#else
           break;
#endif
         }
     }

[-- Attachment #2: 0002-net-add-tx-timestamp-to-packet-mmap.patch --]
[-- Type: text/x-patch, Size: 763 bytes --]

>From 762a8e89d1453e629bbe9c255c0ba4ec207cca25 Mon Sep 17 00:00:00 2001
From: Paul Chavent <paul.chavent@onera.fr>
Date: Fri, 7 Dec 2012 16:15:44 +0100
Subject: [PATCH 2/2] net : add tx timestamp to packet mmap.

Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
---
 net/packet/af_packet.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e639645..948748b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1857,6 +1857,10 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	void *data;
 	int err;
 
+	err = sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+	if (err < 0)
+		return err;
+
 	ph.raw = frame;
 
 	skb->protocol = proto;
-- 
1.7.12.1


^ permalink raw reply related

* [PATCH stable] ipv4: avoid passing NULL to inet_putpeer() in icmpv4_xrlim_allow()
From: CAI Qian @ 2012-12-07 15:46 UTC (permalink / raw)
  To: netdev; +Cc: Neal Cardwell, David S. Miller
In-Reply-To: <1411074275.5265862.1354894974886.JavaMail.root@redhat.com>

David, this patch looks applicable for the stable releases.

>From Neal Cardwell <ncardwell@google.com>

inet_getpeer_v4() can return NULL under OOM conditions, and while
inet_peer_xrlim_allow() is OK with a NULL peer, inet_putpeer() will
crash.

This code path now uses the same idiom as the others from:
1d861aa4b3fb08822055345f480850205ffe6170 ("inet: Minimize use of
cached route inetpeer.").

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream-ID: e1a676424c290b1c8d757e3860170ac7ecd89af4
Stable-trees: 3.6.x
Signed-off-by: CAI Qian <caiqian@redhat.com>

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f2eccd5..17ff9fd 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -257,7 +257,8 @@ static inline bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 		struct inet_peer *peer = inet_getpeer_v4(net->ipv4.peers, fl4->daddr, 1);
 		rc = inet_peer_xrlim_allow(peer,
 					   net->ipv4.sysctl_icmp_ratelimit);
-		inet_putpeer(peer);
+		if (peer)
+			inet_putpeer(peer);
 	}
 out:
 	return rc;

^ permalink raw reply related

* Re: [RFC PATCH] dynamic_queue_limit.h: Make the struct ___cacheline_aligned_on_smp
From: Eric Dumazet @ 2012-12-07 15:55 UTC (permalink / raw)
  To: Joe Perches; +Cc: Tom Herbert, David Miller, netdev
In-Reply-To: <1354892334.29937.14.camel@joe-AO722>

2012/12/7 Joe Perches <joe@perches.com>:
> Given that the struct will always have limit at the start of
> a cacheline, why not make  struct ___cacheline_aligned_on_smp
> and make limit the first member?
>
> It could make other structs that use struct dql a bit more
> predictable or efficient to pack.
>
> (netdev_queue is size reduced from 256 to 192 on x86-32)
>

No, please.

Have you tested this on a range of hardware and check how it can hurt
performance ?

^ permalink raw reply

* [PATCH stable] ipv4: do not cache looped multicasts
From: CAI Qian @ 2012-12-07 15:58 UTC (permalink / raw)
  To: netdev; +Cc: Maxime Bizon, Julian Anastasov, David S. Miller
In-Reply-To: <2138496357.5269916.1354895709722.JavaMail.root@redhat.com>

David, this looks like applicable to the stable releases.

>From Maxime Bizon <mbizon@freebox.fr>,

	Starting from 3.6 we cache output routes for
multicasts only when using route to 224/4. For local receivers
we can set RTCF_LOCAL flag depending on the membership but
in such case we use maddr and saddr which are not caching
keys as before. Additionally, we can not use same place to
cache routes that differ in RTCF_LOCAL flag value.

	Fix it by caching only RTCF_MULTICAST entries
without RTCF_LOCAL (send-only, no loopback). As a side effect,
we avoid unneeded lookup for fnhe when not caching because
multicasts are not redirected and they do not learn PMTU.

	Thanks to Maxime Bizon for showing the caching
problems in __mkroute_output for 3.6 kernels: different
RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or
ip_output call and the visible problem is that traffic can
not reach local receivers via loopback.

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Tested-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

Upstream-ID: 636174219b52b5a8bc51bc23bbcba97cd30a65e3
Stable-trees: 3.6.x
Signed-off-by: CAI Qian <caiqian@redhat.com>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 200d287..df25142 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1785,6 +1785,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	if (dev_out->flags & IFF_LOOPBACK)
 		flags |= RTCF_LOCAL;
 
+	do_cache = true;
 	if (type == RTN_BROADCAST) {
 		flags |= RTCF_BROADCAST | RTCF_LOCAL;
 		fi = NULL;
@@ -1793,6 +1794,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		if (!ip_check_mc_rcu(in_dev, fl4->daddr, fl4->saddr,
 				     fl4->flowi4_proto))
 			flags &= ~RTCF_LOCAL;
+		else
+			do_cache = false;
 		/* If multicast route do not exist use
 		 * default one, but do not gateway in this case.
 		 * Yes, it is hack.
@@ -1802,8 +1805,8 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	fnhe = NULL;
-	do_cache = fi != NULL;
-	if (fi) {
+	do_cache &= fi != NULL;
+	if (do_cache) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);

^ permalink raw reply related

* Re: [PATCH net-next] rps: overflow prevention for saturated cpus
From: Willem de Bruijn @ 2012-12-07 16:04 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, David Miller, Eric Dumazet, Tom Herbert
In-Reply-To: <50C12E22.3030206@hp.com>

On Thu, Dec 6, 2012 at 6:45 PM, Rick Jones <rick.jones2@hp.com> wrote:
> On 12/06/2012 03:04 PM, Willem de Bruijn wrote:
>>
>> On Thu, Dec 6, 2012 at 5:25 PM, Rick Jones <rick.jones2@hp.com> wrote:
>>>
>>> I thought (one of) the ideas behind RFS at least was to give the CPU
>>> scheduler control over where network processing took place instead of it
>>> being dictated solely by the addressing. I would have expected the CPU
>>> scheduler to migrate some work off the saturated CPU.  Or will this only
>>> affect RPS and not RFS?
>>
>>
>> I wrote it with RPS in mind, indeed. With RFS, for sufficiently
>> multithreaded applications that are unpinned, the scheduler will
>> likely spread the threads across as many cpus as possible. In that
>> case, the mechanism will not kick in, or as quickly. Even with RFS,
>> pinned threads and single-threaded applications will likely also
>> benefit during high load from redirecting kernel receive
>> processing away from the cpu that runs the application thread. I
>> haven't tested that case independently.
>
>
> Unless that single-threaded application (or single receiving thread) is
> pinned to a CPU, isn't there a non-trivial chance that incoming traffic
> flowing up different CPUs will cause it to be bounced from one CPU to
> another, taking its cache lines with it and not just the "intra-stack" cache
> lines?

Yes. The patch restricts the offload cpus to rps_cpus, with the assumption
that this is a small subset of all cpus. In that case, other workloads will
eventually migrate to the remainder. I previously tested spreading across
all cpus, which indeed did interfere with the userspace threads.

> Long (?) ago and far away it was possible to say that a given IRQ should be
> potentially serviced by more than one CPU (if I recall though not phrase
> correctly).  Didn't that get taken away because it did such nasty things
> like reordering and such?  (Admittedly, I'm really stretching the limits of
> my dimm memory there)

Sounds familiar. Wasn't there a mechanism to periodically switch the
destination cpu? If at HZ granularity, that is very coarse grain compared to
Mpps, but out of order does seem likely. I assume that this patch will lead
to a steady state where userspace and kernel receive run on disjoint cpusets,
due to the rps_cpus set being hot with kernel receive processing. That said,
I can run a test with RFS enabled to see whether that actually holds.

>>> What kind of workload is this targeting that calls for
>>> such intra-flow parallelism?
>>
>>
>> Packet processing middeboxes that rather operate in degraded mode
>> (reordering) than drop packets. Intrusion detection systems and proxies,
>> for instance. These boxes are actually likely to have RPS enabled and
>> RFS disabled.
>>
>>> With respect to the examples given, what happens when it is TCP traffic
>>> rather than UDP?
>>
>>
>> That should be identical. RFS is supported for both protocols. In the
>> test, it is turned off to demonstrate the effect solely with RPS.
>
>
> Will it be identical with TCP?  If anything, I would think causing
> reordering of the TCP segments within flows would only further increase the
> workload of the middlebox because it will increase the ACK rates. Perhaps
> quite significantly if GRO was effective at the receivers before the
> reordering started.
>
> At least unless/until the reordering is bad enough to cause the sending TCPs
> to fast retransmit and so throttle back.  And unless we are talking about
> being overloaded by massive herds of "mice" I'd think that the TCP flows
> would be throttling back to what the single CPU in the middlebox could
> handle.

Agreed, I will try to get some data on the interaction with TCP flows. My
hunch is that they throttle down due to the reordering, but data is more useful.
The initial increase in ACKs, if any, will likely not increase rate beyond a
small factor.

The situations that this patch mean to address are more straightforward
DoS attacks, where a box can handle normal load with a big safety margin,
but falls over at a 10x or 100x flood of TCP SYN or similar packets.

> rick

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox