Netdev List
 help / color / mirror / Atom feed
* [PATCHv6] virtio-spec: virtio network device multiqueue support
From: Michael S. Tsirkin @ 2012-12-07 14:18 UTC (permalink / raw)
  To: Jason Wang; +Cc: rusty, virtualization, netdev, kvm, bhutchings

Add multiqueue support to virtio network device.
Add a new feature flag VIRTIO_NET_F_MQ for this feature, a new
configuration field max_virtqueue_pairs to detect supported number of
virtqueues as well as a new command VIRTIO_NET_CTRL_MQ to program
packet steering for unidirectional protocols.

---

Changes in v6:
- rename RFS -> multiqueue to avoid confusion with RFS in linux
  mention automatic receive steering as Rusty suggested

Changes in v5:
- Address Rusty's comments.
  Changes are only in the text, not the ideas.
- Some minor formatting changes.

Changes in v4:
- address Jason's comments
- have configuration specify the number of VQ pairs and not pairs - 1

Changes in v3:
- rename multiqueue -> rfs this is what we support
- Be more explicit about what driver should do.
- Simplify layout making VQs functionality depend on feature.
- Remove unused commands, only leave in programming # of queues

Changes in v2:
Address Jason's comments on v2:
- Changed STEERING_HOST to STEERING_RX_FOLLOWS_TX:
   this is both clearer and easier to support.
   It does not look like we need a separate steering command
   since host can just watch tx packets as they go.
- Moved RX and TX steering sections near each other.
- Add motivation for other changes in v2

Changes in v1 (from Jason's rfc):
- reserved vq 3: this makes all rx vqs even and tx vqs odd, which
   looks nicer to me.
- documented packet steering, added a generalized steering programming
   command. Current modes are single queue and host driven multiqueue,
   but I envision support for guest driven multiqueue in the future.
- make default vqs unused when in mq mode - this wastes some memory
   but makes it more efficient to switch between modes as
   we can avoid this causing packet reordering.

Rusty, could you please take a look and comment soon?
If this looks OK to everyone, we can proceed with finalizing the
implementation. Would be nice to try and put it in 3.8.

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index 83f2771..c5b32c4 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -59,6 +59,7 @@
 \author -608949062 "Rusty Russell,,," 
 \author -385801441 "Cornelia Huck" cornelia.huck@de.ibm.com
 \author 1531152142 "Paolo Bonzini,,," 
+\author 1986246365 "Michael S. Tsirkin" 
 \end_header
 
 \begin_body
@@ -4170,9 +4171,46 @@ ID 1
 \end_layout
 
 \begin_layout Description
-Virtqueues 0:receiveq.
- 1:transmitq.
- 2:controlq
+Virtqueues 0:receiveq
+\change_inserted 1986246365 1352742829
+0
+\change_unchanged
+.
+ 1:transmitq
+\change_inserted 1986246365 1352742832
+0
+\change_deleted 1986246365 1352742947
+.
+ 
+\change_inserted 1986246365 1352742952
+.
+ ....
+ 2N
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1354531595
+N=0 if VIRTIO_NET_F_MQ is not negotiated, otherwise N is derived from 
+\emph on
+max_virtqueue_pairs
+\emph default
+ control
+\emph on
+ 
+\emph default
+field.
+ 
+\end_layout
+
+\end_inset
+
+: receivqN.
+ 2N+1: transmitqN.
+ 2N+
+\change_unchanged
+2:controlq
 \begin_inset Foot
 status open
 
@@ -4343,6 +4381,16 @@ VIRTIO_NET_F_CTRL_VLAN
 
 \begin_layout Description
 VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
+\change_inserted 1986246365 1352742767
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1986246365 1352742808
+VIRTIO_NET_F_MQ(22) Device supports multiqueue with automatic receive steering.
+\change_unchanged
+
 \end_layout
 
 \end_deeper
@@ -4355,11 +4403,45 @@ configuration
 \begin_inset space ~
 \end_inset
 
-layout Two configuration fields are currently defined.
+layout 
+\change_deleted 1986246365 1352743300
+Two
+\change_inserted 1986246365 1354531413
+Three
+\change_unchanged
+ configuration fields are currently defined.
  The mac address field always exists (though is only valid if VIRTIO_NET_F_MAC
  is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
  Two read-only bits are currently defined for the status field: VIRTIO_NET_S_LIN
 K_UP and VIRTIO_NET_S_ANNOUNCE.
+
+\change_inserted 1986246365 1354531470
+ The following read-only field, 
+\emph on
+max_virtqueue_pairs
+\emph default
+ only exists if VIRTIO_NET_F_MQ is set.
+ This field specifies the maximum number of each of transmit and receive
+ virtqueues (receiveq0..receiveq
+\emph on
+N
+\emph default
+ and transmitq0..transmitq
+\emph on
+N
+\emph default
+ respectively; 
+\emph on
+N
+\emph default
+=
+\emph on
+max_virtqueue_pairs - 1
+\emph default
+) that can be configured once VIRTIO_NET_F_MQ is negotiated.
+ Legal values for this field are 1 to 8000h.
+
+\change_unchanged
  
 \begin_inset listings
 inline false
@@ -4392,6 +4474,17 @@ struct virtio_net_config {
 \begin_layout Plain Layout
 
     u16 status;
+\change_inserted 1986246365 1354531427
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1354531437
+
+    u16 max_virtqueue_pairs;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -4410,7 +4503,24 @@ Device Initialization
 
 \begin_layout Enumerate
 The initialization routine should identify the receive and transmission
- virtqueues.
+ virtqueues
+\change_inserted 1986246365 1352744077
+, up to N+1 of each kind
+\change_unchanged
+.
+
+\change_inserted 1986246365 1352743942
+ If VIRTIO_NET_F_MQ feature bit is negotiated, 
+\emph on
+N=max_virtqueue_pairs-1
+\emph default
+, otherwise identify 
+\emph on
+N=0
+\emph default
+.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Enumerate
@@ -4452,10 +4562,33 @@ status
 
  config field.
  Otherwise, the link should be assumed active.
+\change_inserted 1986246365 1354529306
+
 \end_layout
 
 \begin_layout Enumerate
-The receive virtqueue should be filled with receive buffers.
+
+\change_inserted 1986246365 1354531717
+Only receiveq0, transmitq0 and controlq are used by default.
+ To use more queues driver must negotiate the VIRTIO_NET_F_MQ feature;
+ initialize up to 
+\emph on
+max_virtqueue_pairs
+\emph default
+ of each of transmit and receive queues; execute_VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SE
+T command specifying the number of the transmit and receive queues that
+ is going to be used and wait until the device consumes the controlq buffer
+ and acks this command.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Enumerate
+The receive virtqueue
+\change_inserted 1986246365 1352743953
+s
+\change_unchanged
+ should be filled with receive buffers.
  This is described in detail below in 
 \begin_inset Quotes eld
 \end_inset
@@ -4550,8 +4683,15 @@ Device Operation
 \end_layout
 
 \begin_layout Standard
-Packets are transmitted by placing them in the transmitq, and buffers for
- incoming packets are placed in the receiveq.
+Packets are transmitted by placing them in the transmitq
+\change_inserted 1986246365 1353593685
+0..transmitqN
+\change_unchanged
+, and buffers for incoming packets are placed in the receiveq
+\change_inserted 1986246365 1353593692
+0..receiveqN
+\change_unchanged
+.
  In each case, the packet itself is preceeded by a header:
 \end_layout
 
@@ -4861,6 +5001,17 @@ If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the
 struct virtio_net_hdr
 \family default
 .
+\change_inserted 1986246365 1353594518
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594638
+If VIRTIO_NET_F_MQ is negotiated, each of receiveq0...receiveqN that will
+ be used should be populated with receive buffers.
+\change_unchanged
+
 \end_layout
 
 \begin_layout Subsection*
@@ -5293,8 +5444,151 @@ Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control vq.
  
 \end_layout
 
-\begin_layout Enumerate
+\begin_layout Subsection*
+
+\change_inserted 1986246365 1353593879
+Automatic receive steering in multiqueue mode
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354528882
+If the driver negotiates the VIRTIO_NET_F_MQ feature bit (depends on VIRTIO_NET
+_F_CTRL_VQ), it can transmit outgoing packets on one of the multiple transmitq0..t
+ransmitqN and ask the device to queue incoming packets into one the multiple
+ receiveq0..receiveqN depending on the packet flow.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353594292
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594178
+
+struct virtio_net_ctrl_mq {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594212
+
+	u16 virtqueue_pairs;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594172
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594263
+
+#define VIRTIO_NET_CTRL_MQ    1
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET        0 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN        1 
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 1986246365 1353594273
+
+ #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX        0x8000
+\end_layout
+
+\end_inset
+
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354531492
+Multiqueue is disabled by default.
+ Driver enables multiqueue by executing the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command,
+ specifying the number of the transmit and receive queues that will be used;
+ thus transmitq0..transmitqn and receiveq0..receiveqn where 
+\emph on
+n=virtqueue_pairs-1
+\emph default
+ will be used.
+ All these virtqueues must have been pre-configured in advance.
+ The range of legal values for the
+\emph on
+ virtqueue_pairs
+\emph off
+ field is between 1 and 
+\emph on
+max_virtqueue_pairs
+\emph off
+.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1353595328
+When multiqueue is enabled, device uses automatic receive
+steering based on packet flow.
+Programming of the receive steering classificator is implicit.
+ Transmitting a packet of a specific flow on transmitqX will cause incoming
+ packets for this flow to be steered to receiveqX.
+ For uni-directional protocols, or where no packets have been transmitted
+ yet, device will steer a packet to a random queue out of the specified
+ receiveq0..receiveqn.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 1986246365 1354528710
+Multiqueue is disabled by setting 
+\emph on
+virtqueue_pairs = 1
+\emph default
+ (this is the default).
+ After the command is consumed by the device, the device will not steer
+ new packets on virtqueues receveq1..receiveqN (i.e.
+ other than receiveq0) nor read from transmitq1..transmitqN (i.e.
+ other than transmitq0); accordingly, driver should not transmit new packets
+ on virtqueues other than transmitq0.
+\change_unchanged
+
+\end_layout
+
+\begin_layout Standard
+
+\change_deleted 1986246365 1353593873
 .
+
+\change_unchanged
  
 \end_layout
 

^ permalink raw reply related

* GPF in ip6_dst_lookup_tail
From: Dave Jones @ 2012-12-07 14:15 UTC (permalink / raw)
  To: netdev

I just hit this gpf in overnight testing.

general protection fault: 0000 [#1] PREEMPT SMP 
Modules linked in: sctp libcrc32c ipt_ULOG fuse binfmt_misc nfnetlink nfc caif_socket caif phonet bluetooth rfkill can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 nfsv3 nfs_acl nfs fscache lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables usb_debug microcode serio_raw pcspkr i2c_piix4 i2c_core r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm
CPU 3 
Pid: 19371, comm: trinity-child3 Not tainted 3.7.0-rc8+ #6 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff815fe038>]  [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
RSP: 0018:ffff880046277968  EFLAGS: 00010206
RAX: 2000000000000011 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff880046277b00 RSI: ffff880070a6cd80 RDI: ffff8800bb22d000
RBP: ffff8800462779f8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff880046277a10
R13: ffff880046277b00 R14: ffff8800bb22d000 R15: ffffffff81cb40c0
FS:  00007f0817b3f740(0000) GS:ffff88012b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000004 CR3: 00000000a7dfb000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child3 (pid: 19371, threadinfo ffff880046276000, task ffff880044dca450)
Stack:
 ffff8800bb22d000 0000000000000002 0000000000000001 0000000000000000
 ffff880046277998 ffffffff81335b56 ffff8800462779b8 ffffffff81074cec
 0000000000000001 ffff880070a6cd80 ffff8800462779f8 00000000ae4317df
Call Trace:
 [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
 [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
 [<ffffffff81527945>] ? sk_dst_check+0x5/0x260
 [<ffffffff815fe3fb>] ip6_sk_dst_lookup_flow+0xcb/0x1b0
 [<ffffffff8161f45e>] udpv6_sendmsg+0x66e/0xb90
 [<ffffffff81335b56>] ? __const_udelay+0x36/0x40
 [<ffffffff81074cec>] ? __rcu_read_unlock+0x5c/0xa0
 [<ffffffff815bd4e4>] inet_sendmsg+0x114/0x230
 [<ffffffff815bd3d5>] ? inet_sendmsg+0x5/0x230
 [<ffffffff815261e9>] ? sock_update_classid+0xa9/0x2d0
 [<ffffffff81526270>] ? sock_update_classid+0x130/0x2d0
 [<ffffffff81520d1c>] sock_sendmsg+0xbc/0xf0
 [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810bd037>] ? lock_release_non_nested+0x2b7/0x2f0
 [<ffffffff815210fc>] __sys_sendmsg+0x3ac/0x3c0
 [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff810b7ab2>] ? get_lock_stats+0x22/0x70
 [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
 [<ffffffff8100a136>] ? native_sched_clock+0x26/0x90
 [<ffffffff810b7b88>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff81055e54>] ? do_setitimer+0x1c4/0x300
 [<ffffffff810b7f1e>] ? put_lock_stats.isra.23+0xe/0x40
 [<ffffffff811d6bea>] ? fget_light+0x3ca/0x500
 [<ffffffff810bdf9d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff815237d9>] sys_sendmsg+0x49/0x90
 [<ffffffff81684b42>] system_call_fastpath+0x16/0x1b
Code: 00 00 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 0f 1f 00 49 8b 34 24 48 8b 86 98 00 00 00 48 85 c0 74 c2 <f6> 80 6d 01 00 00 de 75 b9 48 8b 56 18 49 8d 75 24 b9 01 00 00 
RIP  [<ffffffff815fe038>] ip6_dst_lookup_tail+0xe8/0x200
 RSP <ffff880046277968>
---[ end trace f7dde22e5674fdd6 ]---



0000000000000000 <.text>:
   0:	00 00                	add    %al,(%rax)
   2:	48 8b 5d d8          	mov    -0x28(%rbp),%rbx
   6:	4c 8b 65 e0          	mov    -0x20(%rbp),%r12
   a:	4c 8b 6d e8          	mov    -0x18(%rbp),%r13
   e:	4c 8b 75 f0          	mov    -0x10(%rbp),%r14
  12:	4c 8b 7d f8          	mov    -0x8(%rbp),%r15
  16:	c9                   	leaveq 
  17:	c3                   	retq   
  18:	0f 1f 00             	nopl   (%rax)
  1b:	49 8b 34 24          	mov    (%r12),%rsi
  1f:	48 8b 86 98 00 00 00 	mov    0x98(%rsi),%rax
  26:	48 85 c0             	test   %rax,%rax
  29:	74 c2                	je     0xffffffffffffffed

/home/davej/tmp/tmp.YOesIDyrIr.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	f6 80 6d 01 00 00 de 	testb  $0xde,0x16d(%rax)
   7:	75 b9                	jne    0xffffffffffffffc2
   9:	48 8b 56 18          	mov    0x18(%rsi),%rdx
   d:	49 8d 75 24          	lea    0x24(%r13),%rsi
  11:	b9                   	.byte 0xb9
  12:	01 00                	add    %eax,(%rax)
	...


which looks like an inlined copy of ipv6_addr_any

         * marked as OPTIMISTIC, we release the found
         * dst entry and replace it instead with the
         * dst entry of the nexthop router
         */
        rt = (struct rt6_info *) *dst;
        n = rt->n;
      dc:       48 8b 86 98 00 00 00    mov    0x98(%rsi),%rax
        if (n && !(n->nud_state & NUD_VALID)) {
      e3:       48 85 c0                test   %rax,%rax
      e6:       74 c2                   je     aa <ip6_dst_lookup_tail+0xaa>
      e8:       f6 80 6d 01 00 00 de    testb  $0xde,0x16d(%rax)
      ef:       75 b9                   jne    aa <ip6_dst_lookup_tail+0xaa>


RAX here is 2000000000000011 , which clearly isn't a valid rt address to dereference.

Suddenly this starts to look familiar..  I started seeing this around 3.6-rc7
back in September.  http://www.spinics.net/lists/netdev/msg211894.html

	Dave

^ permalink raw reply

* [PATCH V2 wireless-next] iwlwifi: iwlagn_request_scan: Fix check for priv->scan_request
From: Tim Gardner @ 2012-12-07 13:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Tim Gardner, Johannes Berg, Wey-Yi Guy, Intel Linux Wireless,
	John W. Linville, Emmanuel Grumbach, Don Fry, linux-wireless,
	netdev
In-Reply-To: <1354830114.18211.4.camel@jlt4.sipsolutions.net>

The WARN_ON_ONCE() check for scan_request will not correctly detect
a NULL pointer for scan_type == IWL_SCAN_NORMAL. Make it explicit
that the check only applies to normal scans.

Convert WARN_ON_ONCE to WARN_ON since priv->scan_request really _can't_
be NULL for normal scans. If it is then we should emit frequent warnings.

This smatch warning led to scrutiny of iwlagn_request_scan():

drivers/net/wireless/iwlwifi/dvm/scan.c:894 iwlagn_request_scan() error: we previously assumed 'priv->scan_request' could be null (see line 792)

Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
Cc: Intel Linux Wireless <ilw@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Don Fry <donald.h.fry@intel.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---

This patch does apply to 3.6.y, but it doesn't fix an existing
bug so I don't think it qualifies. This patch simply makes
the driver more robust for future development.

V2 - corrected indentation more like the rest of the source
in this file.

 drivers/net/wireless/iwlwifi/dvm/scan.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/dvm/scan.c b/drivers/net/wireless/iwlwifi/dvm/scan.c
index bb9f625..fe91c5a 100644
--- a/drivers/net/wireless/iwlwifi/dvm/scan.c
+++ b/drivers/net/wireless/iwlwifi/dvm/scan.c
@@ -673,8 +673,9 @@ static int iwlagn_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	const u8 *ssid = NULL;
 	u8 ssid_len = 0;
 
-	if (WARN_ON_ONCE(priv->scan_request &&
-			 priv->scan_request->n_channels > MAX_SCAN_CHANNEL))
+	if (WARN_ON(priv->scan_type == IWL_SCAN_NORMAL &&
+		    (!priv->scan_request ||
+		    priv->scan_request->n_channels > MAX_SCAN_CHANNEL)))
 		return -EINVAL;
 
 	lockdep_assert_held(&priv->mutex);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH rfc] netfilter: two xtables matches
From: Pablo Neira Ayuso @ 2012-12-07 13:20 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jan Engelhardt, netfilter-devel, netdev, Eric Dumazet,
	David Miller, Patrick McHardy
In-Reply-To: <CA+FuTSf+fh21koZRJ_ZDsr8yiiwcw2-pJ4uTyS2pOHZGdBM1-w@mail.gmail.com>

On Thu, Dec 06, 2012 at 04:12:10PM -0500, Willem de Bruijn wrote:
> On Thu, Dec 6, 2012 at 12:22 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Wed, Dec 05, 2012 at 09:00:36PM +0100, Jan Engelhardt wrote:
> >> On Wednesday 2012-12-05 20:28, Willem de Bruijn wrote:
> >>
> >> >Somehow, the first part of this email went missing. Not critical,
> >> >but for completeness:
> >> >
> >> >These two patches each add an xtables match.
> >> >
> >> >The xt_priority match is a straighforward addition in the style of
> >> >xt_mark, adding the option to filter on one more sk_buff field. I
> >> >have an immediate application for this. The amount of code (in
> >> >kernel + userspace) to add a single check proved quite large.
> >>
> >> Hm so yeah, can't we just place this in xt_mark.c?
> >
> > I don't feel this belongs to xt_mark at all.
> 
> Do you have other concerns, or can I resubmit as is for merging in a
> few days if no one raises additional issues?

In nftables we have the 'meta' extension that allows to match all
skbuff fields (among other things):

http://1984.lsi.us.es/git/nf-next/tree/net/netfilter/nft_meta.c?h=nf_tables8

I think it's the way to go so we stop adding small matches for each
skbuff field.

I don't mind the name if it's xt_skbuff or xt_meta.

^ permalink raw reply

* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Pablo Neira Ayuso @ 2012-12-07 13:16 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netfilter-devel, netdev, Eric Dumazet, David Miller, kaber
In-Reply-To: <CA+FuTSfoYmTNyJnUW4shzb7V1TXpSunbOOORdViXw4DoZ4aibA@mail.gmail.com>

On Wed, Dec 05, 2012 at 03:10:13PM -0500, Willem de Bruijn wrote:
> On Wed, Dec 5, 2012 at 2:48 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Hi Willem,
> >
> > On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
> >> A new match that executes sk_run_filter on every packet. BPF filters
> >> can access skbuff fields that are out of scope for existing iptables
> >> rules, allow more expressive logic, and on platforms with JIT support
> >> can even be faster.
> >>
> >> I have a corresponding iptables patch that takes `tcpdump -ddd`
> >> output, as used in the examples below. The two parts communicate
> >> using a variable length structure. This is similar to ebt_among,
> >> but new for iptables.
> >>
> >> Verified functionality by inserting an ip source filter on chain
> >> INPUT and an ip dest filter on chain OUTPUT and noting that ping
> >> failed while a rule was active:
> >>
> >> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
> >> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP
> >
> > I like this BPF idea for iptables.
> >
> > I made a similar extension time ago, but it was taking a file as
> > parameter. That file contained in BPF code. I made a simple bison
> > parser that takes BPF code and put it into the bpf array of
> > instructions. It would be a bit more intuitive to define a filter and
> > we can distribute it with iptables.
> 
> That's cleaner, indeed. I actually like how tcpdump operates as a
> code generator if you pass -ddd. Unfortunately, it generates code only
> for link layer types of its supported devices, such as DLT_EN10MB and
> DLT_LINUX_SLL. The network layer interface of basic iptables
> (forgetting device dependent mechanisms as used in xt_mac) is DLT_RAW,
> but that is rarely supported.

Indeed, you'll have to hack on tcpdump to select the offset. In
iptables the base is the layer 3 header. With that change you could
use tcpdump for generate code automagically from their syntax.

> > Let me check on my internal trees, I can put that user-space code
> > somewhere in case you're interested.
> 
> Absolutely. I'll be happy to revise to get it in. I'm also considering
> sending a patch to tcpdump to make it generate code independent of the
> installed hardware when specifying -y.

I found a version of the old parser code I made:

http://1984.lsi.us.es/git/nfbpf/

It interprets a filter expressed in a similar way to tcpdump -dd but
it's using the BPF constants. It's quite preliminary and simple if you
look at the code.

Extending it to interpret some syntax similar to tcpdump -d would even
make more readable the BPF filter.

Time ago I also thought about taking the kernel code that checks that
the filter is correct. Currently you get -EINVAL if you pass a
handcrafted filter which is incorrect, so it's hard task to debug what
you made wrong.

It could be added to the iptables tree. Or if generic enough for BPF
and the effort is worth, just provide some small library that iptables
can link with and a small compiler/checker to help people develop BPF
filters.

Back to your xt_bpf thing, we can use the file containing the code
instead:

iptables -v -A INPUT -m bpf --bytecode-file filter1.bpf -j DROP
iptables -v -A OUTPUT -m bpf --bytecode-file filter2.bpf -j DROP

We can still allow the inlined filter via --bytecode if you want.

^ permalink raw reply

* Re: [patch net-next] net: call notifiers for mtu change even if iface is not up
From: Jiri Pirko @ 2012-12-07 12:29 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, edumazet, bhutchings, psimerda
In-Reply-To: <20121203152250.GB21816@hmsreliant.think-freely.org>

Mon, Dec 03, 2012 at 04:22:50PM CET, nhorman@tuxdriver.com wrote:
>On Mon, Dec 03, 2012 at 03:22:29PM +0100, Jiri Pirko wrote:
>> Mon, Dec 03, 2012 at 03:18:23PM CET, nhorman@tuxdriver.com wrote:
>> >On Mon, Dec 03, 2012 at 12:16:32PM +0100, Jiri Pirko wrote:
>> >> Do the same thing as in set mac. Call notifiers every time.
>> >> 
>> >> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> >> ---
>> >>  net/core/dev.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/net/core/dev.c b/net/core/dev.c
>> >> index 2f94df2..0685a72 100644
>> >> --- a/net/core/dev.c
>> >> +++ b/net/core/dev.c
>> >> @@ -4971,7 +4971,7 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
>> >>  	else
>> >>  		dev->mtu = new_mtu;
>> >>  
>> >> -	if (!err && dev->flags & IFF_UP)
>> >> +	if (!err)
>> >>  		call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
>> >>  	return err;
>> >>  }
>> >
>> >I'm not opposed to this change, but is there something that it expressly fixes?
>> 
>> This is about a consistency. To have the same behaviour as set_mac
>> for example.
>> 
>> >While it doesn't hurt to send around mtu change events, one would presume that
>> >listeners would pick up mtu changes when the NETDEV_UP event went' around.
>> >
>> >Neil
>> >
>I figured, I just wanted to be sure that there wasn't a bug fix that needed
>documenting in the changelog, or that listeners for this event didn't expect the
>interface to already be up.  It looks pretty clean, with the exception of the
>TIPC protocol.  It appears that it might require enable_bearer to be called
>prior to any netdevice CHANGEMTU events, so that the eth_bearers array gets
>filled out, and it looks like that has to get triggered either by user space, or
>the receipt of a message over the network.  Let me know what you think

I'm looking at the code. What is the difference between CHANGEMTU and
CHANGEADDR here? It looks to me that it should work for both in the same
way.


>
>Neil
>
>> >--
>> >To unsubscribe from this list: send the line "unsubscribe netdev" in
>> >the body of a message to majordomo@vger.kernel.org
>> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 

^ permalink raw reply

* Re: [RFC PATCH v2 3/3] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-07 12:25 UTC (permalink / raw)
  To: Paul Moore; +Cc: netdev, linux-security-module, selinux, jasowang
In-Reply-To: <1761265.vEQbM1ySnW@sifl>

On Thu, Dec 06, 2012 at 04:09:51PM -0500, Paul Moore wrote:
> On Thursday, December 06, 2012 10:57:16 PM Michael S. Tsirkin wrote:
> > On Thu, Dec 06, 2012 at 11:56:45AM -0500, Paul Moore wrote:
> > > The SETQUEUE/tun_socket:create_queue permissions do not yet exist in any
> > > released SELinux policy as we are just now adding them with this patchset.
> > > With current policies loaded into a kernel with this patchset applied the
> > > SETQUEUE/tun_socket:create_queue permission would be treated according to
> > > the policy's unknown permission setting.
> > 
> > OK I think we need to rethink what we are doing here: what you sent
> > addresses the problem as stated but I think we mis-stated it.  Let me
> > try to restate the problem: it is not just selinux problem. Let's assume
> > qemu wants to use tun, I (libvirt) don't want to run it as root.
> > 
> > 1. TUNSETIFF: I can open tun, attach an fd and pass it to qemu.
> > Now, qemu does not invoke TUNSETIFF so it can run without
> > kernel priveledges.
> 
> Correct me if I'm wrong, but I believe libvirt does this while running as 
> root.  Assuming that is the case, why not simply setuid()/setgid() to the same 
> credentials as the QEMU instance before creating the TUN device?  You can 
> always (re)configure the device afterwards while running as 
> root/CAP_NET_ADMIN.

We want isolation between qemu instances.
Giving qemu right to open tun and SETIFF would give it rights
to access any tun device.

There could also be user tun users we want them isolated from qemu.

> > 2. TUNSETQUEUE - I can open tun and attach a queue but this
> > is not what is needed since this automatically switches
> > to multiqueue mode - we want to change number of queues
> > on the fly.
> > So qemu needs to be allowed to run TUNSETQUEUE.
> > Since this checks tun_not_capable(tun) we would need
> > to give qemu these priveledges, and we want to avoid this
> > (I can go into why if it's not obvious).
> 
> If libvirt creates the TUN device while its effective credentials match those 
> of the QEMU instance then the QEMU instance should be able to perform a 
> TUNSETQUEUE, yes?
> 
> > How can we slove this?
> > I don't see a way without extending the interface.
> > Here's a simple way to extend it: pass a flag to TUNSETQUEUE
> > that enables/disables TX on this queue.
> > If TX is disabled, ignore this queue for flow steering decisions.
> > Allow TUNSETQUEUE for a non priveledged user if it
> > it already bound to the currect tun and only changes this flag.
> > 
> > Now I open tun and SETQUEUE with TX disabled flag. Pass it to qemu.
> > qemu calls SETQUEUE with TX enabled flag.
> > 
> > Jason? Want to try implementing and see what people think?
> 
> -- 
> paul moore
> security and virtualization @ redhat

^ permalink raw reply

* [RFC] wireless: check against default_ethtool_ops
From: Stanislaw Gruszka @ 2012-12-07 12:16 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: Ben Greear, linux-wireless, Bjørn Mork
In-Reply-To: <87zk1q3xrv.fsf@nemi.mork.no>

Since:

commit 2c60db037034d27f8c636403355d52872da92f81
Author: Eric Dumazet <edumazet@google.com>
Date:   Sun Sep 16 09:17:26 2012 +0000

    net: provide a default dev->ethtool_ops

wireless core does not correctly assign ethtool_ops. In order to fix
the problem, and avoid assigning ethtool_ops on each individual cfg80211
driver, we check against default_ethool_ops pointer instead of NULL in
wireless core.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |    3 ++-
 net/wireless/core.c       |    2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f8eda02..c98e1c3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -60,6 +60,8 @@ struct wireless_dev;
 #define SET_ETHTOOL_OPS(netdev,ops) \
 	( (netdev)->ethtool_ops = (ops) )
 
+extern const struct ethtool_ops default_ethtool_ops;
+
 /* hardware address assignment types */
 #define NET_ADDR_PERM		0	/* address is permanent (default) */
 #define NET_ADDR_RANDOM		1	/* address is generated randomly */
diff --git a/net/core/dev.c b/net/core/dev.c
index c0946cb..4cd2168 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6008,7 +6008,8 @@ struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
 	return queue;
 }
 
-static const struct ethtool_ops default_ethtool_ops;
+const struct ethtool_ops default_ethtool_ops;
+EXPORT_SYMBOL_GPL(default_ethtool_ops);
 
 /**
  *	alloc_netdev_mqs - allocate network device
diff --git a/net/wireless/core.c b/net/wireless/core.c
index 14d9904..90915d4 100644
--- a/net/wireless/core.c
+++ b/net/wireless/core.c
@@ -866,7 +866,7 @@ static int cfg80211_netdev_notifier_call(struct notifier_block *nb,
 		/* allow mac80211 to determine the timeout */
 		wdev->ps_timeout = -1;
 
-		if (!dev->ethtool_ops)
+		if (dev->ethtool_ops == &default_ethtool_ops)
 			dev->ethtool_ops = &cfg80211_ethtool_ops;
 
 		if ((wdev->iftype == NL80211_IFTYPE_STATION ||
-- 
1.7.1

^ permalink raw reply related

* [PATCH 2/4] net: remove obsolete simple_strto<foo>
From: Abhijit Pawar @ 2012-12-07 11:49 UTC (permalink / raw)
  To: David S. Miller, Pablo Neira Ayuso, Patrick McHardy,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	John W. Linville, Johannes Berg, Cong Wang, Eric Dumazet,
	Neil Horman, Joe Perches
  Cc: netdev, linux-kernel, netfilter-devel, netfilter, coreteam,
	linux-wireless, Abhijit Pawar

This patch replace the obsolete simple_strto<foo> with kstrto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
---
 net/core/netpoll.c                 |    9 +++++++--
 net/ipv4/netfilter/ipt_CLUSTERIP.c |    9 +++++++--
 net/mac80211/debugfs_sta.c         |    4 +++-
 net/netfilter/nf_conntrack_core.c  |    6 ++++--
 4 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 77a0388..596b127 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -668,13 +668,16 @@ EXPORT_SYMBOL(netpoll_print_options);
 
 int netpoll_parse_options(struct netpoll *np, char *opt)
 {
+	int rc;
 	char *cur=opt, *delim;
 
 	if (*cur != '@') {
 		if ((delim = strchr(cur, '@')) == NULL)
 			goto parse_failed;
 		*delim = 0;
-		np->local_port = simple_strtol(cur, NULL, 10);
+		rc = kstrtol(cur, 10, &np->local_port);
+		if (rc)
+			goto parse_failed;
 		cur = delim;
 	}
 	cur++;
@@ -705,7 +708,9 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
 		*delim = 0;
 		if (*cur == ' ' || *cur == '\t')
 			np_info(np, "warning: whitespace is not allowed\n");
-		np->remote_port = simple_strtol(cur, NULL, 10);
+		rc = kstrtol(cur, 10, &np->remote_port);
+		if (rc)
+			goto parse_failed;
 		cur = delim;
 	}
 	cur++;
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index fe5daea..55e7b73 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -661,6 +661,7 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
 #define PROC_WRITELEN	10
 	char buffer[PROC_WRITELEN+1];
 	unsigned long nodenum;
+	int rc;
 
 	if (size > PROC_WRITELEN)
 		return -EIO;
@@ -669,11 +670,15 @@ static ssize_t clusterip_proc_write(struct file *file, const char __user *input,
 	buffer[size] = 0;
 
 	if (*buffer == '+') {
-		nodenum = simple_strtoul(buffer+1, NULL, 10);
+		rc = kstrtoul(buffer+1, 10, &nodenum);
+		if (rc)
+			return -EINVAL;
 		if (clusterip_add_node(c, nodenum))
 			return -ENOMEM;
 	} else if (*buffer == '-') {
-		nodenum = simple_strtoul(buffer+1, NULL,10);
+		rc = kstrtoul(buffer+1, 10, &nodenum);
+		if (rc)
+			return -EINVAL;
 		if (clusterip_del_node(c, nodenum))
 			return -ENOENT;
 	} else
diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
index 89281d2..18754fd 100644
--- a/net/mac80211/debugfs_sta.c
+++ b/net/mac80211/debugfs_sta.c
@@ -219,7 +219,9 @@ static ssize_t sta_agg_status_write(struct file *file, const char __user *userbu
 	} else
 		return -EINVAL;
 
-	tid = simple_strtoul(buf, NULL, 0);
+	ret = kstrtoul(buf, 0, &tid);
+	if (ret)
+		return -EINVAL;
 
 	if (tid >= IEEE80211_NUM_TIDS)
 		return -EINVAL;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index af17516..18ce24b 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1409,7 +1409,7 @@ EXPORT_SYMBOL_GPL(nf_ct_alloc_hashtable);
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 {
-	int i, bucket;
+	int i, bucket, rc;
 	unsigned int hashsize, old_size;
 	struct hlist_nulls_head *hash, *old_hash;
 	struct nf_conntrack_tuple_hash *h;
@@ -1422,7 +1422,9 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	if (!nf_conntrack_htable_size)
 		return param_set_uint(val, kp);
 
-	hashsize = simple_strtoul(val, NULL, 0);
+	rc = kstrtouint(val, 0, &hashsize);
+	if (rc)
+		return -EINVAL;
 	if (!hashsize)
 		return -EINVAL;
 
-- 
1.7.7.6

^ permalink raw reply related

* [patch v2] bridge: make buffer larger in br_setlink()
From: Dan Carpenter @ 2012-12-07 11:10 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Thomas Graf, netdev, bridge, kernel-janitors, David S. Miller
In-Reply-To: <20121207093107.GA2996@casper.infradead.org>

We pass IFLA_BRPORT_MAX to nla_parse_nested() so we need
IFLA_BRPORT_MAX + 1 elements.  Also Smatch complains that we read past
the end of the array when in br_set_port_flag() when it's called with
IFLA_BRPORT_FAST_LEAVE.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
v2: Style tweak.

Only needed in linux-next.

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 850b7d1..cfc5cfe 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -239,7 +239,7 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
 	struct ifinfomsg *ifm;
 	struct nlattr *protinfo;
 	struct net_bridge_port *p;
-	struct nlattr *tb[IFLA_BRPORT_MAX];
+	struct nlattr *tb[IFLA_BRPORT_MAX + 1];
 	int err;
 
 	ifm = nlmsg_data(nlh);

^ permalink raw reply related

* Re: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: Thomas Graf @ 2012-12-07 10:58 UTC (permalink / raw)
  To: David Laight; +Cc: Nicolas Dichtel, David Miller, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70E4@saturn3.aculab.com>

On 12/07/12 at 10:38am, David Laight wrote:
> What are you going to align the data with respect to?
> I doubt you can assume that the start of the netlink
> message itself is 8 byte aligned - so any attempt
> to 8 byte align an item is probably doomed to failure.

Correct me if I'm wrong but skb->head will be aligned to
SKB_DATA_ALIGN() which as I understand is guaranted to
be 8 bytse aligned. So we can just add needed padding if
skb->data would not be aligned correctly after the next
netlink attribute header was added.

All user space needs to do is use a receive buffer that
is 8 byte aligned as well and we are good to go.

^ permalink raw reply

* RE: [net-next 2/7] bna: Tx and Rx Optimizations
From: David Laight @ 2012-12-07 10:46 UTC (permalink / raw)
  To: Rasesh Mody, davem, netdev; +Cc: bhutchings, adapter_linux_open_src_team
In-Reply-To: <1354835868-27016-3-git-send-email-rmody@brocade.com>

>  #define BNA_QE_INDX_ADD(_qe_idx, _qe_num, _q_depth)			\
>  	((_qe_idx) = ((_qe_idx) + (_qe_num)) & ((_q_depth) - 1))
> 
> +#define BNA_QE_INDX_INC(_idx, _q_depth)					\
> +do {									\
> +	(_idx)++;							\
> +	if ((_idx) == (_q_depth))					\
> +		(_idx) = 0;						\
> +} while (0)
> +

If q_depth has to be a power of 2 (implied by BNA_QE_IND_ADD())
then you should mask in BNA_QE_INDX_INC() to save the conditional.
Or just:
#define BNA_QE_INDX_INC(_idx, _q_depth) BNA_QE_INDX_ADD(_idx, 1, _q_depth)

	David

^ permalink raw reply

* loan offer
From: SEC Capitals Loan @ 2012-12-07  9:40 UTC (permalink / raw)




Loan Offer at 3%, Feel Free to REPLY back to us for more info

^ permalink raw reply

* Re: [PATCH] net: core: fix unused variable sparse warning
From: Shan Wei @ 2012-12-07 10:45 UTC (permalink / raw)
  To: Cong Ding
  Cc: David S. Miller, Eric W. Biederman, Eric Dumazet, Pavel Emelyanov,
	netdev, linux-kernel
In-Reply-To: <20121207001545.GE2510@gmail.com>

Cong Ding said, at 2012/12/7 8:15:
> On Fri, Dec 07, 2012 at 12:06:44AM +0000, Cong Ding wrote:
>> the variables zero and unres_qlen_max are only used when CONFIG_SYSCTL is
>> defined, otherwise it causes the following sparse warning when we turn on
>> CONFIG_SYSCTL.
> sorry for disturbing again, the sparse warning is
> net/core/neighbour.c:65:12: warning: ‘zero’ defined but not used [-Wunused-variable]
> net/core/neighbour.c:66:12: warning: ‘unres_qlen_max’ defined but not used [-Wunused-variable]
> 
> it happens if we turn off CONFIG_SYSCTL.

It has been fixed yesterday.
see b93196dc5af7729ff7cc50d3d322ab1a364aa14f

> 
>>
>> Signed-off-by: Cong Ding <dinggnu@gmail.com>
>> ---
>>  net/core/neighbour.c |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 36fc692..4a15278 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -62,8 +62,10 @@ static void __neigh_notify(struct neighbour *n, int type, int flags);
>>  static void neigh_update_notify(struct neighbour *neigh);
>>  static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
>>  
>> +#ifdef CONFIG_SYSCTL
>>  static int zero;
>>  static int unres_qlen_max = INT_MAX / SKB_TRUESIZE(ETH_FRAME_LEN);
>> +#endif
>>  
>>  static struct neigh_table *neigh_tables;
>>  #ifdef CONFIG_PROC_FS
>> -- 
>> 1.7.4.5
>>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* RE: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: David Laight @ 2012-12-07 10:38 UTC (permalink / raw)
  To: Thomas Graf, Nicolas Dichtel; +Cc: David Miller, netdev
In-Reply-To: <20121206174905.GC16122@casper.infradead.org>

> On 12/06/12 at 09:43am, Nicolas Dichtel wrote:
> > Le 05/12/2012 18:54, David Miller a écrit :
> > >From: "David Laight" <David.Laight@ACULAB.COM>
> > >Date: Wed, 5 Dec 2012 11:41:33 -0000
> > >
> > >>Probably worth commenting that the 64bit items might only be 32bit aligned.
> > >>Just to stop anyone trying to read/write them with pointer casts.
> > >
> > >Rather, let's not create this situation at all.
> > >
> > >It's totally inappropriate to have special code to handle every single
> > >time we want to put 64-bit values into netlink messages.
> > >
> > >We need a real solution to this issue.
> > >
> > The easiest way is to update *_ALIGNTO values (maybe we can keep
> > NLMSG_ALIGNTO to 4). But I think that many userland apps have these
> > values hardcoded and, the most important thing, this may increase
> > size of many netlink messages. Hence we need probably to find
> > something better.
> 
> We can't do this, as you say, ALIGNTO is compiled into all the
> binaries.
> 
> A simple backwards compatible workaround would be to include an
> unknown, empty padding attribute if needed. That would be 4 bytes
> in size and could be used to include padding as needed.

What are you going to align the data with respect to?
I doubt you can assume that the start of the netlink
message itself is 8 byte aligned - so any attempt
to 8 byte align an item is probably doomed to failure.

	David

^ permalink raw reply

* Re: [PATCH net-next v5] bridge: export multicast database via netlink
From: Thomas Graf @ 2012-12-07 10:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bridge, Herbert Xu, Stephen Hemminger, David S. Miller,
	Jesper Dangaard Brouer
In-Reply-To: <1354874688-24564-1-git-send-email-amwang@redhat.com>

On 12/07/12 at 06:04pm, Cong Wang wrote:
> From: Cong Wang <amwang@redhat.com>
> 
> V5: fix two bugs pointed out by Thomas
>     remove seq check for now, mark it as TODO
> 
> V4: remove some useless #include
>     some coding style fix
> 
> V3: drop debugging printk's
>     update selinux perm table as well
> 
> V2: drop patch 1/2, export ifindex directly
>     Redesign netlink attributes
>     Improve netlink seq check
>     Handle IPv6 addr as well
> 
> This patch exports bridge multicast database via netlink
> message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
> We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
> 
> (Thanks to Thomas for patient reviews)
> 
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Cong Wang <amwang@redhat.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply

* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Ben Hutchings @ 2012-12-07 10:07 UTC (permalink / raw)
  To: Joseph Gasparakis
  Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
	saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-2-git-send-email-joseph.gasparakis@intel.com>

On Thu, 2012-12-06 at 17:56 -0800, Joseph Gasparakis wrote:
> This patch adds support in the kernel for offloading in the NIC Tx and Rx
> checksumming for encapsulated packets (such as VXLAN and IP GRE).
[...]
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1063,6 +1063,8 @@ struct net_device {
>  	netdev_features_t	wanted_features;
>  	/* mask of features inheritable by VLAN devices */
>  	netdev_features_t	vlan_features;
> +	/* mask of features inherited by encapsulating devices */
> +	netdev_features_t	hw_enc_features;
[...]

How will the networking core know *which* encapsulations this applies
to?  I notice that your implementation in ixgbe does not set
NETIF_F_HW_CSUM here, so presumably the hardware will parse headers to
find which ranges should be checksummed and it won't cover the next
encapsulation protocol that comes along.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH net-next v5] bridge: export multicast database via netlink
From: Cong Wang @ 2012-12-07 10:04 UTC (permalink / raw)
  To: netdev
  Cc: Cong Wang, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Thomas Graf, Stephen Hemminger, David S. Miller

From: Cong Wang <amwang@redhat.com>

V5: fix two bugs pointed out by Thomas
    remove seq check for now, mark it as TODO

V4: remove some useless #include
    some coding style fix

V3: drop debugging printk's
    update selinux perm table as well

V2: drop patch 1/2, export ifindex directly
    Redesign netlink attributes
    Improve netlink seq check
    Handle IPv6 addr as well

This patch exports bridge multicast database via netlink
message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).

(Thanks to Thomas for patient reviews)

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
    
---
 include/uapi/linux/if_bridge.h |   55 ++++++++++++++
 include/uapi/linux/rtnetlink.h |    3 +
 net/bridge/Makefile            |    2 +-
 net/bridge/br_mdb.c            |  163 ++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_multicast.c      |    1 +
 net/bridge/br_private.h        |    1 +
 security/selinux/nlmsgtab.c    |    1 +
 7 files changed, 225 insertions(+), 1 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index b388579..9a0f6ff 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -116,4 +116,59 @@ enum {
 	__IFLA_BRIDGE_MAX,
 };
 #define IFLA_BRIDGE_MAX (__IFLA_BRIDGE_MAX - 1)
+
+/* Bridge multicast database attributes
+ * [MDBA_MDB] = {
+ *     [MDBA_MDB_ENTRY] = {
+ *         [MDBA_MDB_ENTRY_INFO]
+ *     }
+ * }
+ * [MDBA_ROUTER] = {
+ *    [MDBA_ROUTER_PORT]
+ * }
+ */
+enum {
+	MDBA_UNSPEC,
+	MDBA_MDB,
+	MDBA_ROUTER,
+	__MDBA_MAX,
+};
+#define MDBA_MAX (__MDBA_MAX - 1)
+
+enum {
+	MDBA_MDB_UNSPEC,
+	MDBA_MDB_ENTRY,
+	__MDBA_MDB_MAX,
+};
+#define MDBA_MDB_MAX (__MDBA_MDB_MAX - 1)
+
+enum {
+	MDBA_MDB_ENTRY_UNSPEC,
+	MDBA_MDB_ENTRY_INFO,
+	__MDBA_MDB_ENTRY_MAX,
+};
+#define MDBA_MDB_ENTRY_MAX (__MDBA_MDB_ENTRY_MAX - 1)
+
+enum {
+	MDBA_ROUTER_UNSPEC,
+	MDBA_ROUTER_PORT,
+	__MDBA_ROUTER_MAX,
+};
+#define MDBA_ROUTER_MAX (__MDBA_ROUTER_MAX - 1)
+
+struct br_port_msg {
+	__u32 ifindex;
+};
+
+struct br_mdb_entry {
+	__u32 ifindex;
+	struct {
+		union {
+			__be32	ip4;
+			struct in6_addr ip6;
+		} u;
+		__be16		proto;
+	} addr;
+};
+
 #endif /* _UAPI_LINUX_IF_BRIDGE_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 33d29ce..354a1e7 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -125,6 +125,9 @@ enum {
 	RTM_GETNETCONF = 82,
 #define RTM_GETNETCONF RTM_GETNETCONF
 
+	RTM_GETMDB = 86,
+#define RTM_GETMDB RTM_GETMDB
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/bridge/Makefile b/net/bridge/Makefile
index d0359ea..e859098 100644
--- a/net/bridge/Makefile
+++ b/net/bridge/Makefile
@@ -12,6 +12,6 @@ bridge-$(CONFIG_SYSFS) += br_sysfs_if.o br_sysfs_br.o
 
 bridge-$(CONFIG_BRIDGE_NETFILTER) += br_netfilter.o
 
-bridge-$(CONFIG_BRIDGE_IGMP_SNOOPING) += br_multicast.o
+bridge-$(CONFIG_BRIDGE_IGMP_SNOOPING) += br_multicast.o br_mdb.o
 
 obj-$(CONFIG_BRIDGE_NF_EBTABLES) += netfilter/
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
new file mode 100644
index 0000000..edc0d73
--- /dev/null
+++ b/net/bridge/br_mdb.c
@@ -0,0 +1,163 @@
+#include <linux/err.h>
+#include <linux/igmp.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/rculist.h>
+#include <linux/skbuff.h>
+#include <net/ip.h>
+#include <net/netlink.h>
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ipv6.h>
+#endif
+
+#include "br_private.h"
+
+static int br_rports_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
+			       struct net_device *dev)
+{
+	struct net_bridge *br = netdev_priv(dev);
+	struct net_bridge_port *p;
+	struct hlist_node *n;
+	struct nlattr *nest;
+
+	if (!br->multicast_router || hlist_empty(&br->router_list))
+		return 0;
+
+	nest = nla_nest_start(skb, MDBA_ROUTER);
+	if (nest == NULL)
+		return -EMSGSIZE;
+
+	hlist_for_each_entry_rcu(p, n, &br->router_list, rlist) {
+		if (p && nla_put_u32(skb, MDBA_ROUTER_PORT, p->dev->ifindex))
+			goto fail;
+	}
+
+	nla_nest_end(skb, nest);
+	return 0;
+fail:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
+			    struct net_device *dev)
+{
+	struct net_bridge *br = netdev_priv(dev);
+	struct net_bridge_mdb_htable *mdb;
+	struct nlattr *nest, *nest2;
+	int i, err = 0;
+	int idx = 0, s_idx = cb->args[1];
+
+	if (br->multicast_disabled)
+		return 0;
+
+	mdb = rcu_dereference(br->mdb);
+	if (!mdb)
+		return 0;
+
+	nest = nla_nest_start(skb, MDBA_MDB);
+	if (nest == NULL)
+		return -EMSGSIZE;
+
+	for (i = 0; i < mdb->max; i++) {
+		struct hlist_node *h;
+		struct net_bridge_mdb_entry *mp;
+		struct net_bridge_port_group *p, **pp;
+		struct net_bridge_port *port;
+
+		hlist_for_each_entry_rcu(mp, h, &mdb->mhash[i], hlist[mdb->ver]) {
+			if (idx < s_idx)
+				goto skip;
+
+			nest2 = nla_nest_start(skb, MDBA_MDB_ENTRY);
+			if (nest2 == NULL) {
+				err = -EMSGSIZE;
+				goto out;
+			}
+
+			for (pp = &mp->ports;
+			     (p = rcu_dereference(*pp)) != NULL;
+			      pp = &p->next) {
+				port = p->port;
+				if (port) {
+					struct br_mdb_entry e;
+					e.ifindex = port->dev->ifindex;
+					e.addr.u.ip4 = p->addr.u.ip4;
+#if IS_ENABLED(CONFIG_IPV6)
+					e.addr.u.ip6 = p->addr.u.ip6;
+#endif
+					e.addr.proto = p->addr.proto;
+					if (nla_put(skb, MDBA_MDB_ENTRY_INFO, sizeof(e), &e)) {
+						nla_nest_cancel(skb, nest2);
+						err = -EMSGSIZE;
+						goto out;
+					}
+				}
+			}
+			nla_nest_end(skb, nest2);
+		skip:
+			idx++;
+		}
+	}
+
+out:
+	cb->args[1] = idx;
+	nla_nest_end(skb, nest);
+	return err;
+}
+
+static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net_device *dev;
+	struct net *net = sock_net(skb->sk);
+	struct nlmsghdr *nlh = NULL;
+	int idx = 0, s_idx;
+
+	s_idx = cb->args[0];
+
+	rcu_read_lock();
+
+	/* TODO: in case of rehashing, we need to check
+	 * consistency for dumping.
+	 */
+	cb->seq = net->dev_base_seq;
+
+	for_each_netdev_rcu(net, dev) {
+		if (dev->priv_flags & IFF_EBRIDGE) {
+			struct br_port_msg *bpm;
+
+			if (idx < s_idx)
+				goto skip;
+
+			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
+					cb->nlh->nlmsg_seq, RTM_GETMDB,
+					sizeof(*bpm), NLM_F_MULTI);
+			if (nlh == NULL)
+				break;
+
+			bpm = nlmsg_data(nlh);
+			bpm->ifindex = dev->ifindex;
+			if (br_mdb_fill_info(skb, cb, dev) < 0)
+				goto out;
+			if (br_rports_fill_info(skb, cb, dev) < 0)
+				goto out;
+
+			cb->args[1] = 0;
+			nlmsg_end(skb, nlh);
+		skip:
+			idx++;
+		}
+	}
+
+out:
+	if (nlh)
+		nlmsg_end(skb, nlh);
+	rcu_read_unlock();
+	cb->args[0] = idx;
+	return skb->len;
+}
+
+void br_mdb_init(void)
+{
+	rtnl_register(PF_BRIDGE, RTM_GETMDB, NULL, br_mdb_dump, NULL);
+}
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index a2a7a1a..68e375a 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1605,6 +1605,7 @@ void br_multicast_init(struct net_bridge *br)
 		    br_multicast_querier_expired, (unsigned long)br);
 	setup_timer(&br->multicast_query_timer, br_multicast_query_expired,
 		    (unsigned long)br);
+	br_mdb_init();
 }
 
 void br_multicast_open(struct net_bridge *br)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index cd86222..ae0a6ec 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -433,6 +433,7 @@ extern int br_multicast_set_port_router(struct net_bridge_port *p,
 extern int br_multicast_toggle(struct net_bridge *br, unsigned long val);
 extern int br_multicast_set_querier(struct net_bridge *br, unsigned long val);
 extern int br_multicast_set_hash_max(struct net_bridge *br, unsigned long val);
+extern void br_mdb_init(void);
 
 static inline bool br_multicast_is_router(struct net_bridge *br)
 {
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index d309e7f..163aaa7 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -67,6 +67,7 @@ static struct nlmsg_perm nlmsg_route_perms[] =
 	{ RTM_GETADDRLABEL,	NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_GETDCB,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_SETDCB,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETMDB,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 };
 
 static struct nlmsg_perm nlmsg_tcpdiag_perms[] =

^ permalink raw reply related

* Re: [PATCHv2] smsc: RFC: Workaround for problems with lan8710 phy auto MDI-X
From: Peter Turczak @ 2012-12-07  9:58 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: David Miller, Otavio Salvador, Javier Martinez Canillas,
	Christian Hohnstaedt, Jiri Kosina
In-Reply-To: <alpine.LNX.2.00.1211301622110.25639@pobox.suse.cz>

On Nov 30, 2012, at 4:23 PM, Jiri Kosina <jkosina@suse.cz> wrote:

> I am not sure whether compile-time option for something like this is 
> appropriate. Kernel module parameter, perhaps?
> 
> Of course it'd be far better if faulty hardware can be autodetected in 
> runtime.
Thanks for the input. Currently only the symbol error counter seems to give 
a good indication that there might be a problem. So I suggest monitoring
the symbol error counter. When a certain amount of symbol errors per
poll interval is exceeded the auto MDI-X will be disabled.

BTW: is it okay to hook into read_status or should I use work queues?

Signed-off-by: Peter Turczak <pt@netconsequence.de>
---
drivers/net/phy/smsc.c  |   57 ++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/smscphy.h |    7 +++++
2 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 88e3991..0748266 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -24,6 +24,16 @@
#include <linux/netdevice.h>
#include <linux/smscphy.h>

+/* Maximum number of symbol errors between two lan8720_read_status calls. */
+#define LAN8720_MAX_SYM_ERR_CNT 100
+
+static struct phy_driver lan8710_driver;
+
+struct smsc_phy_private {
+	/* Keeps track of the number of the broken received packets */
+	int sym_err_count;
+};
+
static int smsc_phy_config_intr(struct phy_device *phydev)
{
	int rc = phy_write (phydev, MII_LAN83C185_IM,
@@ -58,6 +68,7 @@ static int smsc_phy_config_init(struct phy_device *phydev)

static int lan87xx_config_init(struct phy_device *phydev)
{
+	struct smsc_phy_private *privdata;
	/*
	 * Make sure the EDPWRDOWN bit is NOT set. Setting this bit on
	 * LAN8710/LAN8720 PHY causes the PHY to misbehave, likely due
@@ -79,6 +90,12 @@ static int lan87xx_config_init(struct phy_device *phydev)
	if (rc < 0)
		return rc;

+	privdata = kzalloc(sizeof(*privdata), GFP_KERNEL);
+	if (!privdata)
+		return -ENOMEM;
+
+	phydev->priv = privdata;
+
	return smsc_phy_ack_interrupt(phydev);
}

@@ -87,6 +104,44 @@ static int lan911x_config_init(struct phy_device *phydev)
	return smsc_phy_ack_interrupt(phydev);
}

+int lan8720_read_status(struct phy_device *phydev)
+{
+	int err_count, err_since_last, rc;
+	struct smsc_phy_private *priv = phydev->priv;
+
+	if (priv != NULL) {
+
+		err_count = phy_read(phydev, MII_LAN8710_SYM_ERR_CNT);
+		err_since_last = err_count - priv->sym_err_count;
+
+		if (err_since_last < 0)
+				err_since_last += 65535;
+
+		priv->sym_err_count = err_count;
+		if (err_since_last > LAN8720_MAX_SYM_ERR_CNT) {
+			rc = phy_read(phydev, MII_LAN8710_SCSI);
+
+			if (rc < 0)
+				return rc;
+
+			if (!(rc & MII_LAN8710_SCSI_AMDIXCTRL)) {
+
+				pr_warn("%s: Too may RX errors.",
+					phydev->bus->name);
+				pr_warn("Disabling MDI-X\n");
+
+				rc = phy_write(phydev, MII_LAN8710_SCSI,
+					rc | MII_LAN8710_SCSI_AMDIXCTRL);
+				}
+
+			if (rc < 0)
+				return rc;
+			}
+	}
+
+	return genphy_read_status(phydev);
+}
+
static struct phy_driver smsc_phy_driver[] = {
{
	.phy_id		= 0x0007c0a0, /* OUI=0x00800f, Model#=0x0a */
@@ -187,7 +242,7 @@ static struct phy_driver smsc_phy_driver[] = {

	/* basic functions */
	.config_aneg	= genphy_config_aneg,
-	.read_status	= genphy_read_status,
+	.read_status	= lan8720_read_status,
	.config_init	= lan87xx_config_init,

	/* IRQ related */
diff --git a/include/linux/smscphy.h b/include/linux/smscphy.h
index ce718cb..a0d3893 100644
--- a/include/linux/smscphy.h
+++ b/include/linux/smscphy.h
@@ -22,4 +22,11 @@
#define MII_LAN83C185_EDPWRDOWN (1 << 13) /* EDPWRDOWN */
#define MII_LAN83C185_ENERGYON  (1 << 1)  /* ENERGYON */

+#define MII_LAN8710_SCSI  27 /* Special Control/Status register */
+
+#define MII_LAN8710_SCSI_AMDIXCTRL (1<<15) /* Flag to disable Auto-MDIX */
+
+#define MII_LAN8710_SYM_ERR_CNT	26 /* Amount of invalid code symbols received */
+
+
#endif /* __LINUX_SMSCPHY_H__ */
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH net-next v4] bridge: export multicast database via netlink
From: Cong Wang @ 2012-12-07  9:32 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Stephen Hemminger, David S. Miller
In-Reply-To: <20121207084833.GH16122@casper.infradead.org>

On Fri, 2012-12-07 at 08:48 +0000, Thomas Graf wrote:
> On 12/07/12 at 11:23am, Cong Wang wrote:
> > +static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
> > +{
> > +	struct net_device *dev;
> > +	struct net *net = sock_net(skb->sk);
> > +	struct nlmsghdr *nlh;
> 
> Set nlh = NULL
> 
> > +	int idx = 0, s_idx;
> > +
> > +	s_idx = cb->args[0];
> > +
> > +	rcu_read_lock();
> > +
> > +	for_each_netdev_rcu(net, dev) {
> > +		if (dev->priv_flags & IFF_EBRIDGE) {
> > +			struct br_port_msg *bpm;
> > +
> > +			if (idx < s_idx)
> > +				goto skip;
> > +
> > +			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> > +					cb->nlh->nlmsg_seq, RTM_GETMDB,
> > +					sizeof(*bpm), NLM_F_MULTI);
> > +			if (nlh == NULL)
> > +				break;
> > +
> > +			bpm = nlmsg_data(nlh);
> > +			bpm->ifindex = dev->ifindex;
> > +			if (br_mdb_fill_info(skb, cb, dev) < 0)
> > +				goto out;
> > +			if (br_rports_fill_info(skb, cb, dev) < 0)
> > +				goto out;
> 
> You need to reset cb->args[1] to 0 here so that when you process the
> next mdb it will not skip any entries.

Good catch! Will fix it.

> 
> > +
> > +			nlmsg_end(skb, nlh);
> > +		skip:
> > +			idx++;
> > +		}
> > +	}
> > +
> > +out:
> 
> You need to call nlmsg_end(skb, nlh) here if nlh != NULL
> because you need to finalize the message in case you come
> from the "goto out" above. Otherwise your partial message
> is corrupt.

Right... I missed this case.

> 
> > +	cb->seq = cb->args[2];
> 
> This can't possibly work if you have multiple bridges unless
> all of them have an identical mdb->seq.
> 

I thought sequence checking is per-message, so I must be wrong, it seems
to be per-dump, then we will need a global seq for all the bridges.

> Maybe leave the consistent dumping problem out for now and just
> set cb->seq = net->dev_base_seq so that you at least cover all
> bridges.
> 
> We don't need to guarantee that no rehash has happened throughout
> the dump, we only need to ensure that no rehash happnened if a
> bridge required more than one netlink message. You could store
> mdb->seq in cb->args[3] and compare it with the current mdb->seq
> after br_rports_fill_info() finished, if they differ you could
> just cb->seq++. I suggst you leave this out for now and work on this
> in a follow-up patch to not complicate this any further.

There is one message per-bridge, but maybe more messages per-dump.

Anyway, I will leave this out for now and put some comment on the code
saying we need to improve this.

Thanks for your review!

^ permalink raw reply

* Re: [patch] bridge: make buffer larger in br_setlink()
From: Thomas Graf @ 2012-12-07  9:31 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Stephen Hemminger, David S. Miller, bridge, netdev,
	kernel-janitors
In-Reply-To: <20121207061854.GB18220@elgon.mountain>

On 12/07/12 at 09:18am, Dan Carpenter wrote:
> __IFLA_BRPORT_MAX is one larger than IFLA_BRPORT_MAX.  We pass
> IFLA_BRPORT_MAX to nla_parse_nested() so we need IFLA_BRPORT_MAX + 1
> elements.  Also Smatch complains that we read past the end of the array
> when in br_set_port_flag() when it's called with IFLA_BRPORT_FAST_LEAVE.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> Only needed in linux-next.
> 
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index 850b7d1..cfc5cfe 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -239,7 +239,7 @@ int br_setlink(struct net_device *dev, struct nlmsghdr *nlh)
>  	struct ifinfomsg *ifm;
>  	struct nlattr *protinfo;
>  	struct net_bridge_port *p;
> -	struct nlattr *tb[IFLA_BRPORT_MAX];
> +	struct nlattr *tb[__IFLA_BRPORT_MAX];
>  	int err;
>  
>  	ifm = nlmsg_data(nlh);

I know it's nitpicking but could you use IFLA_BRPORT_MAX+1 for
consistency?

^ permalink raw reply

* Re: [PATCH RFC 0/5] Containerize syslog
From: Andrew Morton @ 2012-12-07  9:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rui Xiang, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <874nklkjjm.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

On Mon, 19 Nov 2012 01:51:09 -0800 ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) wrote:

> Are there any kernel print statements besides networking stack printks
> that we want to move to show up in a new "kernel log" namespace?

That's a good question, and afaict it remains unanswered.

As so often happens, this patchset's changelogs forgot to describe the
reason for the existence of this patchset.  Via a bit of lwn reading
and my awesome telepathic skills, I divine that something in networking
is using syslog for kernel->userspace communications.

wtf?

Wouldn't it be better to just stop doing that, and to implement a
respectable and reliable kernel->userspace messaging scheme?

And leave syslog alone - it's a crude low-level thing for random
unexpected things which operators might want to know about.

^ permalink raw reply

* Re: [PATCH net-next v4] bridge: export multicast database via netlink
From: Thomas Graf @ 2012-12-07  8:48 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, bridge, Herbert Xu, Stephen Hemminger, David S. Miller,
	Jesper Dangaard Brouer
In-Reply-To: <1354850623-31652-1-git-send-email-amwang@redhat.com>

On 12/07/12 at 11:23am, Cong Wang wrote:
> +static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
> +{
> +	struct net_device *dev;
> +	struct net *net = sock_net(skb->sk);
> +	struct nlmsghdr *nlh;

Set nlh = NULL

> +	int idx = 0, s_idx;
> +
> +	s_idx = cb->args[0];
> +
> +	rcu_read_lock();
> +
> +	for_each_netdev_rcu(net, dev) {
> +		if (dev->priv_flags & IFF_EBRIDGE) {
> +			struct br_port_msg *bpm;
> +
> +			if (idx < s_idx)
> +				goto skip;
> +
> +			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> +					cb->nlh->nlmsg_seq, RTM_GETMDB,
> +					sizeof(*bpm), NLM_F_MULTI);
> +			if (nlh == NULL)
> +				break;
> +
> +			bpm = nlmsg_data(nlh);
> +			bpm->ifindex = dev->ifindex;
> +			if (br_mdb_fill_info(skb, cb, dev) < 0)
> +				goto out;
> +			if (br_rports_fill_info(skb, cb, dev) < 0)
> +				goto out;

You need to reset cb->args[1] to 0 here so that when you process the
next mdb it will not skip any entries.

> +
> +			nlmsg_end(skb, nlh);
> +		skip:
> +			idx++;
> +		}
> +	}
> +
> +out:

You need to call nlmsg_end(skb, nlh) here if nlh != NULL
because you need to finalize the message in case you come
from the "goto out" above. Otherwise your partial message
is corrupt.

> +	cb->seq = cb->args[2];

This can't possibly work if you have multiple bridges unless
all of them have an identical mdb->seq.

Maybe leave the consistent dumping problem out for now and just
set cb->seq = net->dev_base_seq so that you at least cover all
bridges.

We don't need to guarantee that no rehash has happened throughout
the dump, we only need to ensure that no rehash happnened if a
bridge required more than one netlink message. You could store
mdb->seq in cb->args[3] and compare it with the current mdb->seq
after br_rports_fill_info() finished, if they differ you could
just cb->seq++. I suggst you leave this out for now and work on this
in a follow-up patch to not complicate this any further.

^ permalink raw reply

* [PATCH 2/2] ping6: Fix -F switch.
From: Jan Synacek @ 2012-12-07  8:25 UTC (permalink / raw)
  To: yoshfuji; +Cc: netdev, Jan Synacek
In-Reply-To: <1354868724-15549-1-git-send-email-jsynacek@redhat.com>

Even when the flowlabel is set correctly, ping6 exits with a warning. For some
reason, the errno is set when it should not be.

Signed-off-by: Jan Synacek <jsynacek@redhat.com>
---
 ping6.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ping6.c b/ping6.c
index 358a035..85d3782 100644
--- a/ping6.c
+++ b/ping6.c
@@ -725,7 +725,8 @@ int main(int argc, char *argv[])
 		switch(ch) {
 		case 'F':
 			flowlabel = hextoui(optarg);
-			if (errno || (flowlabel & ~IPV6_FLOWINFO_FLOWLABEL)) {
+			if ((flowlabel < 0 && errno) ||
+			    (flowlabel & ~IPV6_FLOWINFO_FLOWLABEL)) {
 				fprintf(stderr, "ping: Invalid flowinfo %s\n", optarg);
 				exit(2);
 			}
-- 
1.8.0.1

^ permalink raw reply related

* [PATCH 1/2] ninfod: Fix more unused variables.
From: Jan Synacek @ 2012-12-07  8:25 UTC (permalink / raw)
  To: yoshfuji; +Cc: netdev, Jan Synacek
In-Reply-To: <1354868724-15549-1-git-send-email-jsynacek@redhat.com>

| ni_ifaddrs.c: In function ‘ni_ifaddrs’:
| ni_ifaddrs.c:389:14: warning: variable ‘nlm_scope’ set but not used [-Wunused-but-set-variable]
| ni_ifaddrs.c:353:13: warning: variable ‘ifflist’ set but not used [-Wunused-but-set-variable]
| ni_ifaddrs.c:322:6: warning: variable ‘result’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Jan Synacek <jsynacek@redhat.com>
---
 ninfod/ni_ifaddrs.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/ninfod/ni_ifaddrs.c b/ninfod/ni_ifaddrs.c
index 4225a5a..0820497 100644
--- a/ninfod/ni_ifaddrs.c
+++ b/ninfod/ni_ifaddrs.c
@@ -319,7 +319,6 @@ int ni_ifaddrs(struct ni_ifaddrs **ifap, sa_family_t family)
 
 	pid_t pid = getpid();
 	int seq = 0;
-	int result;
 	int build;		/* 0 or 1 */
 
 /* ---------------------------------- */
@@ -350,7 +349,6 @@ int ni_ifaddrs(struct ni_ifaddrs **ifap, sa_family_t family)
 		struct ni_ifaddrs *ifl = NULL, *ifa = NULL;
 		struct nlmsghdr *nlh, *nlh0;
 		void *data = NULL, *xdata = NULL;
-		uint16_t *ifflist = NULL;
 #ifndef IFA_LOCAL
 		struct rtmaddr_ifamap ifamap;
 #endif
@@ -362,18 +360,15 @@ int ni_ifaddrs(struct ni_ifaddrs **ifap, sa_family_t family)
 				*ifap = ifa;
 			else {
 				free_data(data);
-				result = 0;
 				break;
 			}
 			if (data == NULL) {
 				free_data(data);
-				result = -1;
 				break;
 			}
 			ifl = NULL;
 			data += NLMSG_ALIGN(sizeof(struct ni_ifaddrs)) * icnt;
 			xdata = data + dlen;
-			ifflist = xdata + xlen;
 		}
 
 		for (nlm = nlmsg_list; nlm; nlm = nlm->nlm_next) {
@@ -386,7 +381,7 @@ int ni_ifaddrs(struct ni_ifaddrs **ifap, sa_family_t family)
 
 				size_t nlm_struct_size = 0;
 				sa_family_t nlm_family = 0;
-				uint32_t nlm_scope = 0, nlm_index = 0;
+				uint32_t nlm_index = 0;
 				unsigned int nlm_flags;
 				size_t rtasize;
 
@@ -405,7 +400,6 @@ int ni_ifaddrs(struct ni_ifaddrs **ifap, sa_family_t family)
 					ifam = (struct ifaddrmsg *) NLMSG_DATA(nlh);
 					nlm_struct_size = sizeof(*ifam);
 					nlm_family = ifam->ifa_family;
-					nlm_scope = ifam->ifa_scope;
 					nlm_index = ifam->ifa_index;
 					nlm_flags = ifam->ifa_flags;
 					if (family && nlm_family != family)
-- 
1.8.0.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox