Netdev List

* [PATCH 1/3] [RFC] Change virtqueue structure
From: Krishna Kumar @ 2011-02-28  6:34 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: eric.dumazet, arnd, netdev, horms, avi, anthony, kvm,
	Krishna Kumar
In-Reply-To: <20110228063427.24908.63561.sendpatchset@krkumar2.in.ibm.com>

Move queue_index from virtio_pci_vq_info to virtqueue.  This
allows callback handlers to figure out the queue number for
the vq that needs attention.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/virtio/virtio_pci.c |   10 +++-------
 include/linux/virtio.h      |    1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/include/linux/virtio.h new/include/linux/virtio.h
--- org/include/linux/virtio.h	2010-10-11 10:20:22.000000000 +0530
+++ new/include/linux/virtio.h	2011-02-23 16:26:18.000000000 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	int queue_index;	/* the index of the queue */
 	void *priv;
 };
 
diff -ruNp org/drivers/virtio/virtio_pci.c new/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c	2011-01-28 11:38:24.000000000 +0530
+++ new/drivers/virtio/virtio_pci.c	2011-02-25 10:11:22.000000000 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -180,11 +177,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -380,7 +376,6 @@ static struct virtqueue *setup_vq(struct
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -403,6 +398,7 @@ static struct virtqueue *setup_vq(struct
 		goto out_activate_queue;
 	}
 
+	vq->queue_index = index;
 	vq->priv = info;
 	info->vq = vq;
 
@@ -441,7 +437,7 @@ static void vp_del_vq(struct virtqueue *
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
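
As a rough illustration of what the new field allows, here is a minimal
userspace sketch in which one callback serves several queues and uses
vq->queue_index to tell them apart.  The struct below is only a stand-in
for the kernel's struct virtqueue, and demo_dev, demo_vq_callback and
demo_setup_vq are hypothetical names invented for this example; priv is
used for demo state here, which is not how virtio_pci uses it.

```c
/* Userspace stand-in for struct virtqueue; only the fields that
 * matter for this example are modelled. */
struct virtqueue {
	void (*callback)(struct virtqueue *vq);
	const char *name;
	int queue_index;	/* the index of the queue */
	void *priv;
};

/* Hypothetical per-device state: one kick counter per queue. */
#define DEMO_MAX_QUEUES 8
struct demo_dev {
	unsigned long kicks[DEMO_MAX_QUEUES];
};

/* One handler shared by all queues.  It reads vq->queue_index
 * directly instead of reaching into transport-private data. */
void demo_vq_callback(struct virtqueue *vq)
{
	struct demo_dev *dev = vq->priv;

	if (vq->queue_index >= 0 && vq->queue_index < DEMO_MAX_QUEUES)
		dev->kicks[vq->queue_index]++;
}

/* Fill in a queue, recording its index in the virtqueue itself,
 * in the spirit of what setup_vq() does after this patch. */
void demo_setup_vq(struct virtqueue *vq, struct demo_dev *dev,
		   int index, const char *name)
{
	vq->callback = demo_vq_callback;
	vq->name = name;
	vq->queue_index = index;
	vq->priv = dev;
}
```

With this layout a multiqueue driver can register the same handler for
every RX or TX queue and index its per-queue state from the callback.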

^ permalink raw reply

* [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
From: Krishna Kumar @ 2011-02-28  6:34 UTC (permalink / raw)
  To: rusty, davem, mst
  Cc: eric.dumazet, arnd, netdev, horms, avi, anthony, kvm,
	Krishna Kumar

This patch series is a continuation of an earlier one that
implemented guest MQ TX functionality.  This new patchset
implements both RX and TX MQ.  Qemu changes are not being
included at this time solely to aid in easier review.
Compatibility testing with old/new combinations of qemu/guest
and vhost was done without any issues.

Some early TCP/UDP test results are at the bottom of this
post; I plan to submit more test results in the coming days.

Please review and provide feedback on what can improve.

Thanks!

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---


Test configuration:
          Host:  8 Intel Xeon, 8 GB memory
          Guest: 4 cpus, 2 GB memory

Each test case runs for 60 secs; the results below are averaged over
two runs.  Bandwidth numbers are in gbps.  I used default netperf
settings, with no testing/system tuning other than using taskset to
pin each vhost to 0xf (cpus 0-3).  The comparison is the original kernel
vs new kernel with #txqs=8 ("#" refers to number of netperf
sessions).
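
The "(%)" columns in the tables that follow appear to be the relative
change of the second value over the first (for example BW2 against BW1),
truncated rather than rounded to one decimal.  A minimal sketch of that
arithmetic, assuming this reading of the columns; pct_change is a
hypothetical helper and not part of netperf:

```c
/* Percentage change of new_val relative to old_val.  Returns 0 when
 * the baseline is 0 to avoid dividing by zero; the tables also show
 * "(0)" for rows where both values are 0. */
double pct_change(double old_val, double new_val)
{
	if (old_val == 0.0)
		return 0.0;
	return (new_val - old_val) * 100.0 / old_val;
}
```

For the 2-session TCP_STREAM row, pct_change(8774, 11235) gives roughly
28.05, which matches the 28.0 printed in the table.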

_______________________________________________________________________
                TCP: Guest -> Local Host (TCP_STREAM)
#    BW1    BW2 (%)         SD1    SD2 (%)       RSD1    RSD2 (%)
_______________________________________________________________________
1    7190   7170 (-.2)      0      0 (0)          3       4 (33.3)
2    8774   11235 (28.0)    3      3 (0)          16      14 (-12.5)
4    9753   15195 (55.7)    17     21 (23.5)      65      59 (-9.2)
8    10224  18265 (78.6)    71     115 (61.9)     251     240 (-4.3)
16   10749  18123 (68.6)    277    456 (64.6)     985     925 (-6.0)
32   11133  17351 (55.8)    1132   1947 (71.9)    3935    3831 (-2.6)
64   11223  17115 (52.4)    4682   7836 (67.3)    15949   15373 (-3.6)
128  11269  16986 (50.7)    19783  31505 (59.2)   66799   61759 (-7.5)
_______________________________________________________________________
Summary:      BW: 37.6      SD: 61.2      RSD: -6.5


_______________________________________________________________________
                 TCP: Local Host -> Guest (TCP_MAERTS)
#    BW1    BW2 (%)        SD1    SD2 (%)        RSD1    RSD2 (%)
_______________________________________________________________________
1    11490  10870 (-5.3)   0      0 (0)          2       2 (0)
2    10612  10554 (-.5)    2      3 (50.0)       12      12 (0)
4    10047  14320 (42.5)   13     16 (23.0)      53      53 (0)
8    9273   15182 (63.7)   56     84 (50.0)      228     233 (2.1)
16   9291   15853 (70.6)   235    390 (65.9)     934     965 (3.3)
32   9382   15741 (67.7)   969    1823 (88.1)    3868    4037 (4.3)
64   9270   14585 (57.3)   3966   8836 (122.7)   15415   17818 (15.5)
128  8997   14678 (63.1)   17024  36649 (115.2)  64933   72677 (11.9)
_______________________________________________________________________
Summary:      BW: 24.8      SD: 114.6      RSD: 12.1

______________________________________________________
            UDP: Local Host -> Guest (UDP_STREAM)
#      BW1      BW2 (%)        SD1    SD2 (%)
______________________________________________________
1      17236    16585 (-3.7)    1      1 (0)
2      16795    22693 (35.1)    5      6 (20.0)
4      13390    21502 (60.5)    37     36 (-2.7)
8      13261    24361 (83.7)    163    175 (7.3)
16     12772    23796 (86.3)    692    826 (19.3)
32     12832    23880 (86.0)    2812   2871 (2.0)
64     12779    24293 (90.1)    11299  11237 (-.5)
128    13006    24857 (91.1)    44778  43884 (-1.9)
______________________________________________________
Summary:      BW: 37.1      SD: -1.2

^ permalink raw reply

* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-28  4:26 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <AANLkTinxv4JC20YepcMvWi8SL_VTX2sXcRMydD62TT6x@mail.gmail.com>

On Sun, Feb 27, 2011 at 07:45:55PM -0800, Tom Herbert wrote:
> That sounds promising, but receive side will still have problems.
> There is lock contention on the queue as well as cache line bouncing
> on the sock structures.  Also multiple threads sleeping on same socket
> typically leads to asymmetric load across the threads (and
> degenerative cases where receiving thread is woken up and other
> threads have already processed all the packets).  TCP listener threads
> suffer from these same problems.

IOW this is something that we have to solve anyway.  I'm just
being overly cautious here because user-space API changes are
something that we should not enter into lightly.

If this patch was completely internal to the kernel, then I would
have much less of an objection as we can always revert/change it
later on.  With a user-space API we don't have that flexibility.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [V4 PATCH 3/3] bond: service netpoll arp queue on master device
From: David Miller @ 2011-02-28  4:12 UTC (permalink / raw)
  To: amwang; +Cc: linux-kernel, nhorman, herbert, nhorman, eric.dumazet, netdev
In-Reply-To: <4D6B1742.6030108@redhat.com>

From: Cong Wang <amwang@redhat.com>
Date: Mon, 28 Feb 2011 11:32:18 +0800

> On 2011-02-28 08:12, David Miller wrote:
>> From: Amerigo Wang<amwang@redhat.com>
>> Date: Fri, 18 Feb 2011 17:43:34 +0800
>>
>>> Neil pointed out that we can't send ARP reply on behalf of slaves,
>>> we need to move the arp queue to their bond device.
>>>
>>> Signed-off-by: WANG Cong<amwang@redhat.com>
>>
>> Applied.
> 
> Oops! Just found that this one I sent was not a refreshed patch.
> Please discard this one, and use the one below, that is the
> correct one in my git tree and the one that I tested.

Done.

^ permalink raw reply

* Re: SO_REUSEPORT - can it be done in kernel?
From: Tom Herbert @ 2011-02-28  3:45 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, rick.jones2, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110227110614.GA6246@gondor.apana.org.au>

> I disagree completely.
>
> This patch adds a user-space API that we will have to carry
> with us for perpetuity.  I would only support this if we had
> no other way around the problem.
>
> If this does turn out to be mostly due to sendmsg contention
> then fixing it is going to be much simpler than making the UDP
> stack multiqueue capable.
>

That sounds promising, but receive side will still have problems.
There is lock contention on the queue as well as cache line bouncing
on the sock structures.  Also multiple threads sleeping on same socket
typically leads to asymmetric load across the threads (and
degenerative cases where receiving thread is woken up and other
threads have already processed all the packets).  TCP listener threads
suffer from these same problems.

Tom

> I'm working on this right now.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

^ permalink raw reply

* Re: [V4 PATCH 3/3] bond: service netpoll arp queue on master device
From: Cong Wang @ 2011-02-28  3:32 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, nhorman, herbert, nhorman, eric.dumazet, netdev
In-Reply-To: <20110227.161225.200373200.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 569 bytes --]

On 2011-02-28 08:12, David Miller wrote:
> From: Amerigo Wang<amwang@redhat.com>
> Date: Fri, 18 Feb 2011 17:43:34 +0800
>
>> Neil pointed out that we can't send ARP reply on behalf of slaves,
>> we need to move the arp queue to their bond device.
>>
>> Signed-off-by: WANG Cong<amwang@redhat.com>
>
> Applied.

Oops! Just found that this one I sent was not a refreshed patch.
Please discard this one, and use the one below, that is the
correct one in my git tree and the one that I tested.

Sorry for this.

----

Signed-off-by: WANG Cong <amwang@redhat.com>

[-- Attachment #2: bond-move-arp-queue-to-master.diff --]
[-- Type: text/plain, Size: 567 bytes --]

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index f68e694..06be243 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -193,6 +193,17 @@ void netpoll_poll_dev(struct net_device *dev)
 
 	poll_napi(dev);
 
+	if (dev->priv_flags & IFF_SLAVE) {
+		if (dev->npinfo) {
+			struct net_device *bond_dev = dev->master;
+			struct sk_buff *skb;
+			while ((skb = skb_dequeue(&dev->npinfo->arp_tx))) {
+				skb->dev = bond_dev;
+				skb_queue_tail(&bond_dev->npinfo->arp_tx, skb);
+			}
+		}
+	}
+
 	service_arp_queue(dev->npinfo);
 
 	zap_completion_queue();
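
The queue hand-off in the hunk above can be sketched in userspace as
follows.  This is only an illustration: demo_skb, demo_queue and demo_dev
are hypothetical stand-ins for sk_buff, sk_buff_head and net_device, and
the locking that skb_dequeue() and skb_queue_tail() do in the kernel is
left out.

```c
#include <stddef.h>

struct demo_dev;

/* Minimal stand-ins for sk_buff and sk_buff_head. */
struct demo_skb {
	struct demo_skb *next;
	struct demo_dev *dev;
};

struct demo_queue {
	struct demo_skb *head;
	struct demo_skb *tail;
	unsigned int qlen;
};

struct demo_dev {
	struct demo_dev *master;	/* bond device, or NULL */
	struct demo_queue arp_tx;
};

/* Append skb at the tail of q. */
void demo_queue_tail(struct demo_queue *q, struct demo_skb *skb)
{
	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
	q->qlen++;
}

/* Remove and return the skb at the head of q, or NULL if empty. */
struct demo_skb *demo_dequeue(struct demo_queue *q)
{
	struct demo_skb *skb = q->head;

	if (!skb)
		return NULL;
	q->head = skb->next;
	if (!q->head)
		q->tail = NULL;
	skb->next = NULL;
	q->qlen--;
	return skb;
}

/* Move every pending ARP reply from a slave to its master, keeping
 * FIFO order and retargeting skb->dev.  Returns the number moved. */
unsigned int demo_move_arp_to_master(struct demo_dev *slave)
{
	struct demo_skb *skb;
	unsigned int moved = 0;

	if (!slave->master)
		return 0;
	while ((skb = demo_dequeue(&slave->arp_tx))) {
		skb->dev = slave->master;
		demo_queue_tail(&slave->master->arp_tx, skb);
		moved++;
	}
	return moved;
}
```

After the move the slave's queue is empty, so servicing the ARP queue on
the slave has nothing left to send and the replies go out via the bond.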

^ permalink raw reply related

* Re: net-next: warnings from sysctl_net_exit
From: Lucian Adrian Grijincu @ 2011-02-28  1:11 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David S. Miller, netdev
In-Reply-To: <20110227160852.4f748c1f@nehalam>

On Mon, Feb 28, 2011 at 2:08 AM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
>> The check is triggered at network namespace deletion, so a moment
>> before deleting the netns should be fine.
>
> Although the kernel was compiled with netns, I never use net namespaces.

This can be triggered at shutdown when the init_net network namespace
is dismantled.

Are you getting this at shutdown time? If not, is it easily reproducible?
Can you post a .config?

-- 
 .
..: Lucian

^ permalink raw reply

* Re: net-next: warnings from sysctl_net_exit
From: Lucian Adrian Grijincu @ 2011-02-28  0:49 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev, ebiederm
In-Reply-To: <20110227.153418.245395393.davem@davemloft.net>

On Mon, Feb 28, 2011 at 1:34 AM, David Miller <davem@davemloft.net> wrote:
> From: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
>> David: it looks like someone registered a /proc/sys table with
>> register_net_sysctl_table but forgot to remove it (or someone wrote
>> something in the 'struct net*' and buffer overflowed into
>> &net->sysctls.list).
>
> Hmmm, we might therefore want to inspect this commit carefully:
>
> commit bf36076a67db6d7423d09d861a072337866f0dd9
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Mon Jan 31 20:54:17 2011 -0800
>
>    net: Fix ipv6 neighbour unregister_sysctl_table warning


I'm not sure why we need that empty entry in the sysctl table anyway.

I just ran net-next with the following patch, and
register_net_sysctl_table managed to add /proc/sys/net/ipv6/neigh/
because we add 'default' and 'all' entries, followed by 'lo' for every
netns.


diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 7cb65ef..8e8b107 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -37,12 +37,6 @@ static ctl_table ipv6_table_template[] = {
                .mode           = 0644,
                .proc_handler   = proc_dointvec
        },
-       {
-               .procname       = "neigh",
-               .maxlen         = 0,
-               .mode           = 0555,
-               .child          = empty,
-       },
        { }
 };



What are the benefits of this empty entry?

Is there an assumption that we add this empty
'/proc/sys/net/ipv6/neigh/' entry *before* adding entries for
"/proc/sys/net/ipv6/neigh/default/" or
"/proc/sys/net/ipv6/neigh/all/"?

Because we register the empty '/proc/sys/net/ipv6/neigh/' entry
*after* registering these two of its children, and I'm not sure how
this will affect the attachment of ctl_table_header corresponding to
these ctl_table trees.


With the following patch I get this ordering (for both init_net and a new net):
  CALL addrconf_init_net (create default/all)
  CALL ipv6_sysctl_net_init (create empty entry).


diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3daaf3c..5fe402e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4564,6 +4564,8 @@ static int __net_init addrconf_init_net(struct net *net)
 	int err;
 	struct ipv6_devconf *all, *dflt;

+	printk(KERN_ALERT "CALL addrconf_init_net\n");
+
 	err = -ENOMEM;
 	all = &ipv6_devconf;
 	dflt = &ipv6_devconf_dflt;
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 7cb65ef..880d64a 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -71,6 +65,8 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
 	struct ctl_table *ipv6_icmp_table;
 	int err;

+	printk(KERN_ALERT "CALL ipv6_sysctl_net_init\n");
+
 	err = -ENOMEM;
 	ipv6_table = kmemdup(ipv6_table_template, sizeof(ipv6_table_template),
 			     GFP_KERNEL);


-- 
 .
..: Lucian

^ permalink raw reply related

* Re: [V4 PATCH 3/3] bond: service netpoll arp queue on master device
From: David Miller @ 2011-02-28  0:12 UTC (permalink / raw)
  To: amwang; +Cc: linux-kernel, nhorman, herbert, nhorman, eric.dumazet, netdev
In-Reply-To: <1298022215-21059-3-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 18 Feb 2011 17:43:34 +0800

> Neil pointed out that we can't send ARP reply on behalf of slaves,
> we need to move the arp queue to their bond device.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>

Applied.

^ permalink raw reply

* Re: [V4 PATCH 2/3] netpoll: remove IFF_IN_NETPOLL flag
From: David Miller @ 2011-02-28  0:12 UTC (permalink / raw)
  To: amwang
  Cc: linux-kernel, nhorman, herbert, fubar, horms, jpirko, kaber,
	nhorman, eric.dumazet, netdev
In-Reply-To: <1298022215-21059-2-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 18 Feb 2011 17:43:33 +0800

> V4: rebase to net-next-2.6
> 
> This patch removes the flag IFF_IN_NETPOLL, we don't need it any more since
> we have netpoll_tx_running() now.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>

Applied.

^ permalink raw reply

* Re: [V4 PATCH 1/3] bonding: sync netpoll code with bridge
From: David Miller @ 2011-02-28  0:12 UTC (permalink / raw)
  To: amwang; +Cc: linux-kernel, nhorman, herbert, fubar, netdev
In-Reply-To: <1298022215-21059-1-git-send-email-amwang@redhat.com>

From: Amerigo Wang <amwang@redhat.com>
Date: Fri, 18 Feb 2011 17:43:32 +0800

> V4: rebase to net-next-2.6
> V3: remove a useless #ifdef.
> 
> This patch unifies the netpoll code in bonding with netpoll code in bridge,
> thanks to Herbert that code is much cleaner now.
> 
> Signed-off-by: WANG Cong <amwang@redhat.com>

Applied.

^ permalink raw reply

* Re: net-next: warnings from sysctl_net_exit
From: Stephen Hemminger @ 2011-02-28  0:08 UTC (permalink / raw)
  To: Lucian Adrian Grijincu; +Cc: David S. Miller, netdev
In-Reply-To: <AANLkTi=mhj3Ftq8hFPPzZYJvp8mgOeV89oX_vHZVxwzY@mail.gmail.com>

On Mon, 28 Feb 2011 00:37:09 +0200
Lucian Adrian Grijincu <lucian.grijincu@gmail.com> wrote:

> Stephen Hemminger <shemminger <at> vyatta.com> writes:
> > [26207.669740]  [<ffffffff814154ad>] ? sysctl_net_exit+0x2a/0x2c
> > [26207.669742]  [<ffffffff8136144e>] ? ops_exit_list+0x2a/0x5b
> > [26207.669745]  [<ffffffff813618f0>] ? cleanup_net+0xfa/0x19a
> 
> 
> David: it looks like someone registered a /proc/sys table with
> register_net_sysctl_table but forgot to remove it (or someone wrote
> something in the 'struct net*' and buffer overflowed into
> &net->sysctls.list).
> 
> Stephen, can you post a `ls -R /proc/sys/net/` from before the dmesg
> message appeared?
> 
> The check is triggered at network namespace deletion, so a moment
> before deleting the netns should be fine.

Although the kernel was compiled with netns, I never use net namespaces.

^ permalink raw reply

* Re: [PATCH] bonding: added 802.3ad round-robin hashing policy and source mac selection mode
From: David Miller @ 2011-02-28  0:08 UTC (permalink / raw)
  To: olegu; +Cc: netdev, fubar
In-Reply-To: <20110216191341.GA14920@yandex-team.ru>

From: "Oleg V. Ukhno" <olegu@yandex-team.ru>
Date: Wed, 16 Feb 2011 22:13:41 +0300

> This patch introduces two new (related) features to the bonding module.
> The first feature is a round-robin hashing policy, which is primarily
> intended for use with 802.3ad mode, and puts each successive IPv4 and
> IPv6 packet into the next available slave without taking into account
> which layer 3 and above protocol is used.
> The second feature makes it possible to choose which MAC address will
> be set in the transmitted packet: when set to src-mac, it forces the
> slave interface's real MAC address to be used as the source MAC address
> in every packet sent via this slave interface.

Can we get some feedback on this patch from bonding folks?

I'm not applying it blindly without at least one bonding developer
saying it at least looks ok.

Thanks.

^ permalink raw reply

* Re: [PATCH net-2.6 2/2] be2net: remove netif_stop_queue being called before register_netdev.
From: David Miller @ 2011-02-28  0:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ajit.khaparde, netdev
In-Reply-To: <1298560543.2814.4.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 24 Feb 2011 16:15:43 +0100

> On Tuesday 01 February 2011 at 15:42 -0800, David Miller wrote:
>> From: Ajit Khaparde <ajit.khaparde@emulex.com>
>> Date: Mon, 31 Jan 2011 17:27:55 -0600
>> 
>> > It is illegal to call netif_stop_queue before register_netdev.
>> > 
>> > Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
>> 
>> Applied, thanks.
> 
> Not sure if this patch is queued for stable, I hit the bug (a Warning
> actually) on 2.6.37.1

I've added this to my -stable queue, thanks.

^ permalink raw reply

* Re: [PATCH net-2.6] bnx2x: Add a missing bit for PXP parity register of 57712.
From: David Miller @ 2011-02-28  0:06 UTC (permalink / raw)
  To: vladz; +Cc: netdev, eilong
In-Reply-To: <201102201627.05803.vladz@broadcom.com>

From: "Vlad Zolotarov" <vladz@broadcom.com>
Date: Sun, 20 Feb 2011 16:27:05 +0200

> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Arnd Bergmann @ 2011-02-27 20:22 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Ben Hutchings, David Miller, segoon, netdev, linux-kernel, kuznet,
	pekkas, jmorris, yoshfuji, kaber, eric.dumazet, therbert, xiaosuo,
	jesse, kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <AANLkTikmtKksP6o1hecf0=ZnRv0j4+_Q326UEG8ss16k@mail.gmail.com>

On Friday 25 February 2011, Michał Mirosław wrote:
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 54aaca6..0d09baa 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1120,8 +1120,20 @@ void dev_load(struct net *net, const char *name)
> >        dev = dev_get_by_name_rcu(net, name);
> >        rcu_read_unlock();
> >
> > -       if (!dev && capable(CAP_NET_ADMIN))
> > -               request_module("%s", name);
> > +       if (!dev && capable(CAP_NET_ADMIN)) {
> > +               /* Check whether the name looks like one that a net
> > +                * driver will generate initially.  If not, require a
> > +                * module alias with a suitable prefix, so that this
> > +                * can't be used to load arbitrary modules.
> > +                */
> > +               if ((strncmp(name, "eth", 3) == 0 &&
> > +                    isdigit((unsigned char)name[3])) ||
> > +                   (strncmp(name, "wlan", 4) == 0 &&
> > +                    isdigit((unsigned char)name[4])))
> > +                       request_module("%s", name);
> > +               else
> > +                       request_module("netdev-%s", name);
> > +       }
> >  }
> >  EXPORT_SYMBOL(dev_load);
> >
> 
> This might be better as:
> 
> if (request_module("netdev-%s", name))
>     ... fallback
> 
> Then after some years the fallback could be removed if announced properly.

The backwards compatibility should mostly be for systems that today don't
use split capabilities, right?

The fallback could therefore rely on CAP_SYS_MODULE as well:

	if (request_module("netdev-%s", name)) {
		if (capable(CAP_SYS_MODULE))
			request_module("%s", name);
	}

Not a 100% solution, but it should solve the capability escalation nicely without
causing much pain.
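
A minimal userspace sketch of that fallback order, for illustration only.
request_module() is replaced by a callback that returns 0 on success,
CAP_SYS_MODULE is modelled as a plain flag, and the outer CAP_NET_ADMIN
check is left out; demo_dev_load and demo_only_legacy_sit0 are
hypothetical names.  Unlike the snippet above, the sketch also returns a
status so that the outcome can be checked.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for request_module(): 0 on success, non-zero on failure. */
typedef int (*demo_request_fn)(const char *modname);

/* Try the "netdev-" alias first; fall back to the bare name only if
 * the caller also holds CAP_SYS_MODULE (modelled as a flag).
 * Returns 0 if some request succeeded, -1 otherwise. */
int demo_dev_load(const char *name, int has_cap_sys_module,
		  demo_request_fn request)
{
	char alias[64];
	int n;

	n = snprintf(alias, sizeof(alias), "netdev-%s", name);
	if (n < 0 || (size_t)n >= sizeof(alias))
		return -1;	/* name too long for this sketch */

	if (request(alias) == 0)
		return 0;
	if (has_cap_sys_module && request(name) == 0)
		return 0;
	return -1;
}

/* Demo backend: pretend that only a legacy, unprefixed "sit0"
 * alias exists on the system. */
int demo_only_legacy_sit0(const char *modname)
{
	return strcmp(modname, "sit0") == 0 ? 0 : 1;
}
```

With that backend, demo_dev_load("sit0", 0, demo_only_legacy_sit0) fails
while the same call with the flag set succeeds, which is the behaviour
the fallback is meant to give to systems without split capabilities.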

	Arnd


^ permalink raw reply

* Re: net-next: warnings from sysctl_net_exit
From: David Miller @ 2011-02-27 23:34 UTC (permalink / raw)
  To: lucian.grijincu; +Cc: shemminger, netdev, ebiederm
In-Reply-To: <AANLkTi=mhj3Ftq8hFPPzZYJvp8mgOeV89oX_vHZVxwzY@mail.gmail.com>

From: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Date: Mon, 28 Feb 2011 00:37:09 +0200

> Stephen Hemminger <shemminger <at> vyatta.com> writes:
>> [26207.669740]  [<ffffffff814154ad>] ? sysctl_net_exit+0x2a/0x2c
>> [26207.669742]  [<ffffffff8136144e>] ? ops_exit_list+0x2a/0x5b
>> [26207.669745]  [<ffffffff813618f0>] ? cleanup_net+0xfa/0x19a
> 
> 
> David: it looks like someone registered a /proc/sys table with
> register_net_sysctl_table but forgot to remove it (or someone wrote
> something in the 'struct net*' and buffer overflowed into
> &net->sysctls.list).

Hmmm, we might therefore want to inspect this commit carefully:

commit bf36076a67db6d7423d09d861a072337866f0dd9
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Jan 31 20:54:17 2011 -0800

    net: Fix ipv6 neighbour unregister_sysctl_table warning
    
    In my testing of 2.6.37 I was occasionally getting a warning about
    sysctl table entries being unregistered in the wrong order.  Digging
    in it turns out this dates back to the last great sysctl reorg done
    where Al Viro introduced the requirement that sysctl directories
    needed to be created before and destroyed after the files in them.
    
    It turns out that in that great reorg /proc/sys/net/ipv6/neigh was
    overlooked.  So this patch fixes that oversight and makes an annoying
    warning message go away.
    
    >------------[ cut here ]------------
    >WARNING: at kernel/sysctl.c:1992 unregister_sysctl_table+0x134/0x164()
    >Pid: 23951, comm: kworker/u:3 Not tainted 2.6.37-350888.2010AroraKernelBeta.fc14.x86_64 #1
    >Call Trace:
    > [<ffffffff8103e034>] warn_slowpath_common+0x80/0x98
    > [<ffffffff8103e061>] warn_slowpath_null+0x15/0x17
    > [<ffffffff810452f8>] unregister_sysctl_table+0x134/0x164
    > [<ffffffff810e7834>] ? kfree+0xc4/0xd1
    > [<ffffffff813439b2>] neigh_sysctl_unregister+0x22/0x3a
    > [<ffffffffa02cd14e>] addrconf_ifdown+0x33f/0x37b [ipv6]
    > [<ffffffff81331ec2>] ? skb_dequeue+0x5f/0x6b
    > [<ffffffffa02ce4a5>] addrconf_notify+0x69b/0x75c [ipv6]
    > [<ffffffffa02eb953>] ? ip6mr_device_event+0x98/0xa9 [ipv6]
    > [<ffffffff813d2413>] notifier_call_chain+0x32/0x5e
    > [<ffffffff8105bdea>] raw_notifier_call_chain+0xf/0x11
    > [<ffffffff8133cdac>] call_netdevice_notifiers+0x45/0x4a
    > [<ffffffff8133d2b0>] rollback_registered_many+0x118/0x201
    > [<ffffffff8133d3af>] unregister_netdevice_many+0x16/0x6d
    > [<ffffffff8133d571>] default_device_exit_batch+0xa4/0xb8
    > [<ffffffff81337c42>] ? cleanup_net+0x0/0x194
    > [<ffffffff81337a2a>] ops_exit_list+0x4e/0x56
    > [<ffffffff81337d36>] cleanup_net+0xf4/0x194
    > [<ffffffff81053318>] process_one_work+0x187/0x280
    > [<ffffffff8105441b>] worker_thread+0xff/0x19f
    > [<ffffffff8105431c>] ? worker_thread+0x0/0x19f
    > [<ffffffff8105776d>] kthread+0x7d/0x85
    > [<ffffffff81003824>] kernel_thread_helper+0x4/0x10
    > [<ffffffff810576f0>] ? kthread+0x0/0x85
    > [<ffffffff81003820>] ? kernel_thread_helper+0x0/0x10
    >---[ end trace 8a7e9310b35e9486 ]---
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Albert Cahalan @ 2011-02-27 23:33 UTC (permalink / raw)
  To: Jussi Kivilinna; +Cc: Eric Dumazet, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <20110227125540.40754c5y78j9u2m8@hayate.sektori.org>

On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
<jussi.kivilinna@mbnet.fi> wrote:

> I made a simple hack on sch_fifo with per-packet time limits (attachment) this
> weekend and have been doing limited testing on a wireless link. I think a
> hard limit is fine; it's simple and does much the same as what a
> packet(-hard)limited buffer does: it drops packets when the buffer is 'full'. My
> hack checks for timed-out packets on enqueue, which might be the wrong approach (on
> the other hand, it might allow some more burstiness).

Thanks!

I think the default is too high. 1 ms may even be a bit high.

I suppose there is a need to allow at least 2 packets despite any
time limits, so that it remains possible to use a traditional modem
even if a huge packet takes several seconds to send.
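
To make the idea concrete, here is a rough userspace sketch of a FIFO
that drops stale packets from the head at enqueue time but never lets
the time limit shrink the queue below a minimum number of packets.  It
is not the attached sch_fifo hack; demo_tfifo, DEMO_CAP and the field
names are made up for this example, and timestamps are assumed to be
monotonic nanoseconds.

```c
#include <stdint.h>

#define DEMO_CAP 64	/* hypothetical packet limit of the FIFO */

struct demo_tfifo {
	uint64_t stamp[DEMO_CAP];	/* enqueue time of each packet, ns */
	unsigned int head;		/* index of the oldest packet */
	unsigned int qlen;		/* packets currently queued */
	uint64_t limit_ns;		/* maximum queueing time */
	unsigned int min_keep;		/* time limit never goes below this */
};

/* Drop stale packets from the head, then queue a packet stamped now.
 * Returns the number of stale packets dropped, or -1 if the new
 * packet itself was tail-dropped because the FIFO is full. */
int demo_tfifo_enqueue(struct demo_tfifo *q, uint64_t now)
{
	int dropped = 0;

	while (q->qlen > q->min_keep &&
	       now - q->stamp[q->head] > q->limit_ns) {
		q->head = (q->head + 1) % DEMO_CAP;
		q->qlen--;
		dropped++;
	}

	if (q->qlen == DEMO_CAP)
		return -1;

	q->stamp[(q->head + q->qlen) % DEMO_CAP] = now;
	q->qlen++;
	return dropped;
}
```

With limit_ns set to 1 ms and min_keep set to 2, a slow link that needs
seconds per packet still keeps two packets queued, while a fast link
sheds anything that has waited longer than the limit.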

^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: David Miller @ 2011-02-27 23:22 UTC (permalink / raw)
  To: jpirko
  Cc: fubar, nicolas.2p.debian, kaber, eric.dumazet, netdev, shemminger,
	andy, anna.fischer
In-Reply-To: <20110227125816.GB2814@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Sun, 27 Feb 2011 13:58:17 +0100

> That is very true. And given that af_packet uses orig_dev to obtain
> ifindex, it can be replaced by skb->skb_iif. That way we can get rid of
> orig_dev parameter for good.

I would rather see a complete patch set submitted as a unit, thanks.

I've already marked your V3 last night as "changes requested" in
patchwork for this reason.

^ permalink raw reply

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: David Miller @ 2011-02-27 23:19 UTC (permalink / raw)
  To: segoon
  Cc: bhutchings, netdev, linux-kernel, kuznet, pekkas, jmorris,
	yoshfuji, kaber, eric.dumazet, therbert, xiaosuo, jesse,
	kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <20110227114438.GA4317@albatros>

From: Vasiliy Kulikov <segoon@openwall.com>
Date: Sun, 27 Feb 2011 14:44:38 +0300

> Then the things are still broken - a user has to update modprobe
> together with the kernel, otherwise the updated kernel would call
> "modprobe" with unsupported argument and even "sit0" wouldn't work.

The capability bits get passed on the modprobe command line.

The module loading system call in the kernel inspects the command
line looking for the argument, and uses it to validate the module
load by comparing the mask with the special ELF section.

^ permalink raw reply

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: David Miller @ 2011-02-27 23:18 UTC (permalink / raw)
  To: segoon
  Cc: bhutchings, netdev, linux-kernel, kuznet, pekkas, jmorris,
	yoshfuji, kaber, eric.dumazet, therbert, xiaosuo, jesse,
	kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <20110227114438.GA4317@albatros>

From: Vasiliy Kulikov <segoon@openwall.com>
Date: Sun, 27 Feb 2011 14:44:38 +0300

>    d) run modprobe with CAP_NET_ADMIN only

This is not part of my scheme.

The module loading will run with existing module loading privileges,
the "allowed capability" bits will be passed along back into the
kernel at module load time (via modprobe arguments or similar)
and validated by the kernel as it walks the ELF sections anyways
to perform relocations and whatnot.

^ permalink raw reply

* Re: net-next: warnings from sysctl_net_exit
From: Lucian Adrian Grijincu @ 2011-02-27 22:37 UTC (permalink / raw)
  To: Stephen Hemminger, David S. Miller, netdev

Stephen Hemminger <shemminger <at> vyatta.com> writes:
> [26207.669740]  [<ffffffff814154ad>] ? sysctl_net_exit+0x2a/0x2c
> [26207.669742]  [<ffffffff8136144e>] ? ops_exit_list+0x2a/0x5b
> [26207.669745]  [<ffffffff813618f0>] ? cleanup_net+0xfa/0x19a

David: it looks like someone registered a /proc/sys table with
register_net_sysctl_table but forgot to remove it (or someone wrote
something in the 'struct net*' and buffer overflowed into
&net->sysctls.list).

Stephen, can you post a `ls -R /proc/sys/net/` from before the dmesg
message appeared?

The check is triggered at network namespace deletion, so a moment
before deleting the netns should be fine.

-- 
 .
..: Lucian

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Jussi Kivilinna @ 2011-02-27 21:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Albert Cahalan, Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <1298837273.8726.128.camel@edumazet-laptop>

Quoting Eric Dumazet <eric.dumazet@gmail.com>:

> On Sunday 27 February 2011 at 12:55 +0200, Jussi Kivilinna wrote:
>> Quoting Albert Cahalan <acahalan@gmail.com>:
>>
>> > On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> On Sunday, 27 February 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> >>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>> >>>
>> >>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> >>> > more than 4.2 seconds (32-bit unsigned) of queue.
>> > ...
>> >> Problem is some machines have slow High Resolution timing services.
>> >>
>> >> _If_ we have a time limit, it will probably use the low resolution (aka
>> >> jiffies), unless high resolution services are cheap.
>> >
>> > As long as that is totally internal to the kernel and never
>> > getting exposed by some API for setting the amount, sure.
>> >
>> >> I was thinking not having an absolute hard limit, but an EWMA based one.
>> >
>> > The whole point is to prevent stale packets, especially to prevent
>> > them from messing with TCP, so I really don't think so. I suppose
>> > you do get this to some extent via early drop.
>>
>> I made a simple hack on sch_fifo with per-packet time limits
>> (attachment) this weekend and have been doing limited testing on a
>> wireless link. I think a hard limit is fine: it's simple and does
>> much the same as a packet-count (hard-)limited buffer, which drops
>> packets when the buffer is 'full'. My hack checks for timed-out
>> packets on enqueue, which might be the wrong approach (on the other
>> hand, it might allow somewhat more burstiness).
>>
>
>
> A qdisc should tell the caller whether the packet was queued or
> dropped at enqueue() time... not later (i.e. never)

OK, it is an ugly hack ;) I got the idea of dropping from the head from pfifo_head_drop.

>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Ok.

>
> This is why I suggested using an EWMA plus a probabilistic drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Ok.
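
To make the suggestion above concrete, here is a minimal userspace sketch,
not kernel code and not the attached hack. It keeps an EWMA of packet
sojourn time, reports congestion to the caller at enqueue time when that
average is above a target, and enforces the absolute age limit only at
dequeue time. Every name in it (tq, tq_enqueue, tq_dequeue, the TQ_*
verdicts, the 1/8 EWMA weight, the ring size) is made up for illustration;
the verdicts are only loosely modelled on NET_XMIT_CN and friends, and a
deterministic congestion signal stands in for the probabilistic drop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts, loosely modelled on the kernel's NET_XMIT_* codes. */
enum tq_verdict { TQ_QUEUED = 0, TQ_DROPPED = 1, TQ_CONGESTED = 2 };

#define TQ_CAP 64 /* ring capacity in packets (assumption) */

struct tq_pkt {
	int id;
	uint64_t enq_ns; /* time the packet entered the queue */
};

struct tq {
	struct tq_pkt ring[TQ_CAP];
	size_t head, len;
	uint64_t ewma_ns;   /* smoothed sojourn time */
	uint64_t target_ns; /* EWMA level above which enqueue signals congestion */
	uint64_t limit_ns;  /* absolute age limit, enforced at dequeue */
};

static void tq_init(struct tq *q, uint64_t target_ns, uint64_t limit_ns)
{
	q->head = 0;
	q->len = 0;
	q->ewma_ns = 0;
	q->target_ns = target_ns;
	q->limit_ns = limit_ns;
}

/* Enqueue: the caller learns the verdict immediately.
 * TQ_CONGESTED means "queued, but please slow down". */
static enum tq_verdict tq_enqueue(struct tq *q, int id, uint64_t now_ns)
{
	struct tq_pkt *slot;

	if (q->len == TQ_CAP)
		return TQ_DROPPED;
	slot = &q->ring[(q->head + q->len) % TQ_CAP];
	slot->id = id;
	slot->enq_ns = now_ns;
	q->len++;
	return q->ewma_ns > q->target_ns ? TQ_CONGESTED : TQ_QUEUED;
}

/* Dequeue: update the EWMA with each departing packet's sojourn time and
 * skip packets older than limit_ns. Returns true and sets *id when a
 * fresh packet is found, false when the queue runs empty. */
static bool tq_dequeue(struct tq *q, uint64_t now_ns, int *id)
{
	while (q->len > 0) {
		struct tq_pkt p = q->ring[q->head];
		uint64_t sojourn = now_ns - p.enq_ns;

		q->head = (q->head + 1) % TQ_CAP;
		q->len--;
		/* ewma = 7/8 * ewma + 1/8 * sample, in unsigned arithmetic */
		q->ewma_ns = q->ewma_ns - q->ewma_ns / 8 + sojourn / 8;
		if (sojourn <= q->limit_ns) {
			*id = p.id;
			return true;
		}
		/* stale packet: discarded here, at dequeue time */
	}
	return false;
}
```

The point of the split is that the sender gets its feedback at enqueue()
time from the smoothed delay, while the hard limit remains as a backstop
for bursts or pauses on the wire.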




* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-27 20:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, kaber, eric.dumazet, netdev, shemminger, fubar,
	andy
In-Reply-To: <20110227200628.GA2984@psychotron.redhat.com>

On 27/02/2011 21:06, Jiri Pirko wrote:
> Sun, Feb 27, 2011 at 03:17:01PM CET, nicolas.2p.debian@gmail.com wrote:

>>> +	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
>>> +		skb->deliver_no_wcard = 1;
>>> +		return skb;
>>
>> Shouldn't we return NULL here ?
>
> No, we shouldn't. We need the skb to be delivered to exact match.

So, if I understand properly:

- If skb->dev changed, loop,
- else, if skb->deliver_no_wcard, do exact match delivery only,
- else, if !skb, drop the frame, without any exact match delivery,
- else, do normal delivery.

Right?
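
The four cases above could be sketched as a small decision function. This
is only my reading of the list, not the code from the patch: the stub
types, the rx_action names and classify_rx() are all hypothetical, and the
NULL test is moved first simply because the other two tests dereference
skb.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct net_device_stub {
	int ifindex;
};

struct sk_buff_stub {
	struct net_device_stub *dev;
	unsigned int deliver_no_wcard : 1;
};

/* Hypothetical outcomes, one per case in the list above. */
enum rx_action { RX_DROP, RX_LOOP, RX_EXACT_ONLY, RX_NORMAL };

/* Decide what the receive path does with the skb returned by the
 * rx_handler; orig_dev is the device the skb had before the handler ran. */
static enum rx_action classify_rx(const struct sk_buff_stub *skb,
				  const struct net_device_stub *orig_dev)
{
	if (!skb)
		return RX_DROP;       /* handler consumed or dropped the frame */
	if (skb->dev != orig_dev)
		return RX_LOOP;       /* device changed: run another round */
	if (skb->deliver_no_wcard)
		return RX_EXACT_ONLY; /* exact match delivery only */
	return RX_NORMAL;             /* normal delivery */
}
```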

>> The vlan_on_bond case used to be cost effective. Now, we clone the skb and call netif_rx...
>
> This should not cost too much overhead considering only few packets are
> going thru this. This hook shouldn't have existed in the first place. I
> think introducing this functionality was a big mistake.

What would you have proposed instead?

Anyway, I think the feature is broken, because it wouldn't provide the expected effect on the 
following configuration:

eth0/eth1 -> bond0 -> br0 -> br0.100.

We probably need a more general way to fix this, after your patch has been accepted.

[snip]

>> I would instead consider NULL as meaning exact-match-delivery-only.
>> (The same effect as dev_bond_should_drop() returning true).
>
> we can change the behaviour later on.

Agreed.

	Nicolas.
