Netdev List
 help / color / mirror / Atom feed
* [PATCH net] tcp: flush DMA queue before sk_wait_data if rcv_wnd is zero
From: Michal Kubecek @ 2012-09-14 14:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, mkubecek

If recv() syscall is called for a TCP socket so that
  - IOAT DMA is used
  - MSG_WAITALL flag is used
  - requested length is bigger than sk_rcvbuf
  - enough data has already arrived to bring rcv_wnd to zero
then when tcp_recvmsg() gets to calling sk_wait_data(), receive
window can be still zero while sk_async_wait_queue exhausts
enough space to keep it zero. As this queue isn't cleaned until
the tcp_service_net_dma() call, sk_wait_data() cannot receive
any data and blocks forever.

If zero receive window and non-empty sk_async_wait_queue is
detected before calling sk_wait_data(), process the queue first.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Cc: <stable@vger.kernel.org>
---
 net/ipv4/tcp.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2109ff4..bf9a8ab 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1762,8 +1762,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		}
 
 #ifdef CONFIG_NET_DMA
-		if (tp->ucopy.dma_chan)
-			dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+		if (tp->ucopy.dma_chan) {
+			if (tp->rcv_wnd == 0 &&
+			    !skb_queue_empty(&sk->sk_async_wait_queue)) {
+				tcp_service_net_dma(sk, true);
+				tcp_cleanup_rbuf(sk, copied);
+			} else
+				dma_async_memcpy_issue_pending(tp->ucopy.dma_chan);
+		}
 #endif
 		if (copied >= target) {
 			/* Do not sleep, just process backlog. */
-- 
1.7.10.4

^ permalink raw reply related

* RE: e1000e: Disabling IRQ #57 .. machine dies
From: Wyborny, Carolyn @ 2012-09-14 16:40 UTC (permalink / raw)
  To: Cristian Rodríguez, Netdev
In-Reply-To: <504F87F1.5040004@opensuse.org>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Cristian Rodríguez
> Sent: Tuesday, September 11, 2012 11:50 AM
> To: Netdev
> Subject: e1000e: Disabling IRQ #57 .. machine dies
> 
> Hi:
> 
> In current linus tree, in a machine using the e1000e driver
> 
> hwinfo --netcard
> 22: PCI 400.0: 0200 Ethernet controller
>   [Created at pci.319]
>   Unique ID: rBUF.cuT2KaC3ZS1
>   Parent ID: HnsE.Z52SDLRIaH2
>   SysFS ID: /devices/pci0000:00/0000:00:1c.5/0000:04:00.0
>   SysFS BusID: 0000:04:00.0
>   Hardware Class: network
>   Model: "Intel 82574L Gigabit Network Connection"
>   Vendor: pci 0x8086 "Intel Corporation"
>   Device: pci 0x10d3 "82574L Gigabit Network Connection"
>   SubVendor: pci 0x1043 "ASUSTeK Computer Inc."
>   SubDevice: pci 0x8369
>   Driver: "e1000e"
>   Driver Modules: "e1000e"
>   Device File: eth0
>   Memory Range: 0xf7d00000-0xf7d1ffff (rw,non-prefetchable)
>   I/O Ports: 0xd000-0xdfff (rw)
>   Memory Range: 0xf7d20000-0xf7d23fff (rw,non-prefetchable)
>   IRQ: 17 (no events)
>   Link detected: yes
>   Module Alias: "pci:v00008086d000010D3sv00001043sd00008369bc02sc00i00"
>   Driver Info #0:
>     Driver Status: e1000e is active
>     Driver Activation Cmd: "modprobe e1000e"
>   Config Status: cfg=no, avail=yes, need=no, active=unknown
>   Attached to: #15 (PCI bridge)
> 
> 
> 
> cat /proc/interrupts
> 
> 57 IR-PCI-MSI-edge      eth0
> 
> the driver works after a while, but then a single message
> 
> "Disabling IRQ #57"
> 

Hello,

I'm sorry you're having problems with out parts.  Can you provide some more info on this issue?  What version of the driver are you using?  Is the only problem that this driver dies or are there other symptoms?  What does the dmesg say for the e1000e driver at load?  It usually list whether its using MSI/MSI-X, and some other info.  Can you provide an lspci -vvv output for the 82574L device and the dmesg info?  Is the "Disabling IRQ #57 message appearing in the dmesg log or somewhere else?  Can you try loading the driver with the module parameter IntMode=0 to try the device in legacy interrupt mode, if its not already in that mode.

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation

^ permalink raw reply

* Re: interrupt coalescing and CSUM offload
From: Stephen Hemminger @ 2012-09-14 16:32 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev
In-Reply-To: <OF3E0CDC2E.8FFEBDC0-ONC1257A79.004A2EB6-C1257A79.00502122@transmode.se>

On Fri, 14 Sep 2012 16:35:13 +0200
Joakim Tjernlund <joakim.tjernlund@transmode.se> wrote:

> 
> I am adding interrupt coalescing to the ucc_geth driver. Unfortunately
> there is only support for RX interrupt coalescing.
> I wonder if there is any way "simulate" TX interrupt coalescing?
> 
> I am also looking at adding HWCSUM support but this device can only do
> IP header CSUM offload. This doesn't seem to be an option in Linux?
> As I understand it, one must do CSUM offload for the whole frame, both
> IP header and TCP/UDP csums?
> 
>  Jocke

There are a few drivers that turn off TX interrupt completely.
They cleanup TX buffers on next send and have a timer to cleanup
as well. This has performance benefits, but it does cause issues
with local flow control (the freeing of skb is used to rate
limit local traffic).

^ permalink raw reply

* [PATCH] netxen: check for root bus in netxen_mask_aer_correctable
From: nikolay @ 2012-09-14 15:50 UTC (permalink / raw)
  To: sony.chacko; +Cc: rajesh.borundia, netdev

From: Nikolay Aleksandrov <nikolay@redhat.com>

Add a check if pdev->bus->self == NULL (root bus). When attaching
a netxen NIC to a VM it can be on the root bus and the guest would
crash in netxen_mask_aer_correctable() because of a NULL pointer
dereference if CONFIG_PCIEAER is present.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 342b3a7..e2a4858 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -1378,6 +1378,10 @@ static void netxen_mask_aer_correctable(struct netxen_adapter *adapter)
 	struct pci_dev *root = pdev->bus->self;
 	u32 aer_pos;
 
+	/* root bus? */
+	if (!root)
+		return;
+
 	if (adapter->ahw.board_type != NETXEN_BRDTYPE_P3_4_GB_MM &&
 		adapter->ahw.board_type != NETXEN_BRDTYPE_P3_10G_TP)
 		return;
-- 
1.7.11.4

^ permalink raw reply related

* Re: NULL deref in bnx2 / crashes ? ( was: netconsole leads to stalled CPU task )
From: Sylvain Munaut @ 2012-09-14 15:36 UTC (permalink / raw)
  To: Cong Wang; +Cc: Eric Dumazet, netdev
In-Reply-To: <50532FA1.3070706@gmail.com>

Hi,

> Anyway, I think Eric is right, the bug may be in other place. I am wondering
> if the attached patch could help? It seems in netpoll tx path, we miss the
> chance of calling ->ndo_select_queue().
>
> Please give it a try.

This fixes the NULL deref as well. I applied your patch over vanilla
3.6-rc5 (so without eric's patch) and I can load netconsole without it
crashing directly.

I will try to put a bit of load on the machine and see how it behaves,
if it solves the hung cpu tasks as well.

Cheers,

    Sylvain

^ permalink raw reply

* Re: [PATCH] cgroup: net_cls: Include missing header with sock_update_classid() definition
From: Daniel Wagner @ 2012-09-14 14:50 UTC (permalink / raw)
  To: sedat.dilek-Re5JQEeQqe8AvxtiuMwx3w
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-next-u79uwXL29TY76Z2rM5mHXA, Daniel Wagner, David S. Miller,
	Michael S. Tsirkin, Gao feng, Jamal Hadi Salim, Joe Perches,
	John Fastabend, Li Zefan, Neil Horman, Rick Jones,
	Stanislav Kinsbursky, Tejun Heo, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Stephen Rothwell
In-Reply-To: <CA+icZUUrdGgUM32r88EyRHEb3Q6bULqnk74PikQ8A9q25ZA3NA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Sedat,

On 14.09.2012 16:43, Sedat Dilek wrote:
> On Fri, Sep 14, 2012 at 4:33 PM, Daniel Wagner <wagi-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org> wrote:
>> From: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
>>
>> commit 1f66c0a8833c3974ab6b35edcf4f8585b2f94592
>> Author: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
>> Date:   Wed Sep 12 16:12:01 2012 +0200
>>
>>      cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
>>
>> Claimed that there was only net/socket.c depending on
>> sock_update_class(). That is not true drivers/net/tun.c needs to
>> include the cls_cgroup.h header too.
>>
>> Signed-off-by: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
>> Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
>> Cc: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> Cc: Jamal Hadi Salim <jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
>> Cc: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
>> Cc: John Fastabend <john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
>> Cc: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
>> Cc: Rick Jones <rick.jones2-VXdhtT5mjnY@public.gmane.org>
>> Cc: Stanislav Kinsbursky <skinsbursky-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>> Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> ---
>>
>> This version is on top of the latest cgroup for-3.7 branch.
>>
>
> Thanks for the quick fix.

No problem. I am still ashamed not finding it myself.

> Please honour Reported-by: Stephen Rothwell <sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org>.

Sure, will do next time.

> If this is the fix for the breakage in today's Linux-Next
> (next-20120914), please add a "-next" to the subject next time.

Correct, this one is for linux-next but I got the impression that cgroup 
for-3.7 branch was ignored because of this. Therefore I have send two 
versions. Hmm, I'll need to check the results of todays next tree.

thanks,
daniel

^ permalink raw reply

* Re: [PATCH] cgroup: net_cls: Include missing header with sock_update_classid() definition
From: Sedat Dilek @ 2012-09-14 14:43 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-next-u79uwXL29TY76Z2rM5mHXA, Daniel Wagner, David S. Miller,
	Michael S. Tsirkin, Gao feng, Jamal Hadi Salim, Joe Perches,
	John Fastabend, Li Zefan, Neil Horman, Rick Jones,
	Stanislav Kinsbursky, Tejun Heo, netdev-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Stephen Rothwell
In-Reply-To: <1347633184-26837-3-git-send-email-wagi-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org>

On Fri, Sep 14, 2012 at 4:33 PM, Daniel Wagner <wagi-kQCPcA+X3s7YtjvyW6yDsg@public.gmane.org> wrote:
> From: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
>
> commit 1f66c0a8833c3974ab6b35edcf4f8585b2f94592
> Author: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
> Date:   Wed Sep 12 16:12:01 2012 +0200
>
>     cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
>
> Claimed that there was only net/socket.c depending on
> sock_update_class(). That is not true drivers/net/tun.c needs to
> include the cls_cgroup.h header too.
>
> Signed-off-by: Daniel Wagner <daniel.wagner-98C5kh4wR6ohFhg+JK9F0w@public.gmane.org>
> Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> Cc: "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Cc: Jamal Hadi Salim <jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org>
> Cc: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
> Cc: John Fastabend <john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Cc: Neil Horman <nhorman-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
> Cc: Rick Jones <rick.jones2-VXdhtT5mjnY@public.gmane.org>
> Cc: Stanislav Kinsbursky <skinsbursky-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>
> This version is on top of the latest cgroup for-3.7 branch.
>

Thanks for the quick fix.
Please honour Reported-by: Stephen Rothwell <sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org>.
If this is the fix for the breakage in today's Linux-Next
(next-20120914), please add a "-next" to the subject next time.

- Sedat -

>  drivers/net/tun.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 3a16d4f..9336b82 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -68,6 +68,7 @@
>  #include <net/netns/generic.h>
>  #include <net/rtnetlink.h>
>  #include <net/sock.h>
> +#include <net/cls_cgroup.h>
>
>  #include <asm/uaccess.h>
>
> --
> 1.7.12.315.g682ce8b
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-next" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* interrupt coalescing and CSUM offload
From: Joakim Tjernlund @ 2012-09-14 14:35 UTC (permalink / raw)
  To: netdev


I am adding interrupt coalescing to the ucc_geth driver. Unfortunately
there is only support for RX interrupt coalescing.
I wonder if there is any way "simulate" TX interrupt coalescing?

I am also looking at adding HWCSUM support but this device can only do
IP header CSUM offload. This doesn't seem to be an option in Linux?
As I understand it, one must do CSUM offload for the whole frame, both
IP header and TCP/UDP csums?

 Jocke

^ permalink raw reply

* [PATCH] cgroup: net_cls: Include missing header with sock_update_classid() definition
From: Daniel Wagner @ 2012-09-14 14:33 UTC (permalink / raw)
  To: linux-kernel, linux-next
  Cc: Daniel Wagner, David S. Miller, Michael S. Tsirkin, Gao feng,
	Jamal Hadi Salim, Joe Perches, John Fastabend, Li Zefan,
	Neil Horman, Rick Jones, Stanislav Kinsbursky, Tejun Heo, netdev,
	cgroups
In-Reply-To: <1347633184-26837-1-git-send-email-wagi@monom.org>

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

commit 1f66c0a8833c3974ab6b35edcf4f8585b2f94592
Author: Daniel Wagner <daniel.wagner@bmw-carit.de>
Date:   Wed Sep 12 16:12:01 2012 +0200

    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h

Claimed that there was only net/socket.c depending on
sock_update_class(). That is not true drivers/net/tun.c needs to
include the cls_cgroup.h header too.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
---

This version is on top of the latest cgroup for-3.7 branch.

 drivers/net/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3a16d4f..9336b82 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -68,6 +68,7 @@
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/cls_cgroup.h>
 
 #include <asm/uaccess.h>
 
-- 
1.7.12.315.g682ce8b

^ permalink raw reply related

* [PATCH v6] cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
From: Daniel Wagner @ 2012-09-14 14:33 UTC (permalink / raw)
  To: linux-kernel, linux-next
  Cc: Daniel Wagner, Gao feng, Jamal Hadi Salim, John Fastabend,
	Neil Horman, Stephen Rothwell, netdev, cgroups
In-Reply-To: <1347633184-26837-1-git-send-email-wagi@monom.org>

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

The two users of sock_update_classid() are net/socket.c and
drivers/net/tun.c. socket.c includes cls_cgroup.h already. Update
tun.c to inlcude the header.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
---

This patch is an updated version of the faulty one. 

 drivers/net/tun.c        | 1 +
 include/net/cls_cgroup.h | 6 ++++++
 include/net/sock.h       | 8 --------
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3a16d4f..9336b82 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -68,6 +68,7 @@
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/cls_cgroup.h>
 
 #include <asm/uaccess.h>
 
diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index a4dc5b0..e88527a 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -24,6 +24,8 @@ struct cgroup_cls_state
 	u32 classid;
 };
 
+extern void sock_update_classid(struct sock *sk);
+
 #ifdef CONFIG_NET_CLS_CGROUP
 static inline u32 task_cls_classid(struct task_struct *p)
 {
@@ -62,6 +64,10 @@ static inline u32 task_cls_classid(struct task_struct *p)
 }
 #endif
 #else
+static inline void sock_update_classid(struct sock *sk)
+{
+}
+
 static inline u32 task_cls_classid(struct task_struct *p)
 {
 	return 0;
diff --git a/include/net/sock.h b/include/net/sock.h
index 72132ae..160a680 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1486,14 +1486,6 @@ extern void *sock_kmalloc(struct sock *sk, int size,
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
 extern void sk_send_sigurg(struct sock *sk);
 
-#ifdef CONFIG_CGROUPS
-extern void sock_update_classid(struct sock *sk);
-#else
-static inline void sock_update_classid(struct sock *sk)
-{
-}
-#endif
-
 /*
  * Functions to fill in entries in struct proto_ops when a protocol
  * does not implement a particular function.
-- 
1.7.12.315.g682ce8b

^ permalink raw reply related

* [PATCH] Fix for "cgroup: Assign subsystem IDs during compile time"
From: Daniel Wagner @ 2012-09-14 14:33 UTC (permalink / raw)
  To: linux-kernel, linux-next
  Cc: Daniel Wagner, David S. Miller, Michael S. Tsirkin, Gao feng,
	Jamal Hadi Salim, Joe Perches, John Fastabend, Li Zefan,
	Neil Horman, Rick Jones, Stanislav Kinsbursky, Tejun Heo, netdev,
	cgroups

From: Daniel Wagner <daniel.wagner@bmw-carit.de>

Hi, 

Sorry for breaking Tejun's cgroup for-3.7 branch. Here are two patches
which fix this problem. Either of them will do. 

cheers,
daniel

Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

-- 
1.7.12.315.g682ce8b

^ permalink raw reply

* Re: [PATCH] Xen backend support for paged out grant targets.
From: Andres Lagar-Cavilla @ 2012-09-14 14:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Ian Campbell, Andres Lagar-Cavilla, xen-devel@xen.lists.org,
	David Vrabel, David Miller, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20120914124411.GB25249@phenom.dumpdata.com>


On Sep 14, 2012, at 8:44 AM, Konrad Rzeszutek Wilk wrote:

> On Fri, Sep 14, 2012 at 08:19:01AM +0100, Ian Campbell wrote:
>> On Thu, 2012-09-13 at 20:45 +0100, Andres Lagar-Cavilla wrote:
>>> On Sep 13, 2012, at 2:11 PM, Ian Campbell wrote:
>>> 
>>>> On Thu, 2012-09-13 at 18:28 +0100, Andres Lagar-Cavilla wrote:
>>>>> 
>>>>> * Add placeholder in array of grant table error descriptions for
>>>>> unrelated error code we jump over. 
>>>> 
>>>> Why not just define it, it's listed here:
>>>> http://xenbits.xen.org/docs/unstable/hypercall/include,public,grant_table.h.html#Enum_grant_status
>>> Well, a) we'd be defining something no one will be using (for the
>>> moment)
>> 
>> Even if no one in the kernel is using it, having "placeholder" as an
>> entry in GNTTABOP_error_msgs is just silly, even things which don't
>> understand GNTST_address_too_big directly could end up looking it up
>> here.
>> 
>>> b) I would be signing-off on something unrelated.
>> 
>> Lets take this patch instead then.
>> 
>> 8<------------------------------------------------
>> 
>>> From cb9daaf3029accb6d5fef58b450a625b27190429 Mon Sep 17 00:00:00 2001
>> From: Ian Campbell <ian.campbell@citrix.com>
>> Date: Fri, 14 Sep 2012 08:10:06 +0100
>> Subject: [PATCH] xen: resynchronise grant table status codes with upstream
>> 
>> Adds GNTST_address_too_big and GNTST_eagain.
>> 
>> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> 
> applied.
Thanks. Rebased on top of this and just sent.
Andres
> 
>> ---
>> include/xen/interface/grant_table.h |    8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/xen/interface/grant_table.h b/include/xen/interface/grant_table.h
>> index a17d844..84a8fbf 100644
>> --- a/include/xen/interface/grant_table.h
>> +++ b/include/xen/interface/grant_table.h
>> @@ -519,7 +519,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_get_version);
>> #define GNTST_no_device_space  (-7) /* Out of space in I/O MMU.              */
>> #define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
>> #define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
>> -#define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary */
>> +#define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary.   */
>> +#define GNTST_address_too_big (-11) /* transfer page address too large.      */
>> +#define GNTST_eagain          (-12) /* Operation not done; try again.        */
>> 
>> #define GNTTABOP_error_msgs {                   \
>>     "okay",                                     \
>> @@ -532,7 +534,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_get_version);
>>     "no spare translation slot in the I/O MMU", \
>>     "permission denied",                        \
>>     "bad page",                                 \
>> -    "copy arguments cross page boundary"        \
>> +    "copy arguments cross page boundary",       \
>> +    "page address size too large",              \
>> +    "operation not done; try again"             \
>> }
>> 
>> #endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
>> -- 
>> 1.7.10.4
>> 
>> 

^ permalink raw reply

* [PATCH] Xen backend support for paged out grant targets V4.
From: Andres Lagar-Cavilla @ 2012-09-14 14:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Ian Campbell, David Vrabel, David Miller,
	linux-kernel, netdev, Andres Lagar-Cavilla

Since Xen-4.2, hvm domains may have portions of their memory paged out. When a
foreign domain (such as dom0) attempts to map these frames, the map will
initially fail. The hypervisor returns a suitable errno, and kicks an
asynchronous page-in operation carried out by a helper. The foreign domain is
expected to retry the mapping operation until it eventually succeeds. The
foreign domain is not put to sleep because itself could be the one running the
pager assist (typical scenario for dom0).

This patch adds support for this mechanism for backend drivers using grant
mapping and copying operations. Specifically, this covers the blkback and
gntdev drivers (which map foregin grants), and the netback driver (which copies
foreign grants).

* Add a retry method for grants that fail with GNTST_eagain (i.e. because the
  target foregin frame is paged out).
* Insert hooks with appropriate wrappers in the aforementioned drivers.

The retry loop is only invoked if the grant operation status is GNTST_eagain.
It guarantees to leave a new status code different from GNTST_eagain. Any other
status code results in identical code execution as before.

The retry loop performs 256 attempts with increasing time intervals through a
32 second period. It uses msleep to yield while waiting for the next retry.

V2 after feedback from David Vrabel:
* Explicit MAX_DELAY instead of wrap-around delay into zero
* Abstract GNTST_eagain check into core grant table code for netback module.

V3 after feedback from Ian Campbell:
* Add placeholder in array of grant table error descriptions for unrelated
  error code we jump over.
* Eliminate single map and retry macro in favor of a generic batch flavor.
* Some renaming.
* Bury most implementation in grant_table.c, cleaner interface.

V4 rebased on top of sync of Xen grant table interface headers.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
---
 drivers/net/xen-netback/netback.c  |   11 ++------
 drivers/xen/grant-table.c          |   53 ++++++++++++++++++++++++++++++++++++
 drivers/xen/xenbus/xenbus_client.c |    6 ++--
 include/xen/grant_table.h          |   12 ++++++++
 4 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 682633b..05593d8 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -635,9 +635,7 @@ static void xen_netbk_rx_action(struct xen_netbk *netbk)
 		return;
 
 	BUG_ON(npo.copy_prod > ARRAY_SIZE(netbk->grant_copy_op));
-	ret = HYPERVISOR_grant_table_op(GNTTABOP_copy, &netbk->grant_copy_op,
-					npo.copy_prod);
-	BUG_ON(ret != 0);
+	gnttab_batch_copy(netbk->grant_copy_op, npo.copy_prod);
 
 	while ((skb = __skb_dequeue(&rxq)) != NULL) {
 		sco = (struct skb_cb_overlay *)skb->cb;
@@ -1460,18 +1458,15 @@ static void xen_netbk_tx_submit(struct xen_netbk *netbk)
 static void xen_netbk_tx_action(struct xen_netbk *netbk)
 {
 	unsigned nr_gops;
-	int ret;
 
 	nr_gops = xen_netbk_tx_build_gops(netbk);
 
 	if (nr_gops == 0)
 		return;
-	ret = HYPERVISOR_grant_table_op(GNTTABOP_copy,
-					netbk->tx_copy_ops, nr_gops);
-	BUG_ON(ret);
 
-	xen_netbk_tx_submit(netbk);
+	gnttab_batch_copy(netbk->tx_copy_ops, nr_gops);
 
+	xen_netbk_tx_submit(netbk);
 }
 
 static void xen_netbk_idx_release(struct xen_netbk *netbk, u16 pending_idx)
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index eea81cf..f5681c8 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -38,6 +38,7 @@
 #include <linux/vmalloc.h>
 #include <linux/uaccess.h>
 #include <linux/io.h>
+#include <linux/delay.h>
 #include <linux/hardirq.h>
 
 #include <xen/xen.h>
@@ -823,6 +824,52 @@ unsigned int gnttab_max_grant_frames(void)
 }
 EXPORT_SYMBOL_GPL(gnttab_max_grant_frames);
 
+/* Handling of paged out grant targets (GNTST_eagain) */
+#define MAX_DELAY 256
+static inline void
+gnttab_retry_eagain_gop(unsigned int cmd, void *gop, int16_t *status,
+						const char *func)
+{
+	unsigned delay = 1;
+
+	do {
+		BUG_ON(HYPERVISOR_grant_table_op(cmd, gop, 1));
+		if (*status == GNTST_eagain)
+			msleep(delay++);
+	} while ((*status == GNTST_eagain) && (delay < MAX_DELAY));
+
+	if (delay >= MAX_DELAY) {
+		printk(KERN_ERR "%s: %s eagain grant\n", func, current->comm);
+		*status = GNTST_bad_page;
+	}
+}
+
+void gnttab_batch_map(struct gnttab_map_grant_ref *batch, unsigned count)
+{
+	struct gnttab_map_grant_ref *op;
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, batch, count))
+		BUG();
+	for (op = batch; op < batch + count; op++)
+		if (op->status == GNTST_eagain)
+			gnttab_retry_eagain_gop(GNTTABOP_map_grant_ref, op,
+									&op->status, __func__);
+}
+EXPORT_SYMBOL_GPL(gnttab_batch_map);
+
+void gnttab_batch_copy(struct gnttab_copy *batch, unsigned count)
+{
+	struct gnttab_copy *op;
+
+	if (HYPERVISOR_grant_table_op(GNTTABOP_copy, batch, count))
+		BUG();
+	for (op = batch; op < batch + count; op++)
+		if (op->status == GNTST_eagain)
+			gnttab_retry_eagain_gop(GNTTABOP_copy, op, &op->status,
+									__func__);
+}
+EXPORT_SYMBOL_GPL(gnttab_batch_copy);
+
 int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_grant_ref *kmap_ops,
 		    struct page **pages, unsigned int count)
@@ -836,6 +883,12 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 	if (ret)
 		return ret;
 
+	/* Retry eagain maps */
+	for (i = 0; i < count; i++)
+		if (map_ops[i].status == GNTST_eagain)
+			gnttab_retry_eagain_gop(GNTTABOP_map_grant_ref, map_ops + i,
+                                    &map_ops[i].status, __func__);
+
 	if (xen_feature(XENFEAT_auto_translated_physmap))
 		return ret;
 
diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index b3e146e..bcf3ba4 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -490,8 +490,7 @@ static int xenbus_map_ring_valloc_pv(struct xenbus_device *dev,
 
 	op.host_addr = arbitrary_virt_to_machine(pte).maddr;
 
-	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
-		BUG();
+	gnttab_batch_map(&op, 1);
 
 	if (op.status != GNTST_okay) {
 		free_vm_area(area);
@@ -572,8 +571,7 @@ int xenbus_map_ring(struct xenbus_device *dev, int gnt_ref,
 	gnttab_set_map_op(&op, (unsigned long)vaddr, GNTMAP_host_map, gnt_ref,
 			  dev->otherend_id);
 
-	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1))
-		BUG();
+	gnttab_batch_map(&op, 1);
 
 	if (op.status != GNTST_okay) {
 		xenbus_dev_fatal(dev, op.status,
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 11e27c3..da9386e 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -189,4 +189,16 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
 		      struct page **pages, unsigned int count, bool clear_pte);
 
+/* Perform a batch of grant map/copy operations. Retry every batch slot
+ * for which the hypervisor returns GNTST_eagain. This is typically due 
+ * to paged out target frames.
+ *
+ * Will retry for 1, 2, ... 255 ms, i.e. 256 times during 32 seconds.
+ *
+ * Return value in each iand every status field of the batch guaranteed
+ * to not be GNTST_eagain. 
+ */
+void gnttab_batch_map(struct gnttab_map_grant_ref *batch, unsigned count);
+void gnttab_batch_copy(struct gnttab_copy *batch, unsigned count);
+
 #endif /* __ASM_GNTTAB_H__ */
-- 
1.7.9.5

^ permalink raw reply related

* Re: tulip Ethernet driver messes up USB keyboard
From: Sergey Vlasov @ 2012-09-14 14:24 UTC (permalink / raw)
  To: Marti Raudsepp; +Cc: Kernel hackers, Linux USB list, Linux network list
In-Reply-To: <CABRT9RCWU1RqpAay9iWPKwduejvkdk_CA2P_gV70iiEE2FS5HQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]

On Fri, Sep 14, 2012 at 12:49:45PM +0300, Marti Raudsepp wrote:
> After installing an old 100Mbit PCI Ethernet card to my machine, it has
> complained a few times about spurious interrupts ("nobody cared") at a
> random time of the day. After the oops is reported, my USB keyboard (HP
> smart card keyboard) stops working properly -- it has lots of delay, it
> misses some keystrokes and also causes "stuck keys". Strangely enough my USB
> mouse continues to work without problems. Apparently the USB controller and
> the Ethernet card share an interrupt line. Un-/replugging the keyboard does
> not help.
> 
> So far I have simply rebooted to fix the issue. I haven't been able to
> reproduce this on will despite my best attempts.
> 
> The Ethernet card is driven by the tulip driver and is listed as "ADMtek
> NC100 Network Everywhere Fast Ethernet 10/100 (rev 11)" in lspci. My
> motherboard uses the Intel H77 chipset (ASRock H77 Pro4/MVP)

This board uses the ASMedia ASM1083 PCIe-to-PCI bridge (or maybe ASM1085 -
hard to see from photos available on the net, and the PCI IDs for these
chips are the same), which is known to have problems with interrupt
handling:

  https://lkml.org/lkml/2012/1/30/216

I experience the same problem with ASUS P8P67 LE (except in my case the
interrupt is not shared with anything, and the network card does not stop
working completely, just gives horrible performance with > 100ms pings).

Some people use "noirqdebug irqpoll" as a workaround:

  http://lkml.indiana.edu/hypermail/linux/kernel/1204.1/03451.html

But this workaround basically turns off the protection against stuck
interrupts in the kernel, so the system could potentially be busy handling
a stuck interrupt forever.

You can also try to move the PCI card to another slot, so that it would
use another interrupt, which might be not shared with USB or another
important device.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/1] Add support of ECMPv6
From: Nicolas Dichtel @ 2012-09-14 13:37 UTC (permalink / raw)
  To: Vincent Bernat; +Cc: yoshfuji, netdev, davem
In-Reply-To: <5053329F.6030109@6wind.com>

Le 14/09/2012 15:35, Nicolas Dichtel a écrit :
> Le 14/09/2012 11:40, Vincent Bernat a écrit :
>>   ❦ 14 septembre 2012 09:59 CEST, Nicolas Dichtel <nicolas.dichtel@6wind.com> :
>>
>>> Here is an example of a command to add an ECMP route:
>>> $ ip -6 route add 3ffe:304:124:2306::/64 \
>>>     nexthop via fe80::230:1bff:feb4:e05c dev eth0 weight 1 \
>>>     nexthop via fe80::230:1bff:feb4:dd4f dev eth0 weight 1
> In fact, I use this command as a shortcut. You can also use:
> $ ip -6 route add 3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:dd4f dev eth0
> $ ip -6 route append 3ffe:304:124:2406::/64 via fe80::230:1bff:feb4:dd4f dev eth0
Note also that with these commands, there is no need to patch iproute2.

^ permalink raw reply

* Re: [RFC PATCH net-next v2 0/1] Add support of ECMPv6
From: Nicolas Dichtel @ 2012-09-14 13:35 UTC (permalink / raw)
  To: Vincent Bernat; +Cc: yoshfuji, netdev, davem
In-Reply-To: <87392l7xjc.fsf@guybrush.luffy.cx>

Le 14/09/2012 11:40, Vincent Bernat a écrit :
>   ❦ 14 septembre 2012 09:59 CEST, Nicolas Dichtel <nicolas.dichtel@6wind.com> :
>
>> Here is an example of a command to add an ECMP route:
>> $ ip -6 route add 3ffe:304:124:2306::/64 \
>> 	nexthop via fe80::230:1bff:feb4:e05c dev eth0 weight 1 \
>> 	nexthop via fe80::230:1bff:feb4:dd4f dev eth0 weight 1
In fact, I use this command as a shortcut. You can also use:
$ ip -6 route add 3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:dd4f dev eth0
$ ip -6 route append 3ffe:304:124:2406::/64 via fe80::230:1bff:feb4:dd4f dev eth0

and these commands will output two standard netlink messages, without 'nexthop' 
lines.

>
> When displaying ECMP routes, the display is different than for IPv4: we
> get two distinct routes instead of an ECMP route (with nexthop
> keyword).
Sure, this implementation stores each 'nexthop' like a standard route in the 
kernel table.

>
> With IPv4:
>
> 193.252.X.X/26  proto zebra  metric 20
> 	nexthop via 193.252.X.X  dev bae1 weight 1
> 	nexthop via 193.252.X.X  dev bae2 weight 1
>
> With IPv6:
>
> 2a01:c9c0:X:X::/64 via fe80::215:17ff:fe85:76b9 dev bae1 metric 11
> 2a01:c9c0:X:X::/64 via fe80::222:91ff:fe4e:b000 dev bae2 metric 11
>
> If I capture the netlink message from the add command, put it in a file
> and use "ip monitor file ...", I see this:
>
> 2a01:c9c0:X:X::/64
> 	nexthop via fe80::215:17ff:fe85:76b9  dev if12 weight 1
> 	nexthop via fe80::222:91ff:fe4e:b000  dev if11 weight 1
>
> Therefore, the problem is not in iproute2 which knows how to display
> those ECMP routes. I fear that this difference make support in routing
> daemons more difficult.
Hmm, can you elaborate? Our routing daemon, quagga, manage it without any problem.

^ permalink raw reply

* Re: NULL deref in bnx2 / crashes ? ( was: netconsole leads to stalled CPU task )
From: Cong Wang @ 2012-09-14 13:22 UTC (permalink / raw)
  To: Sylvain Munaut; +Cc: Eric Dumazet, netdev
In-Reply-To: <CAF6-1L6RJkV6u4Rc2uDMQGUG23EGRKrdWSzTiHEYO9Z9ioGtaw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 904 bytes --]

On 09/14/2012 01:35 AM, Sylvain Munaut wrote:
> Hi,
>
>> Yes, but I have some worries of why it is needed.
>>
>> Isnt it covering a bug elsewhere ?
>
> That may very well be.
>
> Of the few test servers I have running the same kernel, I just found
> the one with netconsole active to be "stuck".
>
> Not frozen, but all user process are hanged up and it's spitting
> message about processes and CPU being "stuck". The trace is different
> in each case depending on what the process was actually doing at the
> time it got stuck.
>
> No message sent to the netconsole with the root cause and nothing was
> written in the logs ...
>

Yeah, in this case, kdump is your friend. :)

Anyway, I think Eric is right, the bug may be in other place. I am 
wondering if the attached patch could help? It seems in netpoll tx path, 
we miss the chance of calling ->ndo_select_queue().

Please give it a try.

Thanks!

[-- Attachment #2: netpoll-txq.diff --]
[-- Type: text/x-patch, Size: 1645 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ae3153c0..72661f6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1403,6 +1403,9 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 		f(dev, &dev->_tx[i], arg);
 }
 
+extern struct netdev_queue *netdev_pick_tx(struct net_device *dev,
+					   struct sk_buff *skb);
+
 /*
  * Net namespace inlines
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index b1e6d63..d76bf73 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2381,8 +2381,8 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
-static struct netdev_queue *dev_pick_tx(struct net_device *dev,
-					struct sk_buff *skb)
+struct netdev_queue *netdev_pick_tx(struct net_device *dev,
+				    struct sk_buff *skb)
 {
 	int queue_index;
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -2556,7 +2556,7 @@ int dev_queue_xmit(struct sk_buff *skb)
 
 	skb_update_prio(skb);
 
-	txq = dev_pick_tx(dev, skb);
+	txq = netdev_pick_tx(dev, skb);
 	q = rcu_dereference_bh(txq->qdisc);
 
 #ifdef CONFIG_NET_CLS_ACT
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index dd67818..77a0388 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -328,7 +328,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
 	if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
 		struct netdev_queue *txq;
 
-		txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
+		txq = netdev_pick_tx(dev, skb);
 
 		/* try until next clock tick */
 		for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;

^ permalink raw reply related

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Patrick McHardy @ 2012-09-14 13:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, Jesper Dangaard Brouer, netfilter-devel, netdev,
	yongjun_wei
In-Reply-To: <20120914120750.GA5764@1984>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2900 bytes --]

On Fri, 14 Sep 2012, Pablo Neira Ayuso wrote:

> On Wed, Sep 12, 2012 at 11:36:27PM +0200, Florian Westphal wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>
>> [ CC'd Patrick ]
>>
>>> I'm hitting this general protection fault, when unloading iptables_nat.
>>> [  524.591067] Pid: 5842, comm: modprobe Not tainted 3.6.0-rc3-pablo-nf-next+ #1 Red Hat KVM
>>> [  524.591067] RIP: 0010:[<ffffffffa002c2fd>]  [<ffffffffa002c2fd>] nf_nat_proto_clean+0x6d/0xc0 [nf_nat]
>>> [  524.591067] RSP: 0018:ffff880073203e18  EFLAGS: 00010246
>>> [  524.591067] RAX: 0000000000000000 RBX: ffff880077dff2c8 RCX: ffff8800797fab70
>>> [  524.591067] RDX: dead000000200200 RSI: ffff880073203e88 RDI: ffffffffa002f208
>>> [  524.591067] RBP: ffff880073203e28 R08: ffff880073202000 R09: 0000000000000000
>>> [  524.591067] R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
>>>  list corruption?   ^^^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^
>>
>> Yep, looks like it.
>>
>>> [  524.591067]  [<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
>>> [  524.591067]  [<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
>>> [  524.591067]  [<ffffffffa002c54a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
>>> [  524.591067]  [<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
>>> [  524.591067]  [<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
>>
>> On module removal nf_nat_ipv4 calls nf_iterate_cleanup which invokes
>> nf_nat_proto_clean() for each conntrack.  That will then call
>> hlist_del_rcu(&nat->bysource) using eachs conntracks nat ext area.
>>
>> Problem is that nf_nat_proto_clean() is called multiple times for the same
>> conntrack:
>> a) nf_ct_iterate_cleanup() returns each ct twice (origin, reply)
>> b) we call it both for l3 and for l4 protocol ids
>>
>> We barf in hlist_del_rcu the 2nd time because ->pprev is poisoned.
>>
>> This was introduced with the ipv6 nat patches.
>>
>> --- a/net/netfilter/nf_nat_core.c
>> +++ b/net/netfilter/nf_nat_core.c
>> @@ -487,7 +487,7 @@ static int nf_nat_proto_clean(struct nf_conn *i, void *data)
>>
>>         if (clean->hash) {
>>                 spin_lock_bh(&nf_nat_lock);
>> -               hlist_del_rcu(&nat->bysource);
>> +               hlist_del_init_rcu(&nat->bysource);
>>                 spin_unlock_bh(&nf_nat_lock);
>>         } else {
>>
>> Would probably avoid it.  I guess it would be nicer to only call this
>> once for each ct.
>>
>> Patrick, any other idea?
>
> I already discussed this with Florian (I've been having problems with
> two out of three of my email accounts this week... so I couldn't reply
> to this email in the mailing list).
>
> We can add nf_nat_iterate_cleanup that can iterate over the NAT
> hashtable to replace current usage of nf_ct_iterate_cleanup.

Lets just bail out when IPS_SRC_NAT_DONE is not set, that should also fix
it. Could you try this patch please?

[-- Attachment #2: Type: TEXT/PLAIN, Size: 480 bytes --]

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 29d4452..8b5d220 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -481,6 +481,8 @@ static int nf_nat_proto_clean(struct nf_conn *i, void *data)
 
 	if (!nat)
 		return 0;
+	if (!(i->status & IPS_SRC_NAT_DONE))
+		return 0;
 	if ((clean->l3proto && nf_ct_l3num(i) != clean->l3proto) ||
 	    (clean->l4proto && nf_ct_protonum(i) != clean->l4proto))
 		return 0;

^ permalink raw reply related

* RE: [PATCH] sch_red: fix weighted average calculation
From: Eric Dumazet @ 2012-09-14 13:13 UTC (permalink / raw)
  To: Dowdal, John
  Cc: Chemparathy, Cyril, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, davem@davemloft.net,
	david.ward@ll.mit.edu, paul.gortmaker@windriver.com
In-Reply-To: <2D29FFBF554947468515B8EF6C733BB72E564F65@DLEE09.ent.ti.com>

On Fri, 2012-09-14 at 13:01 +0000, Dowdal, John wrote:
> Eric, thank you for reviewing the code.  I now see the problem with
> the patch since backlog is an integer and qavg is a fixed point number
> at logW.  
> 
> We are considering another patch to update the comments to this code
> (with the actual C code change reverted) to stop the FPP by showing
> the derivation of the equation in the comments.  Does this sound good?

This would be good, please do so.

Thanks

^ permalink raw reply

* RE: [PATCH] sch_red: fix weighted average calculation
From: Dowdal, John @ 2012-09-14 13:01 UTC (permalink / raw)
  To: Eric Dumazet, Chemparathy, Cyril
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	davem@davemloft.net, david.ward@ll.mit.edu,
	paul.gortmaker@windriver.com
In-Reply-To: <1347544431.13103.1557.camel@edumazet-glaptop>

Eric, thank you for reviewing the code.  I now see the problem with the patch since backlog is an integer and qavg is a fixed point number at logW.  

We are considering another patch to update the comments to this code (with the actual C code change reverted) to stop the FPP by showing the derivation of the equation in the comments.  Does this sound good?


-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Thursday, September 13, 2012 9:54 AM
To: Chemparathy, Cyril
Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; davem@davemloft.net; david.ward@ll.mit.edu; Dowdal, John; paul.gortmaker@windriver.com
Subject: Re: [PATCH] sch_red: fix weighted average calculation

On Thu, 2012-09-13 at 09:43 -0400, Cyril Chemparathy wrote:
> This patch fixes an apparent bug in the running weighted average calculation
> used in the RED algorithm.
> 
> Going by the described formula:
> 	   qavg = qavg*(1-W) + backlog*W
> 	=> qavg = qavg + (backlog - qavg) * W
> 
> ... with W converted to a pre-calculated shift, this then becomes:
> 	qavg = qavg + (backlog - qavg) >> logW
> 
> ... giving the modified expression introduced by this patch.
> 
> Signed-off-by: John Dowdal <jdowdal@ti.com>
> ---
>  include/net/red.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/red.h b/include/net/red.h
> index ef46058..05960a4 100644
> --- a/include/net/red.h
> +++ b/include/net/red.h
> @@ -287,7 +287,7 @@ static inline unsigned long red_calc_qavg_no_idle_time(const struct red_parms *p
>  	 *
>  	 * --ANK (980924)
>  	 */
> -	return v->qavg + (backlog - (v->qavg >> p->Wlog));
> +	return v->qavg + (backlog - v->qavg) >> p->Wlog;
>  }
>  
>  static inline unsigned long red_calc_qavg(const struct red_parms *p,

This is going to be a FPP (Frequently Posted Patch)

Current formulae is fine.

Thats because backlog, at start of red_calc_qavg_no_idle_time() is not
yet scaled by p->Wlog. v->avg is scaled, but not backlog.

Have you tested RED after your patch ?



^ permalink raw reply

* Re: [PATCH] netfilter: Allow xt_nat.c and x_tables.c to compiled in
From: Pablo Neira Ayuso @ 2012-09-14 12:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Eric W. Biederman, David Miller, netdev,
	netfilter-devel
In-Reply-To: <1347625514.26523.2.camel@edumazet-glaptop>

On Fri, Sep 14, 2012 at 02:25:14PM +0200, Eric Dumazet wrote:
> On Fri, 2012-09-14 at 14:08 +0200, Pablo Neira Ayuso wrote:
> > On Fri, Sep 14, 2012 at 01:54:22PM +0200, Patrick McHardy wrote:
> > > On Thu, 13 Sep 2012, Eric W. Biederman wrote:
> > > 
> > > >xt_init in x_tables.c must be called before xt_nat_init in xt_nat.c
> > > >Reorder the makefile so that x_tables.o comes before xt_nat.o in
> > > >netfilter.o.
> > > >
> > > >This allows me to built a kernel with both of these modules compiled in.
> > > 
> > > Thanks, we've already fixed that, the patch is queued in Pablo's tree.
> > 
> > It should hit Linus tree soon, David pulled the fix yesterday.
> > --
> 
> Little correction :
> 
> Its in net-next, Linus tree doesnt need this fix yet.
> 
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commitdiff;h=00545bec9412d130c77f72a08d6c8b6ad21d4a1e

You're right, thanks for the clarification.

^ permalink raw reply

* Re: [PATCH] Xen backend support for paged out grant targets.
From: Konrad Rzeszutek Wilk @ 2012-09-14 12:44 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Andres Lagar-Cavilla, Andres Lagar-Cavilla,
	xen-devel@xen.lists.org, David Vrabel, David Miller,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1347607141.25803.73.camel@dagon.hellion.org.uk>

On Fri, Sep 14, 2012 at 08:19:01AM +0100, Ian Campbell wrote:
> On Thu, 2012-09-13 at 20:45 +0100, Andres Lagar-Cavilla wrote:
> > On Sep 13, 2012, at 2:11 PM, Ian Campbell wrote:
> > 
> > > On Thu, 2012-09-13 at 18:28 +0100, Andres Lagar-Cavilla wrote:
> > >> 
> > >> * Add placeholder in array of grant table error descriptions for
> > >> unrelated error code we jump over. 
> > > 
> > > Why not just define it, it's listed here:
> > > http://xenbits.xen.org/docs/unstable/hypercall/include,public,grant_table.h.html#Enum_grant_status
> > Well, a) we'd be defining something no one will be using (for the
> > moment)
> 
> Even if no one in the kernel is using it, having "placeholder" as an
> entry in GNTTABOP_error_msgs is just silly, even things which don't
> understand GNTST_address_too_big directly could end up looking it up
> here.
> 
> >  b) I would be signing-off on something unrelated.
> 
> Lets take this patch instead then.
> 
> 8<------------------------------------------------
> 
> >From cb9daaf3029accb6d5fef58b450a625b27190429 Mon Sep 17 00:00:00 2001
> From: Ian Campbell <ian.campbell@citrix.com>
> Date: Fri, 14 Sep 2012 08:10:06 +0100
> Subject: [PATCH] xen: resynchronise grant table status codes with upstream
> 
> Adds GNTST_address_too_big and GNTST_eagain.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

applied.

> ---
>  include/xen/interface/grant_table.h |    8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/xen/interface/grant_table.h b/include/xen/interface/grant_table.h
> index a17d844..84a8fbf 100644
> --- a/include/xen/interface/grant_table.h
> +++ b/include/xen/interface/grant_table.h
> @@ -519,7 +519,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_get_version);
>  #define GNTST_no_device_space  (-7) /* Out of space in I/O MMU.              */
>  #define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
>  #define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
> -#define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary */
> +#define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary.   */
> +#define GNTST_address_too_big (-11) /* transfer page address too large.      */
> +#define GNTST_eagain          (-12) /* Operation not done; try again.        */
>  
>  #define GNTTABOP_error_msgs {                   \
>      "okay",                                     \
> @@ -532,7 +534,9 @@ DEFINE_GUEST_HANDLE_STRUCT(gnttab_get_version);
>      "no spare translation slot in the I/O MMU", \
>      "permission denied",                        \
>      "bad page",                                 \
> -    "copy arguments cross page boundary"        \
> +    "copy arguments cross page boundary",       \
> +    "page address size too large",              \
> +    "operation not done; try again"             \
>  }
>  
>  #endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
> -- 
> 1.7.10.4
> 
> 

^ permalink raw reply

* Re: [PATCH] netfilter: Allow xt_nat.c and x_tables.c to compiled in
From: Eric Dumazet @ 2012-09-14 12:25 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Patrick McHardy, Eric W. Biederman, David Miller, netdev,
	netfilter-devel
In-Reply-To: <20120914120853.GA6056@1984>

On Fri, 2012-09-14 at 14:08 +0200, Pablo Neira Ayuso wrote:
> On Fri, Sep 14, 2012 at 01:54:22PM +0200, Patrick McHardy wrote:
> > On Thu, 13 Sep 2012, Eric W. Biederman wrote:
> > 
> > >xt_init in x_tables.c must be called before xt_nat_init in xt_nat.c
> > >Reorder the makefile so that x_tables.o comes before xt_nat.o in
> > >netfilter.o.
> > >
> > >This allows me to built a kernel with both of these modules compiled in.
> > 
> > Thanks, we've already fixed that, the patch is queued in Pablo's tree.
> 
> It should hit Linus tree soon, David pulled the fix yesterday.
> --

Little correction :

Its in net-next, Linus tree doesnt need this fix yet.

http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=commitdiff;h=00545bec9412d130c77f72a08d6c8b6ad21d4a1e




^ permalink raw reply

* Re: [PATCH] netfilter: Allow xt_nat.c and x_tables.c to compiled in
From: Pablo Neira Ayuso @ 2012-09-14 12:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Eric W. Biederman, David Miller, netdev, netfilter-devel
In-Reply-To: <Pine.GSO.4.63.1209141353470.11622@stinky-local.trash.net>

On Fri, Sep 14, 2012 at 01:54:22PM +0200, Patrick McHardy wrote:
> On Thu, 13 Sep 2012, Eric W. Biederman wrote:
> 
> >xt_init in x_tables.c must be called before xt_nat_init in xt_nat.c
> >Reorder the makefile so that x_tables.o comes before xt_nat.o in
> >netfilter.o.
> >
> >This allows me to built a kernel with both of these modules compiled in.
> 
> Thanks, we've already fixed that, the patch is queued in Pablo's tree.

It should hit Linus tree soon, David pulled the fix yesterday.

^ permalink raw reply

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Pablo Neira Ayuso @ 2012-09-14 12:07 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Jesper Dangaard Brouer, netfilter-devel, netdev, yongjun_wei,
	kaber
In-Reply-To: <20120912213627.GJ14750@breakpoint.cc>

On Wed, Sep 12, 2012 at 11:36:27PM +0200, Florian Westphal wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> [ CC'd Patrick ]
> 
> > I'm hitting this general protection fault, when unloading iptables_nat.
> > [  524.591067] Pid: 5842, comm: modprobe Not tainted 3.6.0-rc3-pablo-nf-next+ #1 Red Hat KVM
> > [  524.591067] RIP: 0010:[<ffffffffa002c2fd>]  [<ffffffffa002c2fd>] nf_nat_proto_clean+0x6d/0xc0 [nf_nat]
> > [  524.591067] RSP: 0018:ffff880073203e18  EFLAGS: 00010246
> > [  524.591067] RAX: 0000000000000000 RBX: ffff880077dff2c8 RCX: ffff8800797fab70
> > [  524.591067] RDX: dead000000200200 RSI: ffff880073203e88 RDI: ffffffffa002f208
> > [  524.591067] RBP: ffff880073203e28 R08: ffff880073202000 R09: 0000000000000000
> > [  524.591067] R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
> >  list corruption?   ^^^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^
> 
> Yep, looks like it.
> 
> > [  524.591067]  [<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
> > [  524.591067]  [<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
> > [  524.591067]  [<ffffffffa002c54a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
> > [  524.591067]  [<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
> > [  524.591067]  [<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
> 
> On module removal nf_nat_ipv4 calls nf_iterate_cleanup which invokes
> nf_nat_proto_clean() for each conntrack.  That will then call
> hlist_del_rcu(&nat->bysource) using eachs conntracks nat ext area.
> 
> Problem is that nf_nat_proto_clean() is called multiple times for the same
> conntrack:
> a) nf_ct_iterate_cleanup() returns each ct twice (origin, reply)
> b) we call it both for l3 and for l4 protocol ids
> 
> We barf in hlist_del_rcu the 2nd time because ->pprev is poisoned.
> 
> This was introduced with the ipv6 nat patches.
> 
> --- a/net/netfilter/nf_nat_core.c
> +++ b/net/netfilter/nf_nat_core.c
> @@ -487,7 +487,7 @@ static int nf_nat_proto_clean(struct nf_conn *i, void *data)
>  
>         if (clean->hash) {
>                 spin_lock_bh(&nf_nat_lock);
> -               hlist_del_rcu(&nat->bysource);
> +               hlist_del_init_rcu(&nat->bysource);
>                 spin_unlock_bh(&nf_nat_lock);
>         } else {
> 
> Would probably avoid it.  I guess it would be nicer to only call this
> once for each ct.
> 
> Patrick, any other idea?

I already discussed this with Florian (I've been having problems with
two out of three of my email accounts this week... so I couldn't reply
to this email in the mailing list).

We can add nf_nat_iterate_cleanup that can iterate over the NAT
hashtable to replace current usage of nf_ct_iterate_cleanup.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox