Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] iptables: xt_recent: Add optional mask option for xt_recent
From: Denys Fedoryshchenko @ 2012-07-14 17:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, Linux netdev
In-Reply-To: <20120714150527.GA6271@1984>

Hi Pablo

On 2012-07-14 18:05, Pablo Neira Ayuso wrote:
> Hi Denys,
>
> On Thu, May 17, 2012 at 11:08:57PM +0300, Denys Fedoryshchenko wrote:
>> Use case for this feature:
>> 1)In some occasions if you need to allow,block,match specific 
>> subnet.
>> 2)I can use recent as a trigger when netfilter rule matches, with 
>> mask 0.0.0.0
>>
>> Tested for backward compatibility:
>> )old (userspace) iptables, new kernel
>> )old kernel, new iptables
>> )new kernel, new iptables
>
> I've cleaned up this patch:
>
> 
> http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=commit;h=73bf03981dfaee48ac1d6da380d46501a96cc83e
>
> It's not yet in master. Please, check that this is correct.
>
> BTW, the man page update is still missing.

Thank you, everything looks ok, i will try to diff(to learn, not repeat 
coding style mistakes in next submissions) and test. Since i'm in desert 
now, it will take few days.
I will prepare also man page patch for iptables.

---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

^ permalink raw reply

* Re: Crash in CIPSO_V4_TAG_LOCAL handling
From: Lin Ming @ 2012-07-14 17:22 UTC (permalink / raw)
  To: Alan Cox, Paul Moore; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 4634 bytes --]

On Sat, Jul 14, 2012 at 1:08 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> DaveM asked that this goes direct to the list so ...
>
> Looking at some static code analyser output and the results seem valid on
> this one:
>
> ip_options_compile called with skb = NULL and an IPOPT_CIPSO option calls
> into cipso_v4_validate, which if the option is CIPSO_V4_TAG_LOCAL then
> dereferences skb->dev and explodes.

Hi Alan,

I can trigger this oops with the attached simple program(cipso_test.c).

# netlabelctl cipsov4 add local doi:3
# ./cipso_test

It's caused by below code added in commit 15c45f7b.

                case CIPSO_V4_TAG_LOCAL:
                        /* This is a non-standard tag that we only allow for
                         * local connections, so if the incoming interface is
                         * not the loopback device drop the packet. */
                        if (!(skb->dev->flags & IFF_LOOPBACK)) {
                                err_offset = opt_iter;
                                goto validate_return_locked;
                        }

But I don't know how to check if it is loopback device
just after socket() syscall.

sockfd = socket(AF_INET, SOCK_DGRAM, 0);
setsockopt(sockfd, SOL_IP, IP_OPTIONS, &cipso, sizeof(struct cipso));

Obviously, at this time net_device is not assigned yet.
Any idea?

[  346.396250] BUG: unable to handle kernel NULL pointer dereference at 00000014
[  346.400024] IP: [<c171041b>] cipso_v4_validate+0x323/0x363
[  346.400024] *pdpt = 000000001efe4001 *pde = 0000000000000000 
[  346.400024] Oops: 0000 [#1] SMP 
[  346.400024] Modules linked in:
[  346.400024] 
[  346.400024] Pid: 2471, comm: cipso_test Not tainted 3.5.0-rc5-01438-gd322ef7 #28 Bochs Bochs
[  346.400024] EIP: 0060:[<c171041b>] EFLAGS: 00010246 CPU: 0
[  346.400024] EIP is at cipso_v4_validate+0x323/0x363
[  346.400024] EAX: 00000006 EBX: de77d4de ECX: 00000006 EDX: de77d380
[  346.400024] ESI: 00000000 EDI: 00000006 EBP: df61dcb8 ESP: df61dc5c
[  346.400024]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  346.400024] CR0: 8005003b CR2: 00000014 CR3: 1efac000 CR4: 000006b0
[  346.400024] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  346.400024] DR6: ffff0ff0 DR7: 00000400
[  346.400024] Process cipso_test (pid: 2471, ti=df61c000 task=df7f0530 task.ti=df61c000)
[  346.400024] Stack:
[  346.400024]  000000d0 00000040 df404600 de7debe0 0001cd2c df61dc8c 0000000c 0000001c
[  346.400024]  00000000 00000001 c1b2606c df61dc98 c12dfa20 df61dcf4 0c7f0530 de77d4d8
[  346.400024]  c106c380 06f73800 00000006 de77d398 de77d4c8 00000000 0000000c df61dd04
[  346.400024] Call Trace:
[  346.400024]  [<c12dfa20>] ? security_capable+0x1c/0x20
[  346.400024]  [<c106c380>] ? sys_sysctl+0x110/0x140
[  346.400024]  [<c16d2370>] ip_options_compile+0x3c4/0x423
[  346.400024]  [<c10f0999>] ? unlock_page+0x46/0x49
[  346.400024]  [<c110b5ef>] ? __do_fault+0x460/0x474
[  346.400024]  [<c16d1d57>] ? ip_options_get_alloc+0x1b/0x1d
[  346.400024]  [<c16d2401>] ip_options_get_finish+0x32/0x53
[  346.400024]  [<c16d24cf>] ip_options_get_from_user+0x5d/0x65
[  346.400024]  [<c16d5453>] do_ip_setsockopt+0xe3/0xa97
[  346.400024]  [<c105de56>] ? kmap_atomic_prot+0x110/0x112
[  346.400024]  [<c10f2f12>] ? zone_watermark_ok+0x30/0x37
[  346.400024]  [<c10f58e9>] ? get_page_from_freelist+0x38b/0x42c
[  346.400024]  [<c10ef736>] ? find_get_page+0x22/0x6d
[  346.400024]  [<c10f0d2f>] ? filemap_fault+0x7a/0x31a
[  346.400024]  [<c10f0999>] ? unlock_page+0x46/0x49
[  346.400024]  [<c110b5c7>] ? __do_fault+0x438/0x474
[  346.400024]  [<c1664565>] ? sk_prot_alloc+0x25/0xb8
[  346.400024]  [<c16645f8>] ? sk_prot_alloc+0xb8/0xb8
[  346.400024]  [<c16645c7>] ? sk_prot_alloc+0x87/0xb8
[  346.400024]  [<c12dfca1>] ? security_file_alloc+0x14/0x16
[  346.400024]  [<c11333c0>] ? get_empty_filp+0xfb/0x1cd
[  346.400024]  [<c16d5e31>] ip_setsockopt+0x2a/0x8c
[  346.400024]  [<c16ef873>] udp_setsockopt+0x42/0x49
[  346.400024]  [<c1662b91>] sock_common_setsockopt+0x23/0x29
[  346.400024]  [<c16613fa>] sys_setsockopt+0x6e/0x8a
[  346.400024]  [<c16627c4>] sys_socketcall+0x1fa/0x2bc
[  346.400024]  [<c1311ea0>] ? trace_hardirqs_on_thunk+0xc/0x10
[  346.400024]  [<c180f878>] sysenter_do_call+0x12/0x2d
[  346.400024] Code: 3b 4d ec 77 09 66 8b 74 0b 02 66 c1 c6 08 0f b7 ff 3b 7d e4 77 0d 0f b7 f6 89 75 e4 3b 4d ec 72 d1 eb 1d 83 c0 04 eb 2e 8b 75 c4 <8b> 4e 14 f6 81 e4 00 00 00 08 74 1f 80 7d eb 06 74 03 40 eb 16 
[  346.400024] EIP: [<c171041b>] cipso_v4_validate+0x323/0x363 SS:ESP 0068:df61dc5c


>
> All the other cases appear to correctly avoid or check it versus NULL
>
> Alan




[-- Attachment #2: cipso_test.c --]
[-- Type: text/x-csrc, Size: 660 bytes --]

#include <sys/types.h>
#include <sys/socket.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <string.h>

struct local_tag {
	char type;
	char length;
	char info[4];
};

struct cipso {
	char type;
	char length;
	char doi[4];
	struct local_tag local;	
};

int main(int argc, char **argv)
{
	int sockfd;

	struct cipso cipso = {
		.type = IPOPT_CIPSO,
		.length = sizeof(struct cipso),
		.local = {
			.type = 128,
			.length = sizeof(struct local_tag),
		},
	};

	memset(cipso.doi, 0, 4);
	cipso.doi[3] = 3;

	sockfd = socket(AF_INET, SOCK_DGRAM, 0);

	#define SOL_IP 0
	setsockopt(sockfd, SOL_IP, IP_OPTIONS, &cipso, sizeof(struct cipso));

	return 0;
}

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Greg KH @ 2012-07-14 17:10 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <1342215900-3358-1-git-send-email-jon.mason@intel.com>

On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> +static int max_num_cbs = 2;
> +module_param(max_num_cbs, uint, 0644);
> +MODULE_PARM_DESC(max_num_cbs, "Maximum number of NTB transport connections");
> +
> +static bool no_msix;
> +module_param(no_msix, bool, 0644);
> +MODULE_PARM_DESC(no_msix, "Do not allow MSI-X interrupts to be selected");

How would a user, or a distro, know to set these options?  Why are they
even options at all?


> +struct ntb_device {
> +	struct pci_dev *pdev;
> +	struct msix_entry *msix_entries;
> +	void __iomem *reg_base;
> +	struct ntb_mw mw[NTB_NUM_MW];
> +	struct {
> +		unsigned int max_spads;
> +		unsigned int max_db_bits;
> +		unsigned int msix_cnt;
> +	} limits;
> +	struct {
> +		void __iomem *pdb;
> +		void __iomem *pdb_mask;
> +		void __iomem *sdb;
> +		void __iomem *sbar2_xlat;
> +		void __iomem *sbar4_xlat;
> +		void __iomem *spad_write;
> +		void __iomem *spad_read;
> +		void __iomem *lnk_cntl;
> +		void __iomem *lnk_stat;
> +		void __iomem *spci_cmd;
> +	} reg_ofs;
> +	void *ntb_transport;
> +	void (*event_cb)(void *handle, unsigned int event);

Shouldn't the event be an enum?

> +	struct ntb_db_cb *db_cb;
> +	unsigned char hw_type;
> +	unsigned char conn_type;
> +	unsigned char dev_type;
> +	unsigned char num_msix;
> +	unsigned char bits_per_vector;
> +	unsigned char max_cbs;
> +	unsigned char link_status;
> +	struct delayed_work hb_timer;
> +	unsigned long last_ts;
> +};

Why isn't this either a 'struct device' itself, or why isn't the 'struct
pci_device' embedded within it?  What controls the lifetime of this
device?  Why doesn't it show up in sysfs?  Don't you want it to show up
in the global device tree?

> +static DEFINE_PCI_DEVICE_TABLE(ntb_pci_tbl) = {
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_BWD)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_JSF)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_CLASSIC_JSF)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_RP_JSF)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_RP_SNB)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_B2B_SNB)},
> +	{PCI_VDEVICE(INTEL, PCI_DEVICE_ID_INTEL_NTB_CLASSIC_SNB)},
> +	{0}
> +};
> +MODULE_DEVICE_TABLE(pci, ntb_pci_tbl);
> +
> +static struct ntb_device *ntbdev;

You can really only have just one of these in the whole system?  Is that
wise?  Why isn't it dynamic and tied to the pci device itself as a
child?

thanks,

greg k-h

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Johannes Truschnigg @ 2012-07-14 17:09 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Eric Dumazet, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <20120714131540.GQ16256@1wt.eu>

[-- Attachment #1: Type: text/plain, Size: 1039 bytes --]

On Sat, Jul 14, 2012 at 03:15:40PM +0200, Willy Tarreau wrote:
> On Sat, Jul 14, 2012 at 01:06:07PM +0200, Eric Dumazet wrote:
> [...]
> > Hmmm, this is supposed to fix a bug introduced in 3.4, no ?
> > 
> > So 3.3 kernel should work well ?
> 
> You're right indeed. So maybe it's not the same bug. Or maybe Johannes
> was affected by two different bugs in both versions, since Thorsten's
> report seems to point the finger at the same bug.
> 
> Johannes, are you certain that you were having the exact same issue
> with 3.3 ?

I still have the Linux 3.3-series kernel image around that I _think_ I first
saw the problem occur with. I'll try to reproduce the problem with that
kernel, but I cannot promise that results will be ready before Tuesday. I'll
keep you posted!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [RFC 1/2] PCI-Express Non-Transparent Bridge Support
From: Greg KH @ 2012-07-14 17:04 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <1342215900-3358-1-git-send-email-jon.mason@intel.com>

On Fri, Jul 13, 2012 at 02:44:59PM -0700, Jon Mason wrote:
> The NTB device driver is needed to configure these memory windows, doorbell, and
> scratch-pad registers as well as use them in such a way as they can be turned
> into a viable communication channel to the remote system.  ntb_hw.[ch]
> determines the usage model (NTB to NTB or NTB to Root Port) and abstracts away
> the underlying hardware to provide access and a common interface to the doorbell
> registers, scratch pads, and memory windows.  These hardware interfaces are
> exported so that other, non-mainlined kernel drivers can access these.

Why would you have non-mainlined drivers?

Can you submit the drivers at the same time so we see how you are using
these new interfaces?

> +++ b/drivers/ntb/ntb_hw.c
> @@ -0,0 +1,1283 @@
> +/*
> + * This file is provided under a dual BSD/GPLv2 license.  When using or
> + *   redistributing this file, you may do so under either license.
> + *
> + *   GPL LICENSE SUMMARY
> + *
> + *   Copyright(c) 2012 Intel Corporation. All rights reserved.
> + *
> + *   This program is free software; you can redistribute it and/or modify
> + *   it under the terms of version 2 of the GNU General Public License as
> + *   published by the Free Software Foundation.
> + *
> + *   This program is distributed in the hope that it will be useful, but
> + *   WITHOUT ANY WARRANTY; without even the implied warranty of
> + *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + *   General Public License for more details.
> + *
> + *   You should have received a copy of the GNU General Public License
> + *   along with this program; if not, write to the Free Software
> + *   Foundation, Inc., 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> + *   The full GNU General Public License is included in this distribution
> + *   in the file called LICENSE.GPL.

You really want to track the office movements of the FSF for the next 40
years?  You should ask your lawyers if you can remove this paragraph.

> +/**
> + * ntb_hw_link_status() - return the hardware link status
> + * @ndev: pointer to ntb_device instance
> + *
> + * Returns true if the hardware is connected to the remote system
> + *
> + * RETURNS: true or false based on the hardware link state
> + */
> +bool ntb_hw_link_status(struct ntb_device *ndev)
> +{
> +	return ndev->link_status == NTB_LINK_UP;
> +}
> +EXPORT_SYMBOL(ntb_hw_link_status);

EXPORT_SYMBOL_GPL() perhaps, for these, and the other symbols you are
creating?

thanks,

greg k-h

^ permalink raw reply

* [PATCH 2/6] drivers/net/can/softing/softing_main.c: ensure a consistent return value in error case
From: Julia Lawall @ 2012-07-14 16:43 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: kernel-janitors, Marc Kleine-Budde, linux-can, netdev,
	linux-kernel
In-Reply-To: <1342284188-19176-1-git-send-email-Julia.Lawall@lip6.fr>

From: Julia Lawall <Julia.Lawall@lip6.fr>

Typically, the return value desired for the failure of a function with an
integer return value is a negative integer.  In these cases, the return
value is sometimes a negative integer and sometimes 0, due to a subsequent
initialization of the return variable within the loop.

A simplified version of the semantic match that finds this problem is:
(http://coccinelle.lip6.fr/)

//<smpl>
@r exists@
identifier ret;
position p;
constant C;
expression e1,e3,e4;
statement S;
@@

ret = -C
... when != ret = e3
    when any
if@p (...) S
... when any
if (\(ret != 0\|ret < 0\|ret > 0\) || ...) { ... return ...; }
... when != ret = e3
    when any
*if@p (...)
{
  ... when != ret = e4
  return ret;
}
//</smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

---
 drivers/net/can/softing/softing_main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/softing/softing_main.c b/drivers/net/can/softing/softing_main.c
index a7c77c7..f2a221e 100644
--- a/drivers/net/can/softing/softing_main.c
+++ b/drivers/net/can/softing/softing_main.c
@@ -826,12 +826,12 @@ static __devinit int softing_pdev_probe(struct platform_device *pdev)
 		goto sysfs_failed;
 	}
 
-	ret = -ENOMEM;
 	for (j = 0; j < ARRAY_SIZE(card->net); ++j) {
 		card->net[j] = netdev =
 			softing_netdev_create(card, card->id.chip[j]);
 		if (!netdev) {
 			dev_alert(&pdev->dev, "failed to make can[%i]", j);
+			ret = -ENOMEM;
 			goto netdev_failed;
 		}
 		priv = netdev_priv(card->net[j]);


^ permalink raw reply related

* Re: [PATCH] iptables: xt_recent: Add optional mask option for xt_recent
From: Pablo Neira Ayuso @ 2012-07-14 15:05 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netfilter-devel, Linux netdev
In-Reply-To: <1337285337-13619-1-git-send-email-denys@visp.net.lb>

Hi Denys,

On Thu, May 17, 2012 at 11:08:57PM +0300, Denys Fedoryshchenko wrote:
> Use case for this feature:
> 1)In some occasions if you need to allow,block,match specific subnet.
> 2)I can use recent as a trigger when netfilter rule matches, with mask 0.0.0.0
> 
> Tested for backward compatibility:
> )old (userspace) iptables, new kernel
> )old kernel, new iptables
> )new kernel, new iptables

I've cleaned up this patch:

http://git.netfilter.org/cgi-bin/gitweb.cgi?p=iptables.git;a=commit;h=73bf03981dfaee48ac1d6da380d46501a96cc83e

It's not yet in master. Please, check that this is correct.

BTW, the man page update is still missing.

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 14:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hillf Danton, Johannes Truschnigg, linux-kernel, Linux-Netdev
In-Reply-To: <1342275540.3265.9760.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 04:19:00PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 22:08 +0800, Hillf Danton wrote:
> > On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > Might be, or not (could be a NIC bug)
> > >
> > Dunno why sendfile sits in the layer of NIC and
> > how they interact.
> 
> sendfile() relies heavily on TSO capabilities, a buggy NIC could
> corrupt frame content on some obscure occasions.
> 
> We had some known cases on IPv6 for example.

Similarly I remind having experienced bugs on early Yukon chips years
ago that would regularly emit total crap on the wire.

Willy

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 14:19 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev
In-Reply-To: <CAJd=RBAJOqJDSBpaaB+2-WU_pa5vChXSf6TbLH8fi3HNt6hZ9w@mail.gmail.com>

On Sat, 2012-07-14 at 22:08 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Might be, or not (could be a NIC bug)
> >
> Dunno why sendfile sits in the layer of NIC and
> how they interact.

sendfile() relies heavily on TSO capabilities, a buggy NIC could
corrupt frame content on some obscure occasions.

We had some known cases on IPv6 for example.

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Hillf Danton @ 2012-07-14 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev
In-Reply-To: <1342254042.3265.9017.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Might be, or not (could be a NIC bug)
>
Dunno why sendfile sits in the layer of NIC and
how they interact.

^ permalink raw reply

* SUSPECT: 营>销管<理者的八维行为准则c履行管理职能与创造销售结果
From: 6 @ 2012-07-14 11:32 UTC (permalink / raw)
  To: netdemon

[-- Attachment #1: 从 销(售骨干走向管)理高手快速蜕变特 训营.xls --]
[-- Type: application/octet-stream, Size: 38912 bytes --]

^ permalink raw reply

* [PATCH net-next] netem: refine early skb orphaning
From: Eric Dumazet @ 2012-07-14 13:16 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Hagen Paul Pfeifer, Mark Gordon, Andreas Terzis,
	Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
or any mechanism relying on socket sk_wmem_alloc feedback.

Ideally, we should perform this orphaning after the rate module and
before the delay module, to mimic what happens on a real link :

skb orphaning is indeed normally done at TX completion, before the
transit on the link.

+-------+   +--------+  +---------------+  +-----------------+
+ Qdisc +---> Device +--> TX completion +--> links / hops    +->
+       +   +  xmit  +  + skb orphaning +  + propagation     +
+-------+   +--------+  +---------------+  +-----------------+
      < rate limiting >                  < delay, drops, reorders >

If netem is used without delay feature (drops, reorders, rate
limiting), then we should avoid early skb orphaning, to keep pressure
on sockets as long as packets are still in qdisc queue.

Ideally, netem should be refactored to implement delay module
as the last stage. Current algorithm merges the two phases
(rate limiting + delay) so its not correct.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/sched/sch_netem.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index c412ad0..298c0dd 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -380,7 +380,14 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
 	}
 
-	skb_orphan(skb);
+	/* If a delay is expected, orphan the skb. (orphaning usually takes
+	 * place at TX completion time, so _before_ the link transit delay)
+	 * Ideally, this orphaning should be done after the rate limiting
+	 * module, because this breaks TCP Small Queue, and other mechanisms
+	 * based on socket sk_wmem_alloc.
+	 */
+	if (q->latency || q->jitter)
+		skb_orphan(skb);
 
 	/*
 	 * If we need to duplicate packet, then re-insert at top of the

^ permalink raw reply related

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 13:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <1342263967.3265.9347.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 01:06:07PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:44 +0200, Willy Tarreau wrote:
> > On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> > > On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > > > On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > > > > Please Johannes could you try latest kernel tree ?
> > > > > 
> > > > > It would be useful, especially given the amount of changes you performed
> > > > > in this area in latest version, it could be very possible that this new
> > > > > bug got fixed as a side effect !
> > > > 
> > > > I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> > > > and what can I say - the problem really seems to have disappeared. I performed
> > > > about 3700 iterations of my previos tests over the night, and the data always
> > > > turned out to be OK, not a single byte turned out kaput!
> > > > 
> > > > I wish I would have tested that earlier, and spared you the noise... well,
> > > > maybe someone who runs into a similar problem in the future will have this
> > > > discovery save her/him some time and headaches and make her/him just upgrade
> > > > kernels :)
> > > > 
> > > > Thanks a lot for your polite and quick responses!
> > > > 
> > > 
> > > Nice to hear. Now we should make sure we have all needed fixes for prior
> > > stable kernels as well !
> > > 
> > > Still trying to understand the issue, since I thought I only did
> > > optimizations, not bug fixes. So maybe real bug is still there but its
> > > probability of occurrence lowered enough to not hit your workload.
> > 
> > Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.
> > 
> > I'm wondering whether this patch merged into 3.4.2 one has an impact on
> > sendfile :
> > 
> >   commit b642cb6a143da812f188307c2661c0357776a9d0
> >   Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
> >   Date:   Tue Jun 5 21:36:33 2012 +0400
> > 
> >     radix-tree: fix contiguous iterator
> >     
> >     commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
> >     
> >     This patch fixes bug in macro radix_tree_for_each_contig().
> >     
> >     If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
> >     radix_tree_next_chunk() switches iterating into next chunk. As result iterating
> >     becomes non-contiguous and breaks vfs "splice" and all its users.
> > 
> > Willy
> > 
> 
> 
> Hmmm, this is supposed to fix a bug introduced in 3.4, no ?
> 
> So 3.3 kernel should work well ?

You're right indeed. So maybe it's not the same bug. Or maybe Johannes
was affected by two different bugs in both versions, since Thorsten's
report seems to point the finger at the same bug.

Johannes, are you certain that you were having the exact same issue
with 3.3 ?

Willy

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Thorsten Kranzkowski @ 2012-07-14 11:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Johannes Truschnigg, Willy Tarreau, Hillf Danton, linux-kernel,
	Linux-Netdev
In-Reply-To: <1342262004.3265.9279.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > > Please Johannes could you try latest kernel tree ?
> > > 
> > > It would be useful, especially given the amount of changes you performed
> > > in this area in latest version, it could be very possible that this new
> > > bug got fixed as a side effect !
> > 
> > I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> > and what can I say - the problem really seems to have disappeared. I performed
> > about 3700 iterations of my previos tests over the night, and the data always
> > turned out to be OK, not a single byte turned out kaput!
> > 
> > I wish I would have tested that earlier, and spared you the noise... well,
> > maybe someone who runs into a similar problem in the future will have this
> > discovery save her/him some time and headaches and make her/him just upgrade
> > kernels :)
> > 
> > Thanks a lot for your polite and quick responses!
> > 
> 
> Nice to hear. Now we should make sure we have all needed fixes for prior
> stable kernels as well !
> 
> Still trying to understand the issue, since I thought I only did
> optimizations, not bug fixes. So maybe real bug is still there but its
> probability of occurrence lowered enough to not hit your workload.
>  
> Hmmm...
> 

Not sure if this is related, but I had a similar data corruption problem:
Reading data from filesystem 'normally' (including through nfs) showed
corruption at random places, mostly 0xff tuning into 0xfe.
Reading with ODIRECT (I used 'dd iflag=direct') was OK.

I found my problem to be fixed by
fffaee365fded09f9ebf2db19066065fa54323c3 (upstrem)
which was backported as
b642cb6a143da812f188307c2661c0357776a9d0 (stable, v3.4.1-66-gb642cb6)


Bye,
Thorsten

-- 
| Thorsten Kranzkowski        Internet: dl8bcu@dl8bcu.de                      |
| Mobile: ++49 170 1876134       Snail: Kiebitzstr. 14, 49324 Melle, Germany  |
| Ampr: dl8bcu@db0lj.#rpl.deu.eu, dl8bcu@marvin.dl8bcu.ampr.org [44.130.8.19] |

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 11:06 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <20120714104441.GP16256@1wt.eu>

On Sat, 2012-07-14 at 12:44 +0200, Willy Tarreau wrote:
> On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> > On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > > On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > > > Please Johannes could you try latest kernel tree ?
> > > > 
> > > > It would be useful, especially given the amount of changes you performed
> > > > in this area in latest version, it could be very possible that this new
> > > > bug got fixed as a side effect !
> > > 
> > > I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> > > and what can I say - the problem really seems to have disappeared. I performed
> > > about 3700 iterations of my previos tests over the night, and the data always
> > > turned out to be OK, not a single byte turned out kaput!
> > > 
> > > I wish I would have tested that earlier, and spared you the noise... well,
> > > maybe someone who runs into a similar problem in the future will have this
> > > discovery save her/him some time and headaches and make her/him just upgrade
> > > kernels :)
> > > 
> > > Thanks a lot for your polite and quick responses!
> > > 
> > 
> > Nice to hear. Now we should make sure we have all needed fixes for prior
> > stable kernels as well !
> > 
> > Still trying to understand the issue, since I thought I only did
> > optimizations, not bug fixes. So maybe real bug is still there but its
> > probability of occurrence lowered enough to not hit your workload.
> 
> Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.
> 
> I'm wondering whether this patch merged into 3.4.2 one has an impact on
> sendfile :
> 
>   commit b642cb6a143da812f188307c2661c0357776a9d0
>   Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
>   Date:   Tue Jun 5 21:36:33 2012 +0400
> 
>     radix-tree: fix contiguous iterator
>     
>     commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
>     
>     This patch fixes bug in macro radix_tree_for_each_contig().
>     
>     If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
>     radix_tree_next_chunk() switches iterating into next chunk. As result iterating
>     becomes non-contiguous and breaks vfs "splice" and all its users.
> 
> Willy
> 


Hmmm, this is supposed to fix a bug introduced in 3.4, no ?

So 3.3 kernel should work well ?

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14 10:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Johannes Truschnigg, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <1342262004.3265.9279.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > > Please Johannes could you try latest kernel tree ?
> > > 
> > > It would be useful, especially given the amount of changes you performed
> > > in this area in latest version, it could be very possible that this new
> > > bug got fixed as a side effect !
> > 
> > I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> > and what can I say - the problem really seems to have disappeared. I performed
> > about 3700 iterations of my previos tests over the night, and the data always
> > turned out to be OK, not a single byte turned out kaput!
> > 
> > I wish I would have tested that earlier, and spared you the noise... well,
> > maybe someone who runs into a similar problem in the future will have this
> > discovery save her/him some time and headaches and make her/him just upgrade
> > kernels :)
> > 
> > Thanks a lot for your polite and quick responses!
> > 
> 
> Nice to hear. Now we should make sure we have all needed fixes for prior
> stable kernels as well !
> 
> Still trying to understand the issue, since I thought I only did
> optimizations, not bug fixes. So maybe real bug is still there but its
> probability of occurrence lowered enough to not hit your workload.

Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.

I'm wondering whether this patch merged into 3.4.2 one has an impact on
sendfile :

  commit b642cb6a143da812f188307c2661c0357776a9d0
  Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
  Date:   Tue Jun 5 21:36:33 2012 +0400

    radix-tree: fix contiguous iterator
    
    commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
    
    This patch fixes bug in macro radix_tree_for_each_contig().
    
    If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
    radix_tree_next_chunk() switches iterating into next chunk. As result iterating
    becomes non-contiguous and breaks vfs "splice" and all its users.

Willy

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14 10:33 UTC (permalink / raw)
  To: Johannes Truschnigg
  Cc: Willy Tarreau, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <20120714101321.GA26329@vault.local>

On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > Please Johannes could you try latest kernel tree ?
> > 
> > It would be useful, especially given the amount of changes you performed
> > in this area in latest version, it could be very possible that this new
> > bug got fixed as a side effect !
> 
> I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> and what can I say - the problem really seems to have disappeared. I performed
> about 3700 iterations of my previos tests over the night, and the data always
> turned out to be OK, not a single byte turned out kaput!
> 
> I wish I would have tested that earlier, and spared you the noise... well,
> maybe someone who runs into a similar problem in the future will have this
> discovery save her/him some time and headaches and make her/him just upgrade
> kernels :)
> 
> Thanks a lot for your polite and quick responses!
> 

Nice to hear. Now we should make sure we have all needed fixes for prior
stable kernels as well !

Still trying to understand the issue, since I thought I only did
optimizations, not bug fixes. So maybe real bug is still there but its
probability of occurrence lowered enough to not hit your workload.
 
Hmmm...

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Johannes Truschnigg @ 2012-07-14 10:13 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Eric Dumazet, Hillf Danton, linux-kernel, Linux-Netdev
In-Reply-To: <20120714083136.GO16256@1wt.eu>

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]

On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > Please Johannes could you try latest kernel tree ?
> 
> It would be useful, especially given the amount of changes you performed
> in this area in latest version, it could be very possible that this new
> bug got fixed as a side effect !

I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
and what can I say - the problem really seems to have disappeared. I performed
about 3700 iterations of my previos tests over the night, and the data always
turned out to be OK, not a single byte turned out kaput!

I wish I would have tested that earlier, and spared you the noise... well,
maybe someone who runs into a similar problem in the future will have this
discovery save her/him some time and headaches and make her/him just upgrade
kernels :)

Thanks a lot for your polite and quick responses!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [net 1/2] e1000e: Correct link check logic for 82571 serdes
From: Jeff Kirsher @ 2012-07-14  8:34 UTC (permalink / raw)
  To: davem
  Cc: Tushar Dave, netdev, gospo, sassmann, stable, dnelson,
	bruce.w.allan, Jeff Kirsher

From: Tushar Dave <tushar.n.dave@intel.com>

SYNCH bit and IV bit of RXCW register are sticky. Before examining these bits,
RXCW should be read twice to filter out one-time false events and have correct
values for these bits. Incorrect values of these bits in link check logic can
cause weird link stability issues if auto-negotiation fails.

CC: stable <stable@vger.kernel.org> [2.6.38+]
Reported-by: Dean Nelson <dnelson@redhat.com>
Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/82571.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/82571.c b/drivers/net/ethernet/intel/e1000e/82571.c
index 36db4df..1f063dc 100644
--- a/drivers/net/ethernet/intel/e1000e/82571.c
+++ b/drivers/net/ethernet/intel/e1000e/82571.c
@@ -1572,6 +1572,9 @@ static s32 e1000_check_for_serdes_link_82571(struct e1000_hw *hw)
 	ctrl = er32(CTRL);
 	status = er32(STATUS);
 	rxcw = er32(RXCW);
+	/* SYNCH bit and IV bit are sticky */
+	udelay(10);
+	rxcw = er32(RXCW);
 
 	if ((rxcw & E1000_RXCW_SYNCH) && !(rxcw & E1000_RXCW_IV)) {
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Willy Tarreau @ 2012-07-14  8:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hillf Danton, Johannes Truschnigg, linux-kernel, Linux-Netdev
In-Reply-To: <1342254042.3265.9017.camel@edumazet-glaptop>

On Sat, Jul 14, 2012 at 10:20:41AM +0200, Eric Dumazet wrote:
> > On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > In fact it has been true zero copy in 2.6.25 until we faced a large
> > > amount of data corruption and the zero copy was disabled in 2.6.25.X.
> > > Since then it remained that way until you brought your patches to
> > > re-instantiate it.
> 
> Might be, or not (could be a NIC bug)

I may be wrong but what I recall from this bug was an issue when
forwarding TCP between two NICs, related to linear vs non-linear
data (I have memories of something around data not yet ACKed being
replaced before being retransmitted but I may be wrong). Anyway,
the way it was fixed consisted in simply disabling the zero-copy
code path. So this should be something different from what Johannes
reports. Maybe a regression since then though.

> Please Johannes could you try latest kernel tree ?

It would be useful, especially given the amount of changes you performed
in this area in latest version, it could be very possible that this new
bug got fixed as a side effect !

Regards,
Willy

^ permalink raw reply

* Re: [RFC 2/2] net: Add support for NTB virtual ethernet device
From: Jiri Pirko @ 2012-07-14  8:30 UTC (permalink / raw)
  To: Jon Mason; +Cc: linux-kernel, netdev, linux-pci, Dave Jiang
In-Reply-To: <20120714055034.GB4808@jonmason-lab>

Sat, Jul 14, 2012 at 07:50:35AM CEST, jon.mason@intel.com wrote:
>On Sat, Jul 14, 2012 at 01:14:03AM +0200, Jiri Pirko wrote:
>> Fri, Jul 13, 2012 at 11:45:00PM CEST, jon.mason@intel.com wrote:
>> >A virtual ethernet device that uses the NTB transport API to send/receive data.
>> >
>> >Signed-off-by: Jon Mason <jon.mason@intel.com>
>> >---
>> > drivers/net/Kconfig      |    4 +
>> > drivers/net/Makefile     |    1 +
>> > drivers/net/ntb_netdev.c |  411 ++++++++++++++++++++++++++++++++++++++++++++++
>> > 3 files changed, 416 insertions(+), 0 deletions(-)
>> > create mode 100644 drivers/net/ntb_netdev.c

<snip>

>> >+
>> >+static const struct net_device_ops ntb_netdev_ops = {
>> >+	.ndo_open = ntb_netdev_open,
>> >+	.ndo_stop = ntb_netdev_close,
>> >+	.ndo_start_xmit = ntb_netdev_start_xmit,
>> >+	.ndo_change_mtu = ntb_netdev_change_mtu,
>> >+	.ndo_tx_timeout = ntb_netdev_tx_timeout,
>> >+	.ndo_set_mac_address = eth_mac_addr,
>> 
>> Does your device support mac change while it's up and running?
>
>It's virtual ethernet, so there is no hardware limitation, only what is acceptable for the remote side to receive.

In that case, it would be good to do:
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;

This enables mac change in eth_mac_addr() when iface is running.

<snip>

>> >+
>> >+static int __init ntb_netdev_init_module(void)
>> >+{
>> >+	struct ntb_netdev *dev;
>> >+	int rc;
>> >+
>> >+	pr_info("%s: Probe\n", KBUILD_MODNAME);
>> >+
>> >+	netdev = alloc_etherdev(sizeof(struct ntb_netdev));
>> 
>> I might be missing something but this place (module init) does not seems
>> like a good place to do alloc_etherdev(). Do you want to support only
>> one netdevice instance?
>> 
>> Anyway, I think that using "static netdev" should be avoided in any case.
>> 
>
>It would fail the probe if there is no underlying ntb hardware, but it would make sense to check for that before allocing the etherdev.

But isn't there possible to have multiple ntb hardware devices? It would make
sense to register ntb device here with ntb core and let the core call
probe which would actually create new netdev.

Is there a limitation that one underlying ntb hardware ~ one ntb netdevice?

Thanks,
Jiri

^ permalink raw reply

* Re: resurrecting tcphealth
From: Eric Dumazet @ 2012-07-14  8:27 UTC (permalink / raw)
  To: Piotr Sawuk; +Cc: netdev, linux-kernel
In-Reply-To: <cc6495b92f1df180c1ad43057ceb0b98.squirrel@webmail.univie.ac.at>

On Sat, 2012-07-14 at 09:56 +0200, Piotr Sawuk wrote:
> On Sa, 14.07.2012, 03:31, valdis.kletnieks@vt.edu wrote:
> > On Fri, 13 Jul 2012 16:55:44 -0700, Stephen Hemminger said:
> >
> >> >+			/* Course retransmit inefficiency- this packet has been received
> >> twice. */
> >> >+			tp->dup_pkts_recv++;
> >> I don't understand that comment, could you use a better sentence please?
> >
> > I think what was intended was:
> >
> > /* Curse you, retransmit inefficiency! This packet has been received at
> least twice */
> >
> 
> LOL, no. I think "course retransmit" is short for "course-grained timeout
> caused retransmit" but I can't be sure since I'm not the author of these
> lines. I'll replace that comment with the non-shorthand version though.
> however, I think the real comment here should be:
> 
> /*A perceived shortcoming of the standard TCP implementation: A
> TCP receiver can get duplicate packets from the sender because it cannot
> acknowledge packets that arrive out of order. These duplicates would happen
> when the sender mistakenly thinks some packets have been lost by the network
> because it does not receive acks for them but in reality they were
> successfully received out of order. Since the receiver has no way of letting
> the sender know about the receipt of these packets, they could potentially
> be re-sent and re-received at the receiver. Not only do duplicate packets
> waste precious Internet bandwidth but they hurt performance because the
> sender mistakenly detects congestion from packet losses. The SACK TCP
> extension speci\fcally addresses this issue. A large number of duplicate
> packets received would indicate a signi\fcant bene\ft to the wide adoption of
> SACK. The duplicatepacketsreceived metric is computed at the
> receiver and counts these packets on a per-connection basis.*/
> 
> as copied from his thesis at [1]. also in the thesis he writes:
> 
> In our limited experiment, the results indicated no duplicate packets were
> received on any connection in the 18 hour run. This leads us to several
> conclusions. Since duplicate ACKs were seen on many connections we know that
> some packets were lost or reordered, but unACKed reordered packets never
> caused a /coursegrainedtimeouts/ on our connections. Only these timeouts
> will cause duplicate packets to be received since less severe out-of-order
> conditions will be resolved with fast retransmits. The lack of course
> timeouts
> may be due to the quality of UCSD's ActiveWeb network or the paucity of
> large gaps between received packet groups. It should be noted that Linux 2.2
> implements fast retransmits for up to two packet gaps, thus reducing the
> need for course grained timeouts due to the lack of SACK.
> 
> [1] https://sacerdoti.org/tcphealth/tcphealth-paper.pdf

Not sure how pertinent is this paper today in 2012

I would prefer you add global counters, instead of per tcp counters that
most applications wont use at all.

Example of a more useful patch : add a counter of packets queued in Out
Of Order queue ( in tcp_data_queue_ofo() )

"netstat -s" will display the total count, without any changes in
userland tools/applications.

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Eric Dumazet @ 2012-07-14  8:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Johannes Truschnigg, linux-kernel, Willy Tarreau, Linux-Netdev
In-Reply-To: <CAJd=RBAntSubDBbJ292SzeoN4hTwBQ_Q23jt+Y6i-+vfrQ5EHQ@mail.gmail.com>

On Sat, 2012-07-14 at 16:04 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 1:18 AM, Johannes Truschnigg
> <johannes@truschnigg.info> wrote:
> > Hello good people of linux-kernel.
> >
> > I've been bothered by silent data corruption from my personal fileserver - no
> > matter the Layer 7 protocol used, huge transfers sporadically ended up damaged
> > in-flight. I used Samba/CIFS, NFS(v4, via TCP), Apache httpd 2.2, thttpd,
> > python and netcat to verify this.
> >
> > I think I managed to track down the culprit: as soon as I disable sendfile()
> > for all programs that support such a configuration (netcat, afaik, won't ever
> > use sendfile() to transmit data over a socket, so the problem was never
> > reproducible there in the first place), everything reverts to perfect and
> > proper working condition.
> >
> > I've been experiencing this problem with vanilla kernel releases from the 3.3
> > up until 3.4.0 series. I do not know if it also occurs with earlier releases,
> > but I can verify if that is useful. I set up the environment for a minimal
> > kind of testcase (a large ISO image file available from the server's local
> > filesystem, as well as from a mounted NFS export - once via lo, and once via
> > br0/eth0), and proceeded to do the following:
> >
> > i=0; for i in {1..100}
> > do
> >   echo "pass $i:"; sync; echo 3 > /proc/sys/vm/drop_caches
> >   cmp -b /mnt/nfs-test/lo/tmp/X15-65741.iso /srv/files/pub/tmp/X15-65741.iso
> > done
> >
> > I then rotated the source of the data, and tested the network-mount against
> > the loopback-mount, as well as the network-mount against the local filesystem.
> >
> > Computing the file's md5sum in a loop whilst dropping caches after each
> > iteration by reading it directly from its location in the filesystem produces
> > the very same hash every time - I therefore think it's safe to assume the
> > corruption is introduced when traversing the networking stack. The hash also
> > does not change if I repeadetly compute the md5sum of the file as transferred
> > by, e. g., Apache httpd or smbd with sendfile explicitly disabled.
> >
> > Please take a look at the attachment to see the actual output of the above
> > script. It does not matter if I do an actual transfer over the network from my
> > server to one of its clients (I verified the problem with two different client
> > machines, one even running Windows), or if the server is both source and
> > destination of the transfer - as long as sendfile is involed, some of the data
> > will always become garbled sooner or later. That also leads me to believe that
> > my internetworking devices (my switch in particular) is working just fine;
> > testing bulky transfers from one host to another confirms this insofar as thus
> > all data makes it through unscathed.
> >
> > As soon as I switch off sendfile-support (in, e. g. Samba or Apache httpd), I
> > can run a series of thousands and more transfers, and not experience any
> > corruption at all. Whenever the data gets fubared, there is no hint at
> > anything fishy going on in the debug ringbuffer - curruption takes place in
> > total silence.
> >
> > The system in question has an Intel Pro/1000 PCI-e NIC for doing the networked
> > file transfers, and is backed by a md RAID5-Array with LVM2 on top. The 4GB of
> > system memory (ECC-enabled UDIMM) are operating in S4ECD4ED mode as reported
> > by EDAC, and there are no reported errors. The CPU I have installed is an AMD
> > Athlon II X2 245e on an ASUS M4A88TD-M/USB3 Motherboard. It's running Gentoo
> > for amd64. The box can run prime96 in torture mode and linpack just fine for
> > days - I'm therefore assuming the hardware to be working correctly.
> >
> > I have attached my kernel's config (from 3.4.0, as that's the image that I
> > have running right now) attached for sake of completeness, as well as some
> > information for you to see how I tested, and what these tests actually
> > produced. If you need any other information to help track this down, please
> > let me know.
> >
> > If you decide to answer please keep me CC'd, as I'm not subscribed to this
> > list.
> >
> > Just in case the numerous attachments get scrubbed/removed, I've also uploaded
> > them to http://johannes.truschnigg.info/tmp/sendfile_data_corruption/
> >
> > Thanks for reading, and have a nice weekend everyone :)
> >
> 
> Is the above corruption related to the one below?
> 
> 
> On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
> >
> > In fact it has been true zero copy in 2.6.25 until we faced a large
> > amount of data corruption and the zero copy was disabled in 2.6.25.X.
> > Since then it remained that way until you brought your patches to
> > re-instantiate it.

Might be, or not (could be a NIC bug)

Please Johannes could you try latest kernel tree ?

^ permalink raw reply

* Re: [RFC PATCH] tun: don't zeroize sock->file on detach
From: Al Viro @ 2012-07-14  8:15 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: davem, netdev, ruanzhijie, linux-kernel
In-Reply-To: <20120711114753.24395.53193.stgit@localhost6.localdomain6>

On Wed, Jul 11, 2012 at 03:48:20PM +0400, Stanislav Kinsbursky wrote:
> This is a fix for bug, introduced in 3.4 kernel by commit
> 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d, which, among other things, replaced
> simple sock_put() by sk_release_kernel(). Below is sequence, which leads to
> oops for non-persistent devices:
> 
> tun_chr_close()
> tun_detach()				<== tun->socket.file = NULL
> tun_free_netdev()
> sk_release_sock()
> sock_release(sock->file == NULL)
> iput(SOCK_INODE(sock))			<== dereference on NULL pointer
> 
> This patch just removes zeroing of socket's file from __tun_detach().
> sock_release() will do this.
> 
> Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
> ---
>  drivers/net/tun.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 987aeef..c1639f3 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -185,7 +185,6 @@ static void __tun_detach(struct tun_struct *tun)
>  	netif_tx_lock_bh(tun->dev);
>  	netif_carrier_off(tun->dev);
>  	tun->tfile = NULL;
> -	tun->socket.file = NULL;
>  	netif_tx_unlock_bh(tun->dev);

ACK, but I have to say that I don't like the entire area.  The games around sock->file
in general tend to be really nasty.  Examples:
1) net/9p/trans_fd.c:p9_socket_open():
	we come there with freshly created and connected struct socket in *csocket
	we do sock_map_fd() and bugger off if it fails
	we do get_file(csocket->file) twice and, having grabbed the references, close
the damn fd.
What happens if that races with close() on the same fd before we get to those get_file()?
We hit sock_close(), which calls sock_release(), which clears csocket->file.  Boom -
atomic_inc_long(&NULL->f_count) is not going to do us any good.  Outright bug, mitigated
only by the fact that all callchains to that place go through mount(2), so you have elevated
privs anyway.

2) with this sucker we hit an interesting interplay with vhost; note that the total effect
of tun_get_socket() does *not* include any refcount changes.  Nor should it - the caller
has a valid reference to struct file, after all.  Eventually the caller proceeds to drop
the same reference, by doing fput(sock->file).  And it will be the same struct file, but
proving that takes a lot of digging through the tun.c guts; the crucial observation is that
we never get to __tun_detach() as long as we have a reference to opened (cdev) file that
has been successfully attached at some point and that ones that hadn't been attached at
all wouldn't have passed through tun_get_socket().  IOW, it works, but it's brittle as hell.
Unless I've missed something in the analysis and it's really broken, that is.

Frankly, I would prefer to keep the reference to struct file for vhost explicitly in vhost
data structures.  Would be less dependent on the guts of tun/macvtap/whatnot that way...

3) iscsi goes as far as allocating fake struct file (with kzalloc(), and $DEITY help you
if you ever call fput() on that), presumably for the sake of sctp.  The only place in sctp
stack I see looking at sock->file is
        /* in-kernel sockets don't generally have a file allocated to them
         * if all they do is call sock_create_kern().
         */
        if (sk->sk_socket->file)
                f_flags = sk->sk_socket->file->f_flags;

        timeo = sock_sndtimeo(sk, f_flags & O_NONBLOCK);
in __sctp_connect() and AFAICS we could bloody well have left it NULL - we leave ->f_flags
zero in that code anyway and that's what __sctp_connect() will presume on NULL ->file.
I'm not familiar enough with sctp or iscsi, but at the first look it seems to be asking
for removal of all those games with ->file in the latter.

I really wonder if we have a single legitimate case for anything other than sock_alloc_file()
setting sock->file.  Anyone?

^ permalink raw reply

* Re: PROBLEM: Silent data corruption when using sendfile()
From: Hillf Danton @ 2012-07-14  8:04 UTC (permalink / raw)
  To: Johannes Truschnigg
  Cc: linux-kernel, Eric Dumazet, Willy Tarreau, Linux-Netdev
In-Reply-To: <20120713171835.GA26052@vault.local>

On Sat, Jul 14, 2012 at 1:18 AM, Johannes Truschnigg
<johannes@truschnigg.info> wrote:
> Hello good people of linux-kernel.
>
> I've been bothered by silent data corruption from my personal fileserver - no
> matter the Layer 7 protocol used, huge transfers sporadically ended up damaged
> in-flight. I used Samba/CIFS, NFS(v4, via TCP), Apache httpd 2.2, thttpd,
> python and netcat to verify this.
>
> I think I managed to track down the culprit: as soon as I disable sendfile()
> for all programs that support such a configuration (netcat, afaik, won't ever
> use sendfile() to transmit data over a socket, so the problem was never
> reproducible there in the first place), everything reverts to perfect and
> proper working condition.
>
> I've been experiencing this problem with vanilla kernel releases from the 3.3
> up until 3.4.0 series. I do not know if it also occurs with earlier releases,
> but I can verify if that is useful. I set up the environment for a minimal
> kind of testcase (a large ISO image file available from the server's local
> filesystem, as well as from a mounted NFS export - once via lo, and once via
> br0/eth0), and proceeded to do the following:
>
> i=0; for i in {1..100}
> do
>   echo "pass $i:"; sync; echo 3 > /proc/sys/vm/drop_caches
>   cmp -b /mnt/nfs-test/lo/tmp/X15-65741.iso /srv/files/pub/tmp/X15-65741.iso
> done
>
> I then rotated the source of the data, and tested the network-mount against
> the loopback-mount, as well as the network-mount against the local filesystem.
>
> Computing the file's md5sum in a loop whilst dropping caches after each
> iteration by reading it directly from its location in the filesystem produces
> the very same hash every time - I therefore think it's safe to assume the
> corruption is introduced when traversing the networking stack. The hash also
> does not change if I repeadetly compute the md5sum of the file as transferred
> by, e. g., Apache httpd or smbd with sendfile explicitly disabled.
>
> Please take a look at the attachment to see the actual output of the above
> script. It does not matter if I do an actual transfer over the network from my
> server to one of its clients (I verified the problem with two different client
> machines, one even running Windows), or if the server is both source and
> destination of the transfer - as long as sendfile is involed, some of the data
> will always become garbled sooner or later. That also leads me to believe that
> my internetworking devices (my switch in particular) is working just fine;
> testing bulky transfers from one host to another confirms this insofar as thus
> all data makes it through unscathed.
>
> As soon as I switch off sendfile-support (in, e. g. Samba or Apache httpd), I
> can run a series of thousands and more transfers, and not experience any
> corruption at all. Whenever the data gets fubared, there is no hint at
> anything fishy going on in the debug ringbuffer - curruption takes place in
> total silence.
>
> The system in question has an Intel Pro/1000 PCI-e NIC for doing the networked
> file transfers, and is backed by a md RAID5-Array with LVM2 on top. The 4GB of
> system memory (ECC-enabled UDIMM) are operating in S4ECD4ED mode as reported
> by EDAC, and there are no reported errors. The CPU I have installed is an AMD
> Athlon II X2 245e on an ASUS M4A88TD-M/USB3 Motherboard. It's running Gentoo
> for amd64. The box can run prime96 in torture mode and linpack just fine for
> days - I'm therefore assuming the hardware to be working correctly.
>
> I have attached my kernel's config (from 3.4.0, as that's the image that I
> have running right now) attached for sake of completeness, as well as some
> information for you to see how I tested, and what these tests actually
> produced. If you need any other information to help track this down, please
> let me know.
>
> If you decide to answer please keep me CC'd, as I'm not subscribed to this
> list.
>
> Just in case the numerous attachments get scrubbed/removed, I've also uploaded
> them to http://johannes.truschnigg.info/tmp/sendfile_data_corruption/
>
> Thanks for reading, and have a nice weekend everyone :)
>

Is the above corruption related to the one below?


On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau <w@1wt.eu> wrote:
>
> In fact it has been true zero copy in 2.6.25 until we faced a large
> amount of data corruption and the zero copy was disabled in 2.6.25.X.
> Since then it remained that way until you brought your patches to
> re-instantiate it.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox