Netdev List
* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Brad Campbell @ 2011-06-08  0:15 UTC (permalink / raw)
  To: Bart De Schuymer
  Cc: Patrick McHardy, kvm, linux-mm, linux-kernel, netdev,
	netfilter-devel
In-Reply-To: <4DEE6815.7040504@pandora.be>

On 08/06/11 02:04, Bart De Schuymer wrote:

> If the bug is easily triggered with your guest os, then you could try to
> capture the traffic with wireshark (or something else) in a
> configuration that doesn't crash your system. Save the traffic in a pcap
> file. Then you can see if resending that traffic in the vulnerable
> configuration triggers the bug (I don't know if something in Windows
> exists, but tcpreplay should work for Linux). Once you have such a
> capture, chances are the bug is even easily reproducible by us (unless
> it's hardware-specific). Success isn't guaranteed, but I think it's
> worth a shot...

The issue with this is I don't have a configuration that does not crash 
the system. This only happens under the specific circumstance that 
traffic from one VM is being DNAT'd to another. If I disable 
CONFIG_BRIDGE_NETFILTER, or I leave out the DNAT then I can't replicate 
the problem as I don't seem to be able to get the packets to go where I 
want them to go.

Let me try and explain it a little more clearly with made up IP 
addresses to illustrate the problem.

I have VM A (1.1.1.2) and VM B (1.1.1.3) on br1 (1.1.1.1)
I have public IP on ppp0 (2.2.2.2).

VM B can talk to VM A using its host address (1.1.1.2) and there is no 
problem.

The DNAT says anything destined for PPP0 that is on port 443 and coming 
from anywhere other than PPP0 (ie inside the network) is to be DNAT'd to 
1.1.1.2.

So VM B (1.1.1.3) tries to connect to ppp0 (2.2.2.2) on port 443, and 
this is redirected to VM A on 1.1.1.2.

Only under this specific circumstance does the problem occur. I can get 
VM B (1.1.1.3) to talk directly to VM A (1.1.1.2) all day long and there 
is no problem, it's only when VM B tries to talk to ppp0 that there is 
an issue (and it happens within seconds of the initial connection).

All these tests have been performed with VM B being a Windows XP guest. 
Tonight I'll try it with a Linux guest and see if I can make it happen. 
If that works I might be able to come up with some reproducible test 
case for you. I have a desktop machine that has Intel VT extensions, so 
I'll work toward making a portable test case.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Brad Campbell @ 2011-06-08  0:18 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Eric Dumazet, Bart De Schuymer, kvm, linux-mm, linux-kernel,
	netdev, netfilter-devel
In-Reply-To: <4DEEACC3.3030509@trash.net>

On 08/06/11 06:57, Patrick McHardy wrote:
> On 07.06.2011 20:31, Eric Dumazet wrote:
>> On Tuesday, 7 June 2011 at 17:35 +0200, Patrick McHardy wrote:
>>
>>> The main suspects would be NAT and TCPMSS. Did you also try whether
>>> the crash occurs with only one of these rules?
>>>
>>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
>>>> the address the way I was doing it, so that's a no-go for me.
>>>
>>> That's really weird since you're apparently not using any bridge
>>> netfilter features. It shouldn't have any effect besides changing
>>> at which point ip_tables is invoked. How are your network devices
>>> configured (specifically any bridges)?
>>
>> Something in the kernel does
>>
>> u16 *ptr = addr (given by kmalloc())
>>
>> ptr[-1] = 0;
>>
>> Could be an off-by-one error in a memmove()/memcpy() or loop...
>>
>> I can't see a network issue here.
>
> So far me neither, but netfilter appears to trigger the bug.

Would it help if I tried some older kernels? This issue only surfaced 
for me recently as I only installed the VM's in question about 12 weeks 
ago and have only just started really using them in anger. I could try 
reproducing it on progressively older kernels to see if I can find one 
that works and then bisect from there.


* Re: [PATCHv2 RFC 4/4] Revert "virtio: make add_buf return capacity remaining:
From: Rusty Russell @ 2011-06-08  0:19 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Carsten Otte, Christian Borntraeger, linux390, Martin Schwidefsky,
	Heiko Carstens, Shirley Ma, lguest, virtualization, netdev,
	linux-s390, kvm, Krishna Kumar, Tom Lendacky, steved, habanero
In-Reply-To: <20110607155457.GA17436@redhat.com>

On Tue, 7 Jun 2011 18:54:57 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Jun 02, 2011 at 06:43:25PM +0300, Michael S. Tsirkin wrote:
> > This reverts commit 3c1b27d5043086a485f8526353ae9fe37bfa1065.
> > The only user was virtio_net, and it switched to
> > min_capacity instead.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> It turns out another place in virtio_net: receive
> buf processing - relies on the old behaviour:
> 
> try_fill_recv:
> 	do {
> 		if (vi->mergeable_rx_bufs)
> 			err = add_recvbuf_mergeable(vi, gfp);
> 		else if (vi->big_packets)
> 			err = add_recvbuf_big(vi, gfp);
> 		else
> 			err = add_recvbuf_small(vi, gfp);
> 
> 		oom = err == -ENOMEM;
> 		if (err < 0)
> 			break;
> 		++vi->num;
> 	} while (err > 0);
> 
> The point is to avoid allocating a buf if
> the ring is out of space and we are sure
> add_buf will fail.
> 
> It works well for mergeable buffers and for big
> packets if we are not OOM. small packets and
> oom will do extra get_page/put_page calls
> (but maybe we don't care).
> 
> So this is RX, I intend to drop it from this patchset and focus on the
> TX side for starters.

We could do some hack where we get the capacity, and estimate how many
packets we need to fill it, then try to do that many.

I say hack, because knowing whether we're doing indirect buffers is a
layering violation.  But that's life when you're trying to do
microoptimizations.
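
A rough sketch of the estimate described above, in plain C rather than
kernel code. The helper name and the descs_per_buf parameter are
hypothetical and not part of the virtio API. It assumes every receive
buffer consumes a fixed number of ring descriptors (1 if indirect
descriptors are in use), which is exactly the layering knowledge the
hack would depend on.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a virtio API): given the free descriptor
 * count reported for the ring and the number of descriptors one
 * receive buffer is assumed to consume, estimate how many buffers
 * can be posted before add_buf would be expected to fail.
 */
static unsigned int estimate_bufs_to_post(unsigned int capacity,
					  unsigned int descs_per_buf)
{
	assert(descs_per_buf > 0);
	return capacity / descs_per_buf;
}
```

The fill loop would then try that many buffers instead of testing the
return value of each add for remaining capacity.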

Cheers,
Rusty.


* Re: [RFC] should we care of COMPAT mode in bridge ?
From: David Miller @ 2011-06-08  0:27 UTC (permalink / raw)
  To: shemminger; +Cc: eric.dumazet, netdev
In-Reply-To: <20110607165134.3e20bb9b@s6510.ftrdhcpuser.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 7 Jun 2011 16:51:34 -0700

> The problem is that most of the other ioctl's won't work because of use of
> SIOCDEVPRIVATE.

As we discussed we can pass SIOCDEVPRIVATE requests down to the driver
just like we do for all kinds of other compat ioctls.


* Re: [PATCH 1/2] vlan: only create special VLAN 0 once
From: Jesse Gross @ 2011-06-08  1:25 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, kaber, netdev, pedro.netdev
In-Reply-To: <20110607161808.GA5018@midget.suse.cz>

On Tue, Jun 7, 2011 at 9:18 AM, Jiri Bohac <jbohac@suse.cz> wrote:
> Hi David,
>
> On Sun, Jun 05, 2011 at 02:28:23PM -0700, David Miller wrote:
>> From: Jiri Bohac <jbohac@suse.cz>
>> Date: Fri, 3 Jun 2011 22:07:38 +0200
>>
>> > Commit ad1afb00 registers a VLAN with vid == 0 for every device to handle
>> > 802.1p frames.  This is currently done on every NETDEV_UP event and the special
>> > vlan is never unregistered.  This may have strange effects on drivers
>> > implementing ndo_vlan_rx_add_vid(). E.g. bonding will allocate a linked-list
>> > element each time, causing a memory leak.
>> >
>> > Only register the special VLAN once on NETDEV_REGISTER.
>> >
>> > Signed-off-by: Jiri Bohac <jbohac@suse.cz>
>>
>> I recognize the problem, but this solution isn't all that good.
>>
>> I am pretty sure that the hardware device driver methods that
>> implement ndo_vlan_rx_add_vid() assume that the device is up.
>> Because most drivers completely reset the chip when the
>> interface is brought up and this will likely clear out the
>> VLAN ID tables in the chip.
>
> Really? In that case, we have a much bigger problem: the vlan
> code allows registering a new vlan on an interface that is down.
> And it only registers the VID with ndo_vlan_rx_add_vid() in
> register_vlan_dev() during the registration of the new vlan
> interface -- it never re-registers the VIDs on a NETDEV_UP.

No, it's not true.  All drivers store the registered vlan filters in
some way so that they can restore them when the device is reset.  This
is currently done in one of two ways: storing a bitmap or iterating
over the devices currently registered in a group.

The vlan code is moving away from directly accessing groups and no new
drivers do this.  In fact, once all drivers are converted over groups
will not even be registered on devices.  This is because otherwise
there is quite a bit of vlan code in each driver, which leads to
inconsistent behavior and bugs.

Really, all a driver needs to know is whether it should add a given
vlan to its table, not what the upper layers plan to do with it.  So
when ndo_vlan_rx_add_vid() is called it should add it to its CAM table
and store it if it is needed to restore behavior after a reset, just
as is done with all other configuration state.
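
To illustrate the bitmap variant, here is a minimal user-space sketch.
It is not taken from any real driver: struct fake_nic, hw_table and
both function names are made up for the example, and the "hardware"
is just an array that a reset wipes.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define VLAN_N_VID	4096
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define VID_WORDS	(VLAN_N_VID / BITS_PER_WORD)

/* Hypothetical driver state: software bitmap plus simulated CAM table. */
struct fake_nic {
	unsigned long active_vlans[VID_WORDS];	/* survives a reset */
	unsigned char hw_table[VLAN_N_VID];	/* wiped by a reset */
};

/* Stand-in for a driver's ndo_vlan_rx_add_vid() implementation. */
static void nic_vlan_rx_add_vid(struct fake_nic *nic, unsigned int vid)
{
	if (vid >= VLAN_N_VID)
		return;
	/* Remember the VID in software, then program the table. */
	nic->active_vlans[vid / BITS_PER_WORD] |= 1UL << (vid % BITS_PER_WORD);
	nic->hw_table[vid] = 1;
}

/* Simulated chip reset followed by restoring the stored filters. */
static void nic_reset(struct fake_nic *nic)
{
	unsigned int vid;

	memset(nic->hw_table, 0, sizeof(nic->hw_table));
	for (vid = 0; vid < VLAN_N_VID; vid++)
		if (nic->active_vlans[vid / BITS_PER_WORD] &
		    (1UL << (vid % BITS_PER_WORD)))
			nic->hw_table[vid] = 1;
}
```

With state kept this way it does not matter whether the VID was added
while the interface was up or down; the reset path replays it.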


* [af-packet 0/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: Chetan Loke @ 2011-06-08  3:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, kaber, johann.baudy, Chetan Loke

Hello,

Please review the patchset. Any feedback is appreciated.

The patch set is intended to:
a) demonstrate the improvements
b) gather suggestions

This patch attempts to:
1) Improve network capture visibility by increasing packet density
2) Assist in analyzing multiple (aggregated) capture ports.

Benefits:
  B1) ~15-20% reduction in cpu-usage.
  B2) ~20% increase in packet capture rate.
  B3) ~2x  increase in packet density.
  B4) Port aggregation analysis.
  B5) Non static frame size to capture entire packet payload.


With the current af_packet->rx::mmap based approach, the element size
in the block needs to be statically configured. Nothing wrong with this
config/implementation. But the traffic profile cannot be known in advance.
And so it would be nice if that configuration wasn't static. Normally,
one would configure the element-size to be '2048' so that you can at least
capture the entire 'MTU-size'. But if the traffic profile varies then we
would end up either i) wasting memory or ii) getting a sliced frame.
In other words the packet density will be much less in the first case.
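
The density argument can be made concrete with a small calculation.
This is only an illustration: the header length and alignment are
free parameters of the sketch, not values from the patch, the block
header is ignored, and the resulting numbers are not meant to
reproduce the measured ~2x figure.

```c
#include <assert.h>

/* Slots per block when every slot has a fixed, preconfigured size. */
static unsigned int fixed_slots_per_block(unsigned int block_size,
					  unsigned int frame_size)
{
	assert(frame_size > 0);
	return block_size / frame_size;
}

/*
 * Packets per block when packets are laid out back to back, each one
 * taking its header plus payload rounded up to 'align' bytes.
 * 'hdr_len' and 'align' are illustrative parameters only.
 */
static unsigned int packed_pkts_per_block(unsigned int block_size,
					  unsigned int hdr_len,
					  unsigned int pkt_len,
					  unsigned int align)
{
	unsigned int slot;

	assert(align > 0);
	slot = (hdr_len + pkt_len + align - 1) / align * align;
	return block_size / slot;
}
```

For a 1MB block, 2048-byte fixed slots hold 512 packets whatever their
size, while back-to-back packing of 64-byte packets with an assumed
64-byte header and 8-byte alignment fits 8192.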

--------------------
Performance results:
--------------------

Tpacket config(same on Physical/Virtual setup):
64 blocks(1MB block size)

**************
Physical setup
**************

pktgen: 64 byte traffic.

1G Intel
driver: igb
version: 2.1.0-k2
firmware-version: 3.19-0


Tpacket          V1             V3
capture-rate     600K pps       720K pps
cpu usage        70%            53%
Drop-rate        7-10%          ~1%

**********************
Virtual Machine setup:
**********************

pktgen: 64 byte traffic,40M packets(clone_skb <40000000>)

Worker VMs(FC12):
3 VMs:VM0 .. VM2, each sending 40M packets.

probe-VM(FC15): 1-vCPU/512MB memory
running patched kernel


Tpacket          V1             V3
capture-rate     700-800K pps   1M pps
cpu usage        50%            ~30%
Drop-rate        9-10%          <1%


Plus, in the VM setup, V3 sees/captures around 5-10% more traffic than V1/V2.

------------
Enhancement:
------------
E1) Enhanced tpacket_rcv so that it can dump/copy the packets one after another.
E2) Also implemented basic timeout mechanism to close 'a' current block.
    That way, user-space won't be blocked forever on an idle link.
    This is a much needed feature while monitoring multiple ports.
    Look at 3) below.

-------------------------------
Why is such enhancement needed?
-------------------------------
1) Well, spin-waiting/polling on a per-packet basis to see if it's ready
   to be consumed does not scale while monitoring multiple ports.
   poll() is not performance friendly either.
2) Also, typically a user-space packet capture interface hands multiple
   packets to another user-space protocol-decoder.

   ----------------
   protocol-decoder
          T2
   ----------------
    =============
    ship pkts
    =============
           ^
           |
           v
   -----------------
   pkt-capture logic
           T1
   -----------------
   ================
     nic/sock IF
   ================
           ^
           |
           V

T1 and T2 are user-space threads. If the hand-off between T1 and T2
happens on a per-pkt basis then the solution does NOT scale.

However, one can argue that T1 can coalesce packets and then pass off a
single chunk to T2. But T1's packet consumption granularity is still at
an individual packet level and that is something that needs to be
addressed to avoid excessive polling.


3) Port aggregation analysis:
   Multiple ports are viewed/analyzed as one logical pipe.
   Example:
   3.1) up-stream    path can be tapped in eth1
   3.2) down-stream  path can be tapped in eth2
   3.3) Network TAP splits Rx/Tx paths and then feeds to eth1,eth2.

   If both eth1,eth2 need to be viewed as one logical channel,
   then that implies we need to timesort the packets as they come across
   eth1,eth2.

   3.4) But the following issues further complicate the problem:
        3.4.1) What if one stream is bursty and the other is flowing
               at line rate?
        3.4.2) How long do we wait before we can actually make a
               decision in the app-space and bail out from the spin-wait?

   Solution:
   3.5) Once we receive a block from multiple ports, we can compare
        the timestamps from the block-descriptor and then easily time sort
        the packets and feed them to the decoders.
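
   A user-space sketch of the time sort in 3.5), assuming one block
   from each port is already available and that packets inside a
   block are in arrival order. struct pkt_ts and merge_blocks() are
   hypothetical names for this example, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet record; only the timestamp matters here. */
struct pkt_ts {
	uint32_t sec;
	uint32_t nsec;
	int port;	/* which capture port delivered the packet */
};

static int ts_before_or_equal(const struct pkt_ts *a, const struct pkt_ts *b)
{
	if (a->sec != b->sec)
		return a->sec < b->sec;
	return a->nsec <= b->nsec;
}

/*
 * Merge two blocks whose packets are already in arrival order into one
 * time-sorted stream. 'out' must have room for na + nb entries.
 */
static size_t merge_blocks(const struct pkt_ts *a, size_t na,
			   const struct pkt_ts *b, size_t nb,
			   struct pkt_ts *out)
{
	size_t i = 0, j = 0, k = 0;

	while (i < na && j < nb)
		out[k++] = ts_before_or_equal(&a[i], &b[j]) ? a[i++] : b[j++];
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];
	return k;
}
```

   Because each input is already ordered, a single linear pass is
   enough; no general-purpose sort is needed.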


Please don't shoot the patchset because of its size :).



Chetan Loke (2):

 include/linux/if_packet.h |  127 +++++++-
 net/packet/af_packet.c    |  878 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 957 insertions(+), 48 deletions(-)

-- 
1.7.5.2



* [af-packet 1/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: Chetan Loke @ 2011-06-08  3:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, kaber, johann.baudy, Chetan Loke,
	Chetan Loke
In-Reply-To: <1307502786-1396-1-git-send-email-loke.chetan@gmail.com>

Added TPACKET_V3 definitions

Signed-off-by: Chetan Loke <lokec@ccs.neu.edu>
---
 include/linux/if_packet.h |  127 ++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 126 insertions(+), 1 deletions(-)

diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 72bfa5a..9e4eea1 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -24,7 +24,7 @@ struct sockaddr_ll {
 #define PACKET_HOST		0		/* To us		*/
 #define PACKET_BROADCAST	1		/* To all		*/
 #define PACKET_MULTICAST	2		/* To group		*/
-#define PACKET_OTHERHOST	3		/* To someone else 	*/
+#define PACKET_OTHERHOST	3		/* To someone else	*/
 #define PACKET_OUTGOING		4		/* Outgoing of any type */
 /* These ones are invisible by user level */
 #define PACKET_LOOPBACK		5		/* MC/BRD frame looped back */
@@ -55,6 +55,17 @@ struct tpacket_stats {
 	unsigned int	tp_drops;
 };
 
+struct tpacket_stats_v3 {
+	unsigned int	tp_packets;
+	unsigned int	tp_drops;
+	unsigned int	tp_freeze_q_cnt;
+};
+
+union tpacket_stats_u {
+	struct tpacket_stats stats1;
+	struct tpacket_stats_v3 stats3;
+};
+
 struct tpacket_auxdata {
 	__u32		tp_status;
 	__u32		tp_len;
@@ -70,6 +81,7 @@ struct tpacket_auxdata {
 #define TP_STATUS_COPY		0x2
 #define TP_STATUS_LOSING	0x4
 #define TP_STATUS_CSUMNOTREADY	0x8
+#define TP_STATUS_BLK_TMO	0x10
 
 /* Tx ring - header status */
 #define TP_STATUS_AVAILABLE	0x0
@@ -102,11 +114,111 @@ struct tpacket2_hdr {
 	__u16		tp_vlan_tci;
 };
 
+struct tpacket3_hdr {
+	__u32		tp_status;
+	__u32		tp_len;
+	__u32		tp_snaplen;
+	__u16		tp_mac;
+	__u16		tp_net;
+	__u32		tp_sec;
+	__u32		tp_nsec;
+	__u16		tp_vlan_tci;
+	__u32		tp_next_offset;
+};
+
+struct bd_ts {
+	unsigned int ts_sec;
+	union {
+		struct {
+			unsigned int ts_usec;
+		};
+		struct {
+			unsigned int ts_nsec;
+		};
+	};
+} __attribute__ ((__packed__));
+
+struct bd_v1 {
+	/*
+	 * If you re-order the first 5 fields then
+	 * the BLOCK_XXX macros will NOT work.
+	 */
+	__u32	block_status;
+	__u32	num_pkts;
+	__u32	offset_to_first_pkt;
+
+	/* Number of valid bytes (including padding)
+	 * blk_len <= tp_block_size
+	 */
+	__u32	blk_len;
+
+	/*
+	 * Quite a few uses of sequence number:
+	 * 1. Make sure cache flush etc worked.
+	 *    Well, one can argue - why not use the increasing ts below?
+	 *    But look at 2. below first.
+	 * 2. When you pass around blocks to other user space decoders,
+	 *    you can see which blk[s] is[are] outstanding etc.
+	 * 3. Validate kernel code.
+	 */
+	__u64	seq_num;
+
+	/*
+	 * ts_last_pkt:
+	 *
+	 * Case 1.	Block has 'N'(N >=1) packets and TMO'd(timed out)
+	 *		ts_last_pkt == 'time-stamp of last packet' and NOT the
+	 *		time when the timer fired and the block was closed.
+	 *		By providing the ts of the last packet we can absolutely
+	 *		guarantee that time-stamp wise, the first packet in the next
+	 *		block will never precede the last packet of the previous
+	 *		block.
+	 * Case 2.	Block has zero packets and TMO'd
+	 *		ts_last_pkt = time when the timer fired and the block
+	 *		was closed.
+	 * Case 3.	Block has 'N' packets and NO TMO.
+	 *		ts_last_pkt = time-stamp of the last pkt in the block.
+	 *
+	 * ts_first_pkt:
+	 *		Is always the time-stamp when the block was opened.
+	 *		Case a)	ZERO packets
+	 *			No packets to deal with but at least you know the
+	 *			time-interval of this block.
+	 *		Case b) Non-zero packets
+	 *			Use the ts of the first packet in the block.
+	 *
+	 */
+	struct bd_ts	ts_first_pkt;
+	struct bd_ts	ts_last_pkt;
+} __attribute__ ((__packed__));
+
+struct block_desc {
+	__u16 version;
+	union {
+		struct {
+			__u32	words[4];
+			__u64	dword;
+		} __attribute__ ((__packed__));
+		struct bd_v1 bd1;
+	};
+} __attribute__ ((__packed__));
+
+
+
 #define TPACKET2_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket2_hdr)) + sizeof(struct sockaddr_ll))
+#define TPACKET3_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
+
+#define BLOCK_STATUS(x)	((x)->words[0])
+#define BLOCK_NUM_PKTS(x)	((x)->words[1])
+#define BLOCK_O2FP(x)		((x)->words[2])
+#define BLOCK_LEN(x)		((x)->words[3])
+#define BLOCK_SNUM(x)		((x)->dword)
+
 
 enum tpacket_versions {
 	TPACKET_V1,
 	TPACKET_V2,
+	TPACKET_V3,
 };
 
 /*
@@ -129,6 +241,19 @@ struct tpacket_req {
 	unsigned int	tp_frame_nr;	/* Total number of frames */
 };
 
+struct tpacket_req3 {
+	unsigned int	tp_block_size;	/* Minimal size of contiguous block */
+	unsigned int	tp_block_nr;	/* Number of blocks */
+	unsigned int	tp_frame_size;	/* Size of frame */
+	unsigned int	tp_frame_nr;	/* Total number of frames */
+	unsigned int	tp_retire_blk_tov; /* timeout in msecs */
+};
+
+union tpacket_req_u {
+	struct tpacket_req	req;
+	struct tpacket_req3	req3;
+};
+
 struct packet_mreq {
 	int		mr_ifindex;
 	unsigned short	mr_type;
-- 
1.7.5.2
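
For readers wondering how user space might consume these definitions,
here is a simplified sketch of walking the packets in one block. The
demo_* structures are cut-down stand-ins for block_desc/bd_v1 and
tpacket3_hdr, not the real layouts, and the sketch assumes that
tp_next_offset is the byte distance from one packet header to the
next, with 0 marking the last packet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the block descriptor (not the real layout). */
struct demo_block_hdr {
	uint32_t num_pkts;
	uint32_t offset_to_first_pkt;
};

/* Simplified stand-in for tpacket3_hdr (not the real layout). */
struct demo_pkt_hdr {
	uint32_t tp_snaplen;
	uint32_t tp_next_offset;	/* assumed: 0 means last packet */
};

/* Visit every packet in one block and return the total captured bytes. */
static uint64_t demo_walk_block(const unsigned char *block)
{
	struct demo_block_hdr bh;
	struct demo_pkt_hdr ph;
	const unsigned char *p;
	uint64_t bytes = 0;
	uint32_t i;

	memcpy(&bh, block, sizeof(bh));
	p = block + bh.offset_to_first_pkt;
	for (i = 0; i < bh.num_pkts; i++) {
		memcpy(&ph, p, sizeof(ph));
		bytes += ph.tp_snaplen;
		if (ph.tp_next_offset == 0)
			break;
		p += ph.tp_next_offset;
	}
	return bytes;
}
```

The point of the layout is that one status check on the block header
covers all num_pkts packets, instead of one check per frame.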



* [af-packet 2/2] Enhance af-packet to provide (near zero)lossless packet capture functionality
From: Chetan Loke @ 2011-06-08  3:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, kaber, johann.baudy, Chetan Loke,
	Chetan Loke
In-Reply-To: <1307502786-1396-1-git-send-email-loke.chetan@gmail.com>

1) Blocks can now be configured with non-static frame format.
   Non-static frame format provides following benefits:
   1.1) Increases packet density by a factor of 2x.
   1.2) Ability to capture entire packet.
   1.3) Captures 99% 64-byte traffic as seen by the kernel.
2) Read/poll is now at a block-level rather than at packet level.
3) Added user-configurable timeout knob for timing out blocks on slow/bursty links.
4) Block level processing now allows monitoring multiple links as a single
   logical pipe.
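
As a back-of-the-envelope aid for choosing the timeout in 3), the time
needed to fill one block at line rate can be estimated as below. The
helper name is hypothetical; prb_calc_retire_blk_tmo() in this patch
does a coarser version of the same arithmetic in whole milliseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: microseconds needed to fill one block at line
 * rate. A 1 Mbit/s link moves one bit per microsecond, so the result
 * is simply the block size in bits divided by the speed in Mbit/s.
 */
static uint64_t block_fill_time_usec(uint64_t block_bytes, uint64_t link_mbps)
{
	assert(link_mbps > 0);
	return (block_bytes * 8) / link_mbps;
}
```

A 1MB block on a 1 Gbit/s link gives roughly 8.4 ms, in line with the
~8 ms figure used in the timer comments of the patch, and about a
tenth of that at 10 Gbit/s.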

Changes:
C1) tpacket_rcv()
    C1.1) packet_current_frame() is replaced by packet_current_rx_frame()
          The bulk of the processing is then moved in the following chain:
          packet_current_rx_frame()
            __packet_lookup_frame_in_block
              fill_curr_block()
              or
                retire_current_block
                dispatch_next_block
              or
              return NULL(queue is plugged/paused)

Signed-off-by: Chetan Loke <lokec@ccs.neu.edu>
---
 net/packet/af_packet.c |  878 +++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 831 insertions(+), 47 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 91cb1d7..e0a1314 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -40,6 +40,9 @@
  *					byte arrays at the end of sockaddr_ll
  *					and packet_mreq.
  *		Johann Baudy	:	Added TX RING.
+ *		Chetan Loke	:	Implemented TPACKET_V3 block abstraction
+ *					layer. Copyright (C) 2011, <lokec@ccs.neu.edu>
+ *
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -161,9 +164,49 @@ struct packet_mreq_max {
 	unsigned char	mr_address[MAX_ADDR_LEN];
 };
 
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		int closing, int tx_ring);
 
+
+#define V3_ALIGNMENT	(4)
+#define ALIGN_4(x)	(((x)+V3_ALIGNMENT-1)&~(V3_ALIGNMENT-1))
+
+#define BLK_HDR_LEN	(sizeof(struct block_desc))
+
+/* kbdq - kernel block descriptor queue */
+struct kbdq_core {
+	struct pgv	*pkbdq;
+	unsigned int	hdrlen;
+	unsigned char	reset_pending_on_curr_blk;
+	unsigned char   delete_blk_timer;
+	unsigned short	kactive_blk_num;
+	unsigned short	hole_bytes_size;
+	char		*pkblk_start;
+	char		*pkblk_end;
+	int		kblk_size;
+	unsigned int	knum_blocks;
+	uint64_t	knxt_seq_num;
+	char		*prev;
+	char		*nxt_offset;
+
+	/* last_kactive_blk_num:
+	 * trick to see if user-space has caught up
+	 * in order to avoid refreshing timer when every single pkt arrives.
+	 */
+	unsigned short	last_kactive_blk_num;
+
+	atomic_t	blk_fill_in_prog;
+
+	/* Default is set to 8ms */
+#define DEFAULT_PRB_RETIRE_TOV	(8)
+
+	unsigned short  retire_blk_tov;
+	unsigned long	tov_in_jiffies;
+
+	/* timer to retire an outstanding block */
+	struct timer_list retire_blk_timer;
+};
+
 #define PGV_FROM_VMALLOC 1
 struct pgv {
 	char *buffer;
@@ -180,18 +223,36 @@ struct packet_ring_buffer {
 	unsigned int		pg_vec_pages;
 	unsigned int		pg_vec_len;
 
+	struct kbdq_core	prb_bdqc;
 	atomic_t		pending;
 };
 
 struct packet_sock;
 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 
+static void *packet_previous_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status);
+static void packet_increment_head(struct packet_ring_buffer *buff);
+static int prb_curr_blk_in_use(struct kbdq_core *,
+			struct block_desc *);
+static void *prb_dispatch_next_block(struct kbdq_core *,
+			struct packet_sock *);
+static void prb_retire_current_block(struct kbdq_core *,
+		struct packet_sock *, unsigned int status);
+static int prb_queue_frozen(struct kbdq_core *);
+static void prb_open_block(struct kbdq_core *, struct block_desc *);
+static void prb_retire_rx_blk_timer_expired(unsigned long);
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *);
+static void prb_init_blk_timer(struct packet_sock *, struct kbdq_core *,
+				void (*func) (unsigned long));
 static void packet_flush_mclist(struct sock *sk);
 
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
 	struct tpacket_stats	stats;
+	union  tpacket_stats_u	stats_u;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
 	int			copy_thresh;
@@ -223,6 +284,19 @@ struct packet_skb_cb {
 
 #define PACKET_SKB_CB(__skb)	((struct packet_skb_cb *)((__skb)->cb))
 
+#define GET_PBDQC_FROM_RB(x)	((struct kbdq_core *)(&(x)->prb_bdqc))
+
+#define GET_PBLOCK_DESC(x, bid)	((struct block_desc *)((x)->pkbdq[(bid)].buffer))
+
+#define GET_CURR_PBLOCK_DESC_FROM_CORE(x)	\
+	((struct block_desc *)((x)->pkbdq[(x)->kactive_blk_num].buffer))
+
+
+#define GET_NEXT_PRB_BLK_NUM(x) \
+	(((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? \
+	((x)->kactive_blk_num+1) : 0)
+
+
 static inline __pure struct page *pgv_to_page(void *addr)
 {
 	if (is_vmalloc_addr(addr))
@@ -248,8 +322,11 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		h.h2->tp_status = status;
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		break;
+	case TPACKET_V3:
 	default:
-		pr_err("TPACKET version not supported\n");
+		pr_err("<%s> TPACKET version not supported. Who is calling? "
+		       "Dumping stack.\n", __func__);
+		dump_stack();
 		BUG();
 	}
 
@@ -274,8 +351,11 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
 	case TPACKET_V2:
 		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
 		return h.h2->tp_status;
+	case TPACKET_V3:
 	default:
-		pr_err("TPACKET version not supported\n");
+		pr_err("<%s> TPACKET version:%d not supported. "
+		       "Dumping stack.\n", __func__, po->tp_version);
+		dump_stack();
 		BUG();
 		return 0;
 	}
@@ -312,6 +392,617 @@ static inline void *packet_current_frame(struct packet_sock *po,
 	return packet_lookup_frame(po, rb, rb->head, status);
 }
 
+static void prb_del_retire_blk_timer(struct kbdq_core *pkc)
+{
+	del_timer_sync(&pkc->retire_blk_timer);
+}
+
+static void prb_shutdown_retire_blk_timer(struct packet_sock *po,
+		int tx_ring,
+		struct sk_buff_head *rb_queue)
+{
+	struct kbdq_core *pkc;
+
+	pkc = tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+
+	spin_lock(&rb_queue->lock);
+	pkc->delete_blk_timer = 1;
+	spin_unlock(&rb_queue->lock);
+
+	prb_del_retire_blk_timer(pkc);
+}
+
+static void prb_init_blk_timer(struct packet_sock *po,
+		struct kbdq_core *pkc,
+		void (*func) (unsigned long))
+{
+	init_timer(&pkc->retire_blk_timer);
+	pkc->retire_blk_timer.data = (long)po;
+	pkc->retire_blk_timer.function = func;
+	pkc->retire_blk_timer.expires = jiffies;
+}
+
+static void prb_setup_retire_blk_timer(struct packet_sock *po, int tx_ring)
+{
+	struct kbdq_core *pkc;
+
+	if (tx_ring)
+		BUG();
+
+	pkc = tx_ring ? &po->tx_ring.prb_bdqc : &po->rx_ring.prb_bdqc;
+	prb_init_blk_timer(po, pkc, prb_retire_rx_blk_timer_expired);
+}
+
+static int prb_calc_retire_blk_tmo(struct packet_sock *po,
+				int blk_size_in_bytes)
+{
+	struct net_device *dev;
+	unsigned int mbits = 0, msec = 0, div = 0, tmo = 0;
+
+	dev = dev_get_by_index(sock_net(&po->sk), po->ifindex);
+	if (unlikely(dev == NULL))
+		return DEFAULT_PRB_RETIRE_TOV;
+
+	if (dev->ethtool_ops && dev->ethtool_ops->get_settings) {
+		struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET, };
+
+		if (!dev->ethtool_ops->get_settings(dev, &ecmd)) {
+			switch (ecmd.speed) {
+			case SPEED_10000:
+				msec = 1;
+				div = 10000/1000;
+				break;
+			case SPEED_1000:
+				msec = 1;
+				div = 1000/1000;
+				break;
+			/*
+			 * If the link speed is so slow you don't really
+			 * need to worry about perf anyways
+			 */
+			case SPEED_100:
+			case SPEED_10:
+			default:
+				return DEFAULT_PRB_RETIRE_TOV;
+			}
+		}
+	}
+
+	mbits = (blk_size_in_bytes * 8) / (1024 * 1024);
+
+	if (div)
+		mbits /= div;
+
+	tmo = mbits * msec;
+
+	if (div)
+		return tmo+1;
+	return tmo;
+}
+
+static void init_prb_bdqc(struct packet_sock *po,
+			struct packet_ring_buffer *rb,
+			struct pgv *pg_vec,
+			union tpacket_req_u *req_u, int tx_ring)
+{
+	struct kbdq_core *p1 = &rb->prb_bdqc;
+	struct block_desc *pbd;
+
+	memset(p1, 0x0, sizeof(*p1));
+	p1->knxt_seq_num = 1;
+	p1->pkbdq = pg_vec;
+	pbd = (struct block_desc *)pg_vec[0].buffer;
+	p1->pkblk_start	= (char *)pg_vec[0].buffer;
+	p1->kblk_size = req_u->req3.tp_block_size;
+	p1->knum_blocks	= req_u->req3.tp_block_nr;
+	p1->hdrlen = po->tp_hdrlen;
+	pbd->version = po->tp_version;
+	p1->last_kactive_blk_num = 0;
+	po->stats_u.stats3.tp_freeze_q_cnt = 0;
+	if (req_u->req3.tp_retire_blk_tov)
+		p1->retire_blk_tov = req_u->req3.tp_retire_blk_tov;
+	else
+		p1->retire_blk_tov = prb_calc_retire_blk_tmo(po,
+						req_u->req3.tp_block_size);
+	p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov);
+	prb_setup_retire_blk_timer(po, tx_ring);
+	prb_open_block(p1, pbd);
+}
+
+/*  Do NOT update the last_blk_num first.
+ *  Assumes sk_buff_head lock is held.
+ */
+static void _prb_refresh_rx_retire_blk_timer(struct kbdq_core *pkc)
+{
+	mod_timer(&pkc->retire_blk_timer,
+			jiffies + pkc->tov_in_jiffies);
+	pkc->last_kactive_blk_num = pkc->kactive_blk_num;
+}
+
+/*
+ * Timer logic:
+ * 1) We refresh the timer only when we open a block.
+ *    By doing this we don't waste cycles refreshing the timer
+ *	  on packet-by-packet basis.
+ *
+ * With a 1MB block-size, on a 1Gbps line, it will take
+ * i) ~8 ms to fill a block + ii) memcpy etc.
+ * In this cut we are not accounting for the memcpy time.
+ *
+ * So, if the user sets the 'tmo' to 10ms then the timer
+ * will never fire while the block is still getting filled
+ * (which is what we want). However, the user could choose
+ * to close a block early and that's fine.
+ *
+ * But when the timer does fire, we check whether or not to refresh it.
+ * Since the tmo granularity is in msecs, it is not too expensive
+ * to refresh the timer, let's say every '8' msecs.
+ * Either the user can set the 'tmo' or we can derive it based on
+ * a) line-speed and b) block-size.
+ * prb_calc_retire_blk_tmo() calculates the tmo.
+ *
+ */
+static void prb_retire_rx_blk_timer_expired(unsigned long data)
+{
+	struct packet_sock *po = (struct packet_sock *)data;
+	struct kbdq_core *pkc = &po->rx_ring.prb_bdqc;
+	unsigned int frozen;
+	struct block_desc *pbd;
+
+	spin_lock(&po->sk.sk_receive_queue.lock);
+
+	frozen = prb_queue_frozen(pkc);
+	pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+	if (unlikely(pkc->delete_blk_timer))
+		goto out;
+
+	/* We only need to plug the race when the block is partially filled.
+	 * tpacket_rcv:
+	 *		lock(); increment BLOCK_NUM_PKTS; unlock()
+	 *		copy_bits() is in progress ...
+	 * timer fires on other cpu:
+	 *		we can't retire the current block because copy_bits
+	 *		is in progress.
+	 *
+	 */
+	if (BLOCK_NUM_PKTS(pbd)) {
+		while (atomic_read(&pkc->blk_fill_in_prog)) {
+			/* Waiting for skb_copy_bits to finish... */
+			cpu_relax();
+		}
+	}
+
+	if (pkc->last_kactive_blk_num == pkc->kactive_blk_num) {
+		if (!frozen) {
+			prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO);
+			if (!prb_dispatch_next_block(pkc, po))
+				goto refresh_timer;
+			else
+				goto out;
+		} else {
+			/* Case 1. Queue was frozen because user-space was
+			 *	   lagging behind.
+			 */
+			if (prb_curr_blk_in_use(pkc, pbd)) {
+			       /*
+				* Ok, user-space is still behind.
+				* So just refresh the timer.
+				*/
+				goto refresh_timer;
+			} else {
+			       /* Case 2. queue was frozen, user-space caught up,
+				* now the link went idle && the timer fired.
+				* We don't have a block to close. So we open this
+				* block and restart the timer.
+				* opening a block thaws the queue, restarts timer.
+				* Thawing/timer-refresh is a side effect.
+				*/
+				prb_open_block(pkc, pbd);
+				goto out;
+			}
+		}
+	}
+
+refresh_timer:
+	_prb_refresh_rx_retire_blk_timer(pkc);
+
+out:
+	spin_unlock(&po->sk.sk_receive_queue.lock);
+}
+
+static inline void prb_flush_block(struct kbdq_core *pkc1, struct block_desc *pbd1,
+			__u32 status)
+{
+	/* Flush everything minus the block header */
+
+#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
+	u8 *start, *end;
+
+	start = (u8 *)pbd1;
+
+	/* Skip the block header(we know header WILL fit in 4K) */
+	start += PAGE_SIZE;
+
+	end = (u8 *)PAGE_ALIGN((unsigned long)pkc1->pkblk_end);
+	for (; start < end; start += PAGE_SIZE)
+		flush_dcache_page(pgv_to_page(start));
+
+	smp_wmb();
+#endif
+
+	/* Now update the block status. */
+
+	BLOCK_STATUS(pbd1) = status;
+
+	/* Flush the block header */
+
+#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
+	start = (u8 *)pbd1;
+	flush_dcache_page(pgv_to_page(start));
+
+	smp_wmb();
+#endif
+}
+
+/*
+ * Side effect:
+ *
+ * 1) flush the block
+ * 2) Increment active_blk_num
+ *
+ * Note: We deliberately DON'T refresh the timer here,
+ *	because the next block will almost always be opened.
+ */
+static void prb_close_block(struct kbdq_core *pkc1, struct block_desc *pbd1,
+		struct packet_sock *po, unsigned int stat)
+{
+	__u32 status = TP_STATUS_USER | stat;
+
+	struct tpacket3_hdr *last_pkt;
+	struct bd_v1 *b1 = &pbd1->bd1;
+
+	if (po->stats.tp_drops)
+		status |= TP_STATUS_LOSING;
+
+	last_pkt = (struct tpacket3_hdr *)pkc1->prev;
+	last_pkt->tp_next_offset = 0;
+
+	/* Get the ts of the last pkt */
+	if (BLOCK_NUM_PKTS(pbd1)) {
+		b1->ts_last_pkt.ts_sec = last_pkt->tp_sec;
+		b1->ts_last_pkt.ts_nsec	= last_pkt->tp_nsec;
+	} else {
+		/* Ok, we tmo'd - so get the current time */
+		struct timespec ts;
+		getnstimeofday(&ts);
+		b1->ts_last_pkt.ts_sec = ts.tv_sec;
+		b1->ts_last_pkt.ts_nsec	= ts.tv_nsec;
+	}
+
+	smp_wmb();
+
+	/* Flush the block */
+	prb_flush_block(pkc1, pbd1, status);
+
+	pkc1->kactive_blk_num = GET_NEXT_PRB_BLK_NUM(pkc1);
+}
+
+static inline void prb_thaw_queue(struct kbdq_core *pkc)
+{
+	pkc->reset_pending_on_curr_blk = 0;
+}
+
+/*
+ * Side effect of opening a block:
+ *
+ * 1) prb_queue is thawed.
+ * 2) retire_blk_timer is refreshed.
+ *
+ */
+static void prb_open_block(struct kbdq_core *pkc1, struct block_desc *pbd1)
+{
+	struct timespec ts;
+	struct bd_v1 *b1 = &pbd1->bd1;
+
+	smp_rmb();
+
+	if (likely(TP_STATUS_KERNEL == BLOCK_STATUS(pbd1))) {
+
+		BLOCK_SNUM(pbd1) = pkc1->knxt_seq_num++;
+		BLOCK_NUM_PKTS(pbd1) = 0;
+		BLOCK_LEN(pbd1) = BLK_HDR_LEN;
+		getnstimeofday(&ts);
+		b1->ts_first_pkt.ts_sec = ts.tv_sec;
+		b1->ts_first_pkt.ts_nsec = ts.tv_nsec;
+		pkc1->pkblk_start = (char *)pbd1;
+		pkc1->nxt_offset = (char *)(pkc1->pkblk_start+BLK_HDR_LEN);
+		BLOCK_O2FP(pbd1) = (__u32)BLK_HDR_LEN;
+		pkc1->prev = pkc1->nxt_offset;
+		pkc1->pkblk_end = pkc1->pkblk_start + pkc1->kblk_size;
+		prb_thaw_queue(pkc1);
+		_prb_refresh_rx_retire_blk_timer(pkc1);
+
+		smp_wmb();
+
+		return;
+	}
+
+	pr_err("<%s> ERROR block:%p is NOT FREE status:%d\
+			kactive_blk_num:%d\n",
+			__func__, pbd1, BLOCK_STATUS(pbd1), pkc1->kactive_blk_num);
+	dump_stack();
+	BUG();
+}
+
+/*
+ * Queue freeze logic:
+ * 1) Assume tp_block_nr = 8 blocks.
+ * 2) At time 't0', user opens Rx ring.
+ * 3) Some time past 't0', kernel starts filling blocks starting from 0 .. 7
+ * 4) user-space is either sleeping or processing block '0'.
+ * 5) tpacket_rcv is currently filling block '7', since there is no space left,
+ *    it will close block-7,loop around and try to fill block '0'.
+ *    call-flow:
+ *    __packet_lookup_frame_in_block
+ *      prb_retire_current_block()
+ *      prb_dispatch_next_block()
+ *        |->(BLOCK_STATUS == USER) evaluates to true
+ *    5.1) Since block-0 is currently in-use, we just freeze the queue.
+ * 6) Now there are two cases:
+ *    6.1) Link goes idle right after the queue is frozen.
+ *         But remember, the last open_block() refreshed the timer.
+ *         When this timer expires,it will refresh itself so that we can
+ *         re-open block-0 in near future.
+ *    6.2) Link is busy and keeps on receiving packets. This is a simple
+ *         case and __packet_lookup_frame_in_block will check if block-0
+ *         is free and can now be re-used.
+ */
+static inline void prb_freeze_queue(struct kbdq_core *pkc,
+				  struct packet_sock *po)
+{
+	pkc->reset_pending_on_curr_blk = 1;
+	po->stats_u.stats3.tp_freeze_q_cnt++;
+}
+
+#define TOTAL_PKT_LEN_INCL_ALIGN(length) (ALIGN_4((length)))
+
+/*
+ * If the next block is free then we will dispatch it
+ * and return a good offset.
+ * Else, we will freeze the queue.
+ * So, caller must check the return value.
+ */
+static void *prb_dispatch_next_block(struct kbdq_core *pkc,
+		struct packet_sock *po)
+{
+	struct block_desc *pbd;
+
+	smp_rmb();
+
+	/* 1. Get current block num */
+	pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+	/* 2. If this block is currently in_use then freeze the queue */
+	if (TP_STATUS_USER & BLOCK_STATUS(pbd)) {
+		prb_freeze_queue(pkc, po);
+		return NULL;
+	}
+
+	/*
+	 * 3.
+	 * open this block and return the offset where the first packet
+	 * needs to get stored.
+	 */
+	prb_open_block(pkc, pbd);
+	return (void *)pkc->nxt_offset;
+}
+
+static void prb_retire_current_block(struct kbdq_core *pkc,
+		struct packet_sock *po, unsigned int status)
+{
+	struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+
+	/* retire/close the current block */
+	if (likely(TP_STATUS_KERNEL == BLOCK_STATUS(pbd))) {
+		/*
+		 * Plug the case where copy_bits() is in progress on
+		 * cpu-0 and tpacket_rcv() got invoked on cpu-1, didn't
+		 * have space to copy the pkt in the current block and
+		 * called prb_retire_current_block()
+		 *
+		 * TODO:DURING REVIEW ASK IF THIS IS A VALID RACE.
+		 *	MAIN CONCERN IS ABOUT r[f/p]s THREADS(?) EXECUTING
+		 *	IN PARALLEL.
+		 *
+		 * We don't need to worry about the TMO case because
+		 * the timer-handler already handled this case.
+		 */
+		if (!(status & TP_STATUS_BLK_TMO)) {
+			while (atomic_read(&pkc->blk_fill_in_prog)) {
+				/* Waiting for skb_copy_bits to finish... */
+				cpu_relax();
+			}
+		}
+		prb_close_block(pkc, pbd, po, status);
+		return;
+	}
+
+	pr_err("<%s> ERROR-pbd[%d]:%p.Dumping stack\n",
+			__func__, pkc->kactive_blk_num, pbd);
+	dump_stack();
+	BUG();
+}
+
+static inline int prb_curr_blk_in_use(struct kbdq_core *pkc,
+				      struct block_desc *pbd)
+{
+	return (TP_STATUS_USER & BLOCK_STATUS(pbd));
+}
+
+static inline int prb_queue_frozen(struct kbdq_core *pkc)
+{
+	return pkc->reset_pending_on_curr_blk;
+}
+
+static inline void prb_clear_blk_fill_status(struct packet_ring_buffer *rb)
+{
+	struct kbdq_core *pkc  = GET_PBDQC_FROM_RB(rb);
+	atomic_dec(&pkc->blk_fill_in_prog);
+}
+
+static inline void prb_fill_curr_block(char *curr, struct kbdq_core *pkc,
+				struct block_desc *pbd,
+				unsigned int len)
+{
+	struct tpacket3_hdr *ppd;
+
+	ppd  = (struct tpacket3_hdr *)curr;
+	ppd->tp_next_offset = TOTAL_PKT_LEN_INCL_ALIGN(len);
+	pkc->prev = curr;
+	pkc->nxt_offset += TOTAL_PKT_LEN_INCL_ALIGN(len);
+	BLOCK_LEN(pbd) += TOTAL_PKT_LEN_INCL_ALIGN(len);
+	BLOCK_NUM_PKTS(pbd) += 1;
+	atomic_inc(&pkc->blk_fill_in_prog);
+}
+
+/* Assumes caller has the sk->rx_queue.lock */
+static void *__packet_lookup_frame_in_block(struct packet_ring_buffer *rb,
+					    int status,
+					    unsigned int len,
+					    struct packet_sock *po)
+{
+	struct kbdq_core *pkc  = GET_PBDQC_FROM_RB(rb);
+	struct block_desc *pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+	char *curr, *end;
+
+	/* Queue is frozen when user space is lagging behind */
+	if (prb_queue_frozen(pkc)) {
+		/*
+		 * Check if that last block which caused the queue to freeze,
+		 * is still in_use by user-space.
+		 */
+		if (prb_curr_blk_in_use(pkc, pbd)) {
+			/* Can't record this packet */
+			return NULL;
+		} else {
+			/*
+			 * Ok, the block was released by user-space.
+			 * Now let's open that block.
+			 * opening a block also thaws the queue.
+			 * Thawing is a side effect.
+			 */
+			prb_open_block(pkc, pbd);
+		}
+	}
+
+	smp_mb();
+	curr = pkc->nxt_offset;
+	end = (char *) ((char *)pbd + pkc->kblk_size);
+
+	/* first try the current block */
+	if (curr+TOTAL_PKT_LEN_INCL_ALIGN(len) < end) {
+		prb_fill_curr_block(curr, pkc, pbd, len);
+		return (void *)curr;
+	}
+
+	/* Ok, close the current block */
+	prb_retire_current_block(pkc, po, 0);
+
+	/* Now, try to dispatch the next block */
+	curr = (char *)prb_dispatch_next_block(pkc, po);
+	if (curr) {
+		pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
+		prb_fill_curr_block(curr, pkc, pbd, len);
+		return (void *)curr;
+	}
+
+	/*
+	 * No free blocks are available; user-space hasn't caught up yet.
+	 * The queue was just frozen, so this packet will be dropped.
+	 */
+	return NULL;
+}
+
+static inline void *packet_current_rx_frame(struct packet_sock *po,
+					    struct packet_ring_buffer *rb,
+					    int status, unsigned int len)
+{
+	char *curr = NULL;
+	switch (po->tp_version) {
+	case TPACKET_V1:
+	case TPACKET_V2:
+		curr = packet_lookup_frame(po, rb, rb->head, status);
+		return curr;
+	case TPACKET_V3:
+		return __packet_lookup_frame_in_block(rb, status, len, po);
+	default:
+		pr_err("<%s> TPACKET version:%d not supported\n",
+			__func__, po->tp_version);
+		BUG();
+		return 0;
+	}
+}
+
+static inline void *prb_lookup_block(struct packet_sock *po,
+				     struct packet_ring_buffer *rb,
+				     unsigned int previous,
+				     int status)
+{
+	struct kbdq_core *pkc  = GET_PBDQC_FROM_RB(rb);
+	struct block_desc *pbd = GET_PBLOCK_DESC(pkc, previous);
+
+	if (status != BLOCK_STATUS(pbd))
+		return NULL;
+	return pbd;
+}
+
+static inline int prb_previous_blk_num(struct packet_ring_buffer *rb)
+{
+	unsigned int prev;
+	if (rb->prb_bdqc.kactive_blk_num)
+		prev = rb->prb_bdqc.kactive_blk_num-1;
+	else
+		prev = rb->prb_bdqc.knum_blocks-1;
+	return prev;
+}
+
+/* Assumes caller has held the rx_queue.lock */
+static inline void *__prb_previous_block(struct packet_sock *po,
+					 struct packet_ring_buffer *rb,
+					 int status)
+{
+	unsigned int previous = prb_previous_blk_num(rb);
+	return prb_lookup_block(po, rb, previous, status);
+}
+
+static inline void *packet_previous_rx_frame(struct packet_sock *po,
+					     struct packet_ring_buffer *rb,
+					     int status)
+{
+	if (po->tp_version <= TPACKET_V2)
+		return packet_previous_frame(po, rb, status);
+
+	return __prb_previous_block(po, rb, status);
+}
+
+static inline void packet_increment_rx_head(struct packet_sock *po,
+					    struct packet_ring_buffer *rb)
+{
+	switch (po->tp_version) {
+	case TPACKET_V1:
+	case TPACKET_V2:
+		return packet_increment_head(rb);
+	case TPACKET_V3:
+	default:
+		pr_err("<%s> TPACKET version:%d not supported.\
+			Dumping stack.\n", __func__, po->tp_version);
+		dump_stack();
+		BUG();
+		return;
+	}
+}
+
 static inline void *packet_previous_frame(struct packet_sock *po,
 		struct packet_ring_buffer *rb,
 		int status)
@@ -663,12 +1354,13 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	union {
 		struct tpacket_hdr *h1;
 		struct tpacket2_hdr *h2;
+		struct tpacket3_hdr *h3;
 		void *raw;
 	} h;
 	u8 *skb_head = skb->data;
 	int skb_len = skb->len;
 	unsigned int snaplen, res;
-	unsigned long status = TP_STATUS_LOSING|TP_STATUS_USER;
+	unsigned long status = TP_STATUS_USER;
 	unsigned short macoff, netoff, hdrlen;
 	struct sk_buff *copy_skb = NULL;
 	struct timeval tv;
@@ -714,37 +1406,46 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 			po->tp_reserve;
 		macoff = netoff - maclen;
 	}
-
-	if (macoff + snaplen > po->rx_ring.frame_size) {
-		if (po->copy_thresh &&
-		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
-		    (unsigned)sk->sk_rcvbuf) {
-			if (skb_shared(skb)) {
-				copy_skb = skb_clone(skb, GFP_ATOMIC);
-			} else {
-				copy_skb = skb_get(skb);
-				skb_head = skb->data;
+	if (po->tp_version <= TPACKET_V2) {
+		if (macoff + snaplen > po->rx_ring.frame_size) {
+			if (po->copy_thresh &&
+				atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
+				(unsigned)sk->sk_rcvbuf) {
+				if (skb_shared(skb)) {
+					copy_skb = skb_clone(skb, GFP_ATOMIC);
+				} else {
+					copy_skb = skb_get(skb);
+					skb_head = skb->data;
+				}
+				if (copy_skb)
+					skb_set_owner_r(copy_skb, sk);
 			}
-			if (copy_skb)
-				skb_set_owner_r(copy_skb, sk);
+			snaplen = po->rx_ring.frame_size - macoff;
+			if ((int)snaplen < 0)
+				snaplen = 0;
 		}
-		snaplen = po->rx_ring.frame_size - macoff;
-		if ((int)snaplen < 0)
-			snaplen = 0;
 	}
-
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_current_frame(po, &po->rx_ring, TP_STATUS_KERNEL);
+	h.raw = packet_current_rx_frame(po, &po->rx_ring,
+					TP_STATUS_KERNEL, (macoff+snaplen));
 	if (!h.raw)
 		goto ring_is_full;
-	packet_increment_head(&po->rx_ring);
+	if (po->tp_version <= TPACKET_V2) {
+		packet_increment_rx_head(po, &po->rx_ring);
+		/*
+		 * LOSING is reported until the stats are read, because
+		 * the counters are COR (Clear On Read).
+		 * Set it here for V1/V2 only; V3 doesn't need this
+		 * at packet level.
+		 */
+		if (po->stats.tp_drops)
+			status |= TP_STATUS_LOSING;
+	}
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
 		__skb_queue_tail(&sk->sk_receive_queue, copy_skb);
 	}
-	if (!po->stats.tp_drops)
-		status &= ~TP_STATUS_LOSING;
 	spin_unlock(&sk->sk_receive_queue.lock);
 
 	skb_copy_bits(skb, 0, h.raw + macoff, snaplen);
@@ -789,6 +1490,30 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		h.h2->tp_vlan_tci = vlan_tx_tag_get(skb);
 		hdrlen = sizeof(*h.h2);
 		break;
+	case TPACKET_V3:
+		/* tp_nxt_offset is already populated above.
+		 * So DONT clear those fields here
+		 */
+		h.h3->tp_status = status;
+		h.h3->tp_len = skb->len;
+		h.h3->tp_snaplen = snaplen;
+		h.h3->tp_mac = macoff;
+		h.h3->tp_net = netoff;
+		if ((po->tp_tstamp & SOF_TIMESTAMPING_SYS_HARDWARE)
+				&& shhwtstamps->syststamp.tv64)
+			ts = ktime_to_timespec(shhwtstamps->syststamp);
+		else if ((po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE)
+				&& shhwtstamps->hwtstamp.tv64)
+			ts = ktime_to_timespec(shhwtstamps->hwtstamp);
+		else if (skb->tstamp.tv64)
+			ts = ktime_to_timespec(skb->tstamp);
+		else
+			getnstimeofday(&ts);
+		h.h3->tp_sec  = ts.tv_sec;
+		h.h3->tp_nsec = ts.tv_nsec;
+		h.h3->tp_vlan_tci = vlan_tx_tag_get(skb);
+		hdrlen = sizeof(*h.h3);
+		break;
 	default:
 		BUG();
 	}
@@ -803,18 +1528,22 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		sll->sll_ifindex = orig_dev->ifindex;
 	else
 		sll->sll_ifindex = dev->ifindex;
-
-	__packet_set_status(po, h.raw, status);
+	if (po->tp_version <= TPACKET_V2)
+		__packet_set_status(po, h.raw, status);
 	smp_mb();
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
 	{
 		u8 *start, *end;
 
-		end = (u8 *)PAGE_ALIGN((unsigned long)h.raw + macoff + snaplen);
-		for (start = h.raw; start < end; start += PAGE_SIZE)
-			flush_dcache_page(pgv_to_page(start));
+		if (po->tp_version <= TPACKET_V2) {
+			end = (u8 *)PAGE_ALIGN((unsigned long)h.raw + macoff + snaplen);
+			for (start = h.raw; start < end; start += PAGE_SIZE)
+				flush_dcache_page(pgv_to_page(start));
+		}
 	}
 #endif
+	if (po->tp_version > TPACKET_V2)
+		prb_clear_blk_fill_status(&po->rx_ring);
 
 	sk->sk_data_ready(sk, 0);
 
@@ -1291,7 +2020,7 @@ static int packet_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct packet_sock *po;
 	struct net *net;
-	struct tpacket_req req;
+	union tpacket_req_u req_u;
 
 	if (!sk)
 		return 0;
@@ -1318,13 +2047,13 @@ static int packet_release(struct socket *sock)
 
 	packet_flush_mclist(sk);
 
-	memset(&req, 0, sizeof(req));
+	memset(&req_u, 0, sizeof(req_u));
 
 	if (po->rx_ring.pg_vec)
-		packet_set_ring(sk, &req, 1, 0);
+		packet_set_ring(sk, &req_u, 1, 0);
 
 	if (po->tx_ring.pg_vec)
-		packet_set_ring(sk, &req, 1, 1);
+		packet_set_ring(sk, &req_u, 1, 1);
 
 	synchronize_net();
 	/*
@@ -1949,15 +2678,26 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 	case PACKET_RX_RING:
 	case PACKET_TX_RING:
 	{
-		struct tpacket_req req;
+		union tpacket_req_u req_u;
+		int len;
 
-		if (optlen < sizeof(req))
+		switch (po->tp_version) {
+		case TPACKET_V1:
+		case TPACKET_V2:
+			len = sizeof(req_u.req);
+			break;
+		case TPACKET_V3:
+		default:
+			len = sizeof(req_u.req3);
+			break;
+		}
+		if (optlen < len)
 			return -EINVAL;
 		if (pkt_sk(sk)->has_vnet_hdr)
 			return -EINVAL;
-		if (copy_from_user(&req, optval, sizeof(req)))
+		if (copy_from_user(&req_u.req, optval, len))
 			return -EFAULT;
-		return packet_set_ring(sk, &req, 0, optname == PACKET_TX_RING);
+		return packet_set_ring(sk, &req_u, 0, optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -1984,6 +2724,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		switch (val) {
 		case TPACKET_V1:
 		case TPACKET_V2:
+		case TPACKET_V3:
 			po->tp_version = val;
 			return 0;
 		default:
@@ -2082,6 +2823,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	struct packet_sock *po = pkt_sk(sk);
 	void *data;
 	struct tpacket_stats st;
+	union tpacket_stats_u st_u;
 
 	if (level != SOL_PACKET)
 		return -ENOPROTOOPT;
@@ -2094,15 +2836,26 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case PACKET_STATISTICS:
-		if (len > sizeof(struct tpacket_stats))
-			len = sizeof(struct tpacket_stats);
+		if (po->tp_version == TPACKET_V3) {
+			len = sizeof(struct tpacket_stats_v3);
+		} else {
+			if (len > sizeof(struct tpacket_stats))
+				len = sizeof(struct tpacket_stats);
+		}
 		spin_lock_bh(&sk->sk_receive_queue.lock);
-		st = po->stats;
+		if (po->tp_version == TPACKET_V3) {
+			memcpy(&st_u.stats3, &po->stats,
+			sizeof(struct tpacket_stats));
+			st_u.stats3.tp_freeze_q_cnt = po->stats_u.stats3.tp_freeze_q_cnt;
+			st_u.stats3.tp_packets += po->stats.tp_drops;
+			data = &st_u.stats3;
+		} else {
+			st = po->stats;
+			st.tp_packets += st.tp_drops;
+			data = &st;
+		}
 		memset(&po->stats, 0, sizeof(st));
 		spin_unlock_bh(&sk->sk_receive_queue.lock);
-		st.tp_packets += st.tp_drops;
-
-		data = &st;
 		break;
 	case PACKET_AUXDATA:
 		if (len > sizeof(int))
@@ -2143,6 +2896,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		case TPACKET_V2:
 			val = sizeof(struct tpacket2_hdr);
 			break;
+		case TPACKET_V3:
+			val = sizeof(struct tpacket3_hdr);
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -2293,7 +3049,7 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 
 	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (po->rx_ring.pg_vec) {
-		if (!packet_previous_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
+		if (!packet_previous_rx_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
@@ -2412,7 +3168,7 @@ out_free_pgvec:
 	goto out;
 }
 
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		int closing, int tx_ring)
 {
 	struct pgv *pg_vec = NULL;
@@ -2421,7 +3177,17 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 	struct packet_ring_buffer *rb;
 	struct sk_buff_head *rb_queue;
 	__be16 num;
-	int err;
+	int err = -EINVAL;
+	/* Added to minimize code churn */
+	struct tpacket_req *req = &req_u->req;
+
+	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
+	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
+		pr_err("<%s> Tx-ring is not supported on version:%d.\
+			   Dumping stack.\n", __func__, po->tp_version);
+		dump_stack();
+		goto out;
+	}
 
 	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
 	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
@@ -2447,6 +3213,9 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		case TPACKET_V2:
 			po->tp_hdrlen = TPACKET2_HDRLEN;
 			break;
+		case TPACKET_V3:
+			po->tp_hdrlen = TPACKET3_HDRLEN;
+			break;
 		}
 
 		err = -EINVAL;
@@ -2472,6 +3241,17 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		pg_vec = alloc_pg_vec(req, order);
 		if (unlikely(!pg_vec))
 			goto out;
+		switch (po->tp_version) {
+		case TPACKET_V3:
+			/* Transmit path is not supported. We checked
+			 * it above but just being paranoid
+			 */
+			if (!tx_ring)
+				init_prb_bdqc(po, rb, pg_vec, req_u, tx_ring);
+			break;
+		default:
+			break;
+		}
 	}
 	/* Done */
 	else {
@@ -2528,7 +3308,11 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		dev_add_pack(&po->prot_hook);
 	}
 	spin_unlock(&po->bind_lock);
-
+	if (closing && (po->tp_version > TPACKET_V2)) {
+		/* Because we don't support block-based V3 on tx-ring */
+		if (!tx_ring)
+			prb_shutdown_retire_blk_timer(po, tx_ring, rb_queue);
+	}
 	release_sock(sk);
 
 	if (pg_vec)
-- 
1.7.5.2


^ permalink raw reply related

* Re: Bug#629604: linux-image-2.6.38-2-686: sky2 eth0 rx length errors (~5/second)
From: Ben Hutchings @ 2011-06-08  3:26 UTC (permalink / raw)
  To: Kate Gordon; +Cc: 629604, Stephen Hemminger, netdev
In-Reply-To: <20110608015806.3543.25300.reportbug@nomad>

On Wed, 2011-06-08 at 11:58 +1000, Kate Gordon wrote:
> Package: linux-2.6
> Version: 2.6.38-5
> Severity: normal
> 
> On my first ethernet connection after an upgrade from squeeze (2.6.35)

The current kernel package version in squeeze is 2.6.32-34squeeze1, but
that does have the sky2 driver from Linux 2.6.35.

> to wheezy (2.6.38), I started 
> seeing the sky2 errors shown below in the kernel log.  So far my
> internet is still working (I detect
> a slight slowness but that's it).  Looks to be the same as Ubuntu bug 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/719873
> 
> Ethernet controller:
> $ lspci
> <snip>
> Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 22)
> </snip>
> 
> I'm on a MacBook Pro 1,1.
> Please let me know if there's any other pertinent info I can give

The driver should log a message of the form
'Yukon-2 <name> chip revision <number>' at startup.  I think we may need
to know the name and number.

> -- Package-specific info:
> ** Version:
> Linux version 2.6.38-2-686 (Debian 2.6.38-5) (ben@decadent.org.uk) (gcc version 4.4.6 (Debian 4.4.6-3) ) #1 SMP Sun May 8 14:49:45 UTC 2011
> 
> ** Command line:
> BOOT_IMAGE=/boot/vmlinuz-2.6.38-2-686 root=UUID=7e5efdb0-6919-4e56-a96b-45055f82dcf0 ro
> 
> ** Not tainted
> 
> ** Kernel log:
> [ 3118.574281] sky2 0000:02:00.0: eth0: rx error, status 0x5c2300 length 92
> [ 3121.260783] net_ratelimit: 11 callbacks suppressed
> [ 3121.260795] sky2 0000:02:00.0: eth0: rx error, status 0x5c2300 length 92
> [ 3121.366680] sky2 0000:02:00.0: eth0: rx error, status 0x972300 length 151
> [ 3121.763382] sky2 0000:02:00.0: eth0: rx error, status 0x5c2300 length 92
> [ 3122.014215] sky2 0000:02:00.0: eth0: rx error, status 0xd92300 length 217
[...]

This status value indicates a VLAN-tagged packet, so this is probably
related to changes in VLAN support.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.


^ permalink raw reply

* Re: KVM induced panic on 2.6.38[2367] & 2.6.39
From: Eric Dumazet @ 2011-06-08  3:59 UTC (permalink / raw)
  To: Brad Campbell
  Cc: Patrick McHardy, Bart De Schuymer, kvm, linux-mm, linux-kernel,
	netdev, netfilter-devel
In-Reply-To: <4DEEBFC2.4060102@fnarfbargle.com>

Le mercredi 08 juin 2011 à 08:18 +0800, Brad Campbell a écrit :
> On 08/06/11 06:57, Patrick McHardy wrote:
> > On 07.06.2011 20:31, Eric Dumazet wrote:
> >> Le mardi 07 juin 2011 à 17:35 +0200, Patrick McHardy a écrit :
> >>
> >>> The main suspects would be NAT and TCPMSS. Did you also try whether
> >>> the crash occurs with only one of these rules?
> >>>
> >>>> I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
> >>>> the address the way I was doing it, so that's a no-go for me.
> >>>
> >>> That's really weird since you're apparently not using any bridge
> >>> netfilter features. It shouldn't have any effect besides changing
> >>> at which point ip_tables is invoked. How are your network devices
> >>> configured (specifically any bridges)?
> >>
> >> Something in the kernel does
> >>
> >> u16 *ptr = addr (given by kmalloc())
> >>
> >> ptr[-1] = 0;
> >>
> >> Could be an off-by-one error in a memmove()/memcpy() or a loop...
> >>
> >> I can't see a network issue here.
> >
> > So far me neither, but netfilter appears to trigger the bug.
> 
> Would it help if I tried some older kernels? This issue only surfaced 
> for me recently as I only installed the VM's in question about 12 weeks 
> ago and have only just started really using them in anger. I could try 
> reproducing it on progressively older kernels to see if I can find one 
> that works and then bisect from there.

Well, a bisection would definitely help, but it would take a lot of time
in your case.

Could you try the following patch? It disables the 'usual suspect' I had
in mind yesterday:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
+#if 0
 	if (fastpath &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		off = nhead;
 		goto adjust_others;
 	}
-
+#endif
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;



^ permalink raw reply related

* Re: [PATCH 1/3] xfrm: Fix off by one in the replay advance functions
From: David Miller @ 2011-06-08  4:15 UTC (permalink / raw)
  To: steffen.klassert; +Cc: herbert, netdev
In-Reply-To: <20110606064603.GB31505@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Mon, 6 Jun 2011 08:46:03 +0200

> We may write 4 bytes too many when we reinitialize the anti-replay
> window in the replay advance functions. This patch fixes this by
> adjusting the last index of the initialization loop.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks.

^ permalink raw reply

* Re: [af-packet 2/2] Enhance af-packet to provide (near zero)lossless packet capture functionality
From: Joe Perches @ 2011-06-08  4:18 UTC (permalink / raw)
  To: Chetan Loke; +Cc: netdev, davem, eric.dumazet, kaber, johann.baudy, Chetan Loke
In-Reply-To: <1307502786-1396-3-git-send-email-loke.chetan@gmail.com>

On Tue, 2011-06-07 at 23:13 -0400, Chetan Loke wrote:
> Signed-off-by: Chetan Loke <lokec@ccs.neu.edu>

just trivia:

> ---
>  net/packet/af_packet.c |  878 +++++++++++++++++++++++++++++++++++++++++++++---
[]
> +/* kbdq - kernel block descriptor queue */
> +struct kbdq_core {
> +	struct pgv	*pkbdq;
> +	unsigned int	hdrlen;
> +	unsigned char	reset_pending_on_curr_blk;
> +	unsigned char   delete_blk_timer;
> +	unsigned short	kactive_blk_num;
> +	unsigned short	hole_bytes_size;
> +	char		*pkblk_start;
> +	char		*pkblk_end;
> +	int		kblk_size;
> +	unsigned int	knum_blocks;
> +	uint64_t	knxt_seq_num;
> +	char		*prev;
> +	char		*nxt_offset;
> +
> +	/* last_kactive_blk_num:
> +	 * trick to see if user-space has caught up
> +	 * in order to avoid refreshing timer when every single pkt arrives.
> +	 */
> +	unsigned short	last_kactive_blk_num;
> +
> +	atomic_t	blk_fill_in_prog;
> +
> +	/* Default is set to 8ms */
> +#define DEFAULT_PRB_RETIRE_TOV	(8)
> +
> +	unsigned short  retire_blk_tov;
> +	unsigned long	tov_in_jiffies;
> +
> +	/* timer to retire an outstanding block */
> +	struct timer_list retire_blk_timer;
> +};

You could align the member entries a bit more,
maybe move last_kactive_blk_num after retire_blk_tov

[]

> @@ -248,8 +322,11 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
>  		h.h2->tp_status = status;
>  		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
>  		break;
> +	case TPACKET_V3:
>  	default:
> -		pr_err("TPACKET version not supported\n");
> +		pr_err("<%s> TPACKET version not supported.Who is calling?\
> +			Dumping stack.\n", __func__);

whitespace defect because of line continuation.  Maybe just:
		WARN(1, "TPACKET version not supported\n");

>  		BUG();
>  	}
>  
> @@ -274,8 +351,11 @@ static int __packet_get_status(struct packet_sock *po, void *frame)
>  	case TPACKET_V2:
>  		flush_dcache_page(pgv_to_page(&h.h2->tp_status));
>  		return h.h2->tp_status;
> +	case TPACKET_V3:
>  	default:
> -		pr_err("TPACKET version not supported\n");
> +		pr_err("<%s> TPACKET version:%d not supported.\
> +			Dumping stack.\n", __func__, po->tp_version);
> +		dump_stack();

here too.

		WARN(1, "TPACKET version %d not supported\n", po->tp_version);

>  		BUG();
>  		return 0;
>  	}
[]
> +static void prb_open_block(struct kbdq_core *pkc1, struct block_desc *pbd1)
> +{
> +	pr_err("<%s> ERROR block:%p is NOT FREE status:%d\
> +			kactive_blk_num:%d\n",
> +			__func__, pbd1, BLOCK_STATUS(pbd1), pkc1->kactive_blk_num);
> +	dump_stack();
> +	BUG();

here too.  maybe just:
	WARN(1, "%s: ERROR block:%p is not free.  status:%d kactive_blk_num:%d\n",
	     __func__, pbd1, BLOCK_STATUS(pbd1), pkc1->kactive_blk_num);

> +static inline void packet_increment_rx_head(struct packet_sock *po,
> +					    struct packet_ring_buffer *rb)
> +{
> +	switch (po->tp_version) {
> +	case TPACKET_V1:
> +	case TPACKET_V2:
> +		return packet_increment_head(rb);
> +	case TPACKET_V3:
> +	default:
> +		pr_err("<%s> TPACKET version:%d not supported.\
> +			Dumping stack.\n", __func__, po->tp_version);

whitespace, WARN(1, etc...


> @@ -2412,7 +3168,7 @@ out_free_pgvec:
>  	goto out;
>  }
>  
> -static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
> +static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>  		int closing, int tx_ring)
>  {
>  	struct pgv *pg_vec = NULL;
> @@ -2421,7 +3177,17 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  	struct packet_ring_buffer *rb;
>  	struct sk_buff_head *rb_queue;
>  	__be16 num;
> -	int err;
> +	int err = -EINVAL;
> +	/* Added to minimize code churn */
> +	struct tpacket_req *req = &req_u->req;
> +
> +	/* Opening a Tx-ring is NOT supported in TPACKET_V3 */
> +	if (!closing && tx_ring && (po->tp_version > TPACKET_V2)) {
> +		pr_err("<%s> Tx-ring is not supported on version:%d.\
> +			   Dumping stack.\n", __func__, po->tp_version);

whitespace, WARN(1, etc...



^ permalink raw reply

* Re: [af-packet 1/2] Enhance af-packet to provide (near zero)lossless packet capture functionality.
From: Eric Dumazet @ 2011-06-08  4:35 UTC (permalink / raw)
  To: Chetan Loke; +Cc: netdev, davem, kaber, johann.baudy, Chetan Loke
In-Reply-To: <1307502786-1396-2-git-send-email-loke.chetan@gmail.com>

Le mardi 07 juin 2011 à 23:13 -0400, Chetan Loke a écrit :
>  
> +struct tpacket3_hdr {
> +	__u32		tp_status;
> +	__u32		tp_len;
> +	__u32		tp_snaplen;
> +	__u16		tp_mac;
> +	__u16		tp_net;

> +	__u32		tp_sec;
> +	__u32		tp_nsec;
> +	__u16		tp_vlan_tci;

missing "__u16 tp_padding;" here

check :

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=13fcb7bd322164c67926ffe272846d4860196dc6


> +	__u32		tp_next_offset;
> +};




^ permalink raw reply

* Re: [PATCH 2/3] ipv4: Fix packet size calculation for IPsec packets in __ip_append_data
From: Steffen Klassert @ 2011-06-08  5:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Herbert Xu, netdev
In-Reply-To: <1307423217.2642.59.camel@edumazet-laptop>

On Tue, Jun 07, 2011 at 07:06:57AM +0200, Eric Dumazet wrote:
> 
> Nick mail was :
> 
> http://www.spinics.net/lists/netdev/msg141308.html
> 

Thanks for providing this information.

> Unfortunately I could not find on my machines where I put my own
> scripts...
> 
> Not a big deal, I suspect we can revert my commit if you say it added a
> regression :)
> 

In the meantime, I can confirm that the slowpath problem comes back with
my patch, so we still have a bug somewhere. Reverting your commit would
just be a band-aid; I think it is better to find the bug and fix it
properly. Unfortunately, I fear I won't be able to track it down before
my vacation, which starts tomorrow. I'll continue working on it once I'm
back...

^ permalink raw reply

* linux-next: build failure after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2011-06-08  5:54 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Alexey Dobriyan

Hi all,

After merging the final tree, today's linux-next build (powerpc
allyesconfig) failed like this:

drivers/net/ll_temac_main.c: In function 'temac_open':
drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'

Probably caused by commit a6b7a407865a ("net: remove interrupt.h
inclusion from netdevice.h").

I have added this patch for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 8 Jun 2011 15:49:33 +1000
Subject: [PATCH] net: add needed interrupt.h

Fixes these errors after the removal of interrupt.h from netdevice.h:

drivers/net/ll_temac_main.c: In function 'temac_open':
drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/ll_temac_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ll_temac_main.c b/drivers/net/ll_temac_main.c
index b7948cc..e7b8afe 100644
--- a/drivers/net/ll_temac_main.c
+++ b/drivers/net/ll_temac_main.c
@@ -48,6 +48,7 @@
 #include <linux/io.h>
 #include <linux/ip.h>
 #include <linux/slab.h>
+#include <linux/interrupt.h>
 
 #include "ll_temac.h"
 
-- 
1.7.5.3

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply related

* Re: [net-next 37/40] rtnetlink: Compute and store minimum ifinfo dump size
From: Johannes Berg @ 2011-06-08  6:09 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Greg Rose, netdev, gospo
In-Reply-To: <1307449995-9458-38-git-send-email-jeffrey.t.kirsher@intel.com>

On Tue, 2011-06-07 at 05:33 -0700, Jeff Kirsher wrote:
> From: Greg Rose <gregory.v.rose@intel.com>
> 
> The message size allocated for rtnl ifinfo dumps was limited to
> a single page.  This is not enough for additional interface info
> available with devices that support SR-IOV and caused a bug in
> which VF info would not be displayed if more than approximately
> 40 VFs were created per interface.
> 
> Implement a new function pointer for the rtnl_register service that will
> calculate the amount of data required for the ifinfo dump and allocate
> enough data to satisfy the request.

Curious. Weren't dumps supposed to be split up into small chunks and
then delivered? Where is this splitting going wrong, and could it be
improved to split into smaller pieces?

johannes


^ permalink raw reply

* Re: [PATCH] netfilter: nf_nat: avoid double nat for loopback
From: Julian Anastasov @ 2011-06-08  6:26 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <4DEEAD6F.8050505@trash.net>


	Hello,

On Wed, 8 Jun 2011, Patrick McHardy wrote:

> >> to the IPS_SEQ_ADJUST_BIT case to at least avoid it in some cases.
> >> Would that work or am I missing something?
> > 
> > 	Logically, the new check can be after
> > test_bit(IPS_SEQ_ADJUST_BIT, &ct->status). But I suspect
> > some modules adjust seqs in the helper->help call,
> > for example, sip_help_tcp if I'm correctly reading the
> > code.
> 
> Yes, you're right. But it's the only one since it's the only helper
> doing possibly many modifications on a single TCP packet, which can't
> be handled by the generic code properly. So if you're worried about
> performance costs, I'd have no problems adding this check to the SIP
> helper.

	OK, I'm posting a new version just for the seq adjustment.
I'm not fixing sip_help_tcp because I'm not sure what the
right fix is; we must first be sure that calling sip_help_tcp
twice is not a problem.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* [PATCH v2] netfilter: avoid double seq_adjust for loopback
From: Julian Anastasov @ 2011-06-08  6:31 UTC (permalink / raw)
  To: Patrick McHardy, Pablo Neira Ayuso, netfilter-devel, netdev


	Avoid double seq adjustment for loopback traffic
because it causes silent repetition of TCP data. One
example is passive FTP with a DNAT rule where the original
and translated IP addresses differ in length.

	This patch adds a check for whether the packet is sent and
received via the loopback device. As the same conntrack is
used for both the outgoing and the incoming direction, we
restrict seq adjustment to happen only in POSTROUTING.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp v2.6.39/linux/include/net/netfilter/nf_conntrack.h linux/include/net/netfilter/nf_conntrack.h
--- v2.6.39/linux/include/net/netfilter/nf_conntrack.h	2011-05-20 10:38:04.000000000 +0300
+++ linux/include/net/netfilter/nf_conntrack.h	2011-06-08 08:29:58.880272586 +0300
@@ -308,6 +308,12 @@ static inline int nf_ct_is_untracked(con
 	return test_bit(IPS_UNTRACKED_BIT, &ct->status);
 }
 
+/* Packet is received from loopback */
+static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
+{
+	return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
+}
+
 extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 extern unsigned int nf_conntrack_htable_size;
 extern unsigned int nf_conntrack_max;
diff -urp v2.6.39/linux/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c linux/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
--- v2.6.39/linux/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c	2010-08-02 09:37:49.000000000 +0300
+++ linux/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c	2011-06-08 08:34:15.594269592 +0300
@@ -121,7 +121,9 @@ static unsigned int ipv4_confirm(unsigne
 		return ret;
 	}
 
-	if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status)) {
+	/* adjust seqs for loopback traffic only in outgoing direction */
+	if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) &&
+	    !nf_is_loopback_packet(skb)) {
 		typeof(nf_nat_seq_adjust_hook) seq_adjust;
 
 		seq_adjust = rcu_dereference(nf_nat_seq_adjust_hook);
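
The check added by the patch above can be modelled in userspace as follows. This is a sketch under stated assumptions, not kernel code: struct dev_sketch, struct skb_sketch, SKETCH_IFF_LOOPBACK and should_seq_adjust() are hypothetical stand-ins, and the model assumes skb_iif stays 0 for a locally generated packet until it has been received on an interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror the kernel's IFF_LOOPBACK flag value. */
#define SKETCH_IFF_LOOPBACK 0x8u

struct dev_sketch {
    unsigned int flags;
};

struct skb_sketch {
    const struct dev_sketch *dev;
    int skb_iif;   /* assumed 0 until the packet was received somewhere */
};

/* Same predicate as nf_is_loopback_packet() in the patch. */
bool is_loopback_rx(const struct skb_sketch *skb)
{
    assert(skb != NULL);
    return skb->dev && skb->skb_iif &&
           (skb->dev->flags & SKETCH_IFF_LOOPBACK);
}

/* Adjust only if the conntrack asks for it and this is not the loopback
 * receive pass of a packet that was already adjusted on the way out. */
bool should_seq_adjust(bool seq_adjust_bit, const struct skb_sketch *skb)
{
    return seq_adjust_bit && !is_loopback_rx(skb);
}
```

In this model a packet sent over loopback is adjusted on the outgoing pass (skb_iif == 0) and skipped on the incoming pass, while traffic on other devices is unaffected.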

^ permalink raw reply

* [PATCH] ixgbe: Report PAUSE flags to ethtool
From: Esa-Pekka Pyokkimies @ 2011-06-08  6:33 UTC (permalink / raw)
  To: netdev
In-Reply-To: <op.vwnklt0q6ywr33@esapekka-pc.rad1>

Hello!

I noticed that the ixgbe driver doesn't report the SUPPORTED_Pause and
ADVERTISED_Pause flags to ethtool. This means that ethtool
always reports:
	Supported pause frame use: No
	Advertised pause frame use: No
I added reporting for both capabilities and advertising, and
tested it with our ixgbe card and the latest ethtool
from the git repo. I also need to add the ability to
change advertising parameters with "ethtool -s advertise %x",
but I will send that as a separate patch if this one looks OK.

Signed-off-by: Esa-Pekka Pyokkimies <esa-pekka.pyokkimies@stonesoft.com>
---
diff --git a/drivers/net/ixgbe/ixgbe_ethtool.c b/drivers/net/ixgbe/ixgbe_ethtool.c
index cb1555b..6005116 100644
--- a/drivers/net/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ixgbe/ixgbe_ethtool.c
@@ -150,6 +150,7 @@ static int ixgbe_get_settings(struct net_device *netdev,
    	bool link_up;

    	ecmd->supported = SUPPORTED_10000baseT_Full;
+	ecmd->supported |= SUPPORTED_Pause;
    	ecmd->autoneg = AUTONEG_ENABLE;
    	ecmd->transceiver = XCVR_EXTERNAL;
    	if ((hw->phy.media_type == ixgbe_media_type_copper) ||
@@ -231,6 +232,21 @@ static int ixgbe_get_settings(struct net_device *netdev,
    		ecmd->autoneg = AUTONEG_DISABLE;
    	}

+	if (hw->fc.current_mode == ixgbe_fc_full) {
+		ecmd->advertising |= ADVERTISED_Pause;
+	} else if (hw->fc.current_mode == ixgbe_fc_rx_pause) {
+		ecmd->advertising |= ADVERTISED_Pause;
+		ecmd->advertising |= ADVERTISED_Asym_Pause;
+	} else if (hw->fc.current_mode == ixgbe_fc_tx_pause) {
+		ecmd->advertising |= ADVERTISED_Asym_Pause;
+	} else if (hw->fc.current_mode == ixgbe_fc_none) {
+		/* Correctly initialized */
+	} else if (hw->fc.current_mode == ixgbe_fc_pfc) {
+		/* Ethtool doesn't know about this mode */
+	} else {
+		/* Future modes */
+	}
+
    	/* Get PHY type */
    	switch (adapter->hw.phy.type) {
    	case ixgbe_phy_tn:
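
The mode-to-flags mapping in the hunk above can also be written as a small pure function, which makes it easy to check in isolation. This is a userspace sketch only: enum fc_mode_sketch, fc_mode_to_advertising() and the SKETCH_ADV_* constants are hypothetical names, and the bit values are assumptions standing in for ADVERTISED_Pause and ADVERTISED_Asym_Pause.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins for ADVERTISED_Pause / ADVERTISED_Asym_Pause. */
#define SKETCH_ADV_PAUSE      (1u << 13)
#define SKETCH_ADV_ASYM_PAUSE (1u << 14)

/* Stand-in for the driver's flow-control modes. */
enum fc_mode_sketch {
    FC_NONE,
    FC_RX_PAUSE,
    FC_TX_PAUSE,
    FC_FULL,
    FC_PFC
};

/* Same mapping as the if/else chain in the patch. */
uint32_t fc_mode_to_advertising(enum fc_mode_sketch mode)
{
    uint32_t bits = 0;

    switch (mode) {
    case FC_FULL:
        bits = SKETCH_ADV_PAUSE;
        break;
    case FC_RX_PAUSE:
        bits = SKETCH_ADV_PAUSE | SKETCH_ADV_ASYM_PAUSE;
        break;
    case FC_TX_PAUSE:
        bits = SKETCH_ADV_ASYM_PAUSE;
        break;
    case FC_NONE:   /* nothing to advertise */
    case FC_PFC:    /* not expressible with these two flags */
    default:        /* future modes */
        break;
    }

    /* Only the two pause bits may ever be set here. */
    assert((bits & ~(SKETCH_ADV_PAUSE | SKETCH_ADV_ASYM_PAUSE)) == 0);
    return bits;
}
```

The caller would OR the result into the advertising word, as the patch does field by field.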



^ permalink raw reply related

* Re: [PATCHv3] net: Define enum for the bits used in features.
From: Mahesh Bandewar @ 2011-06-08  6:55 UTC (permalink / raw)
  To: David Miller
  Cc: mst, linux-netdev, Tom Herbert, Michał Mirosław,
	Stephen Hemminger
In-Reply-To: <20110606.122059.261517215690508151.davem@davemloft.net>

On Mon, Jun 6, 2011 at 12:20 PM, David Miller <davem@davemloft.net> wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Mon, 6 Jun 2011 18:32:53 +0300
>
>> On Sun, Jun 05, 2011 at 10:15:37PM -0700, David Miller wrote:
>>> Since the GSO accessors deal with multiple bits, you can create
>>> special GSO specific interfaces to manipulate them.
>>
>> Yes but it's not just GSO.
>> It's anything that includes more than 1 feature.
>> Examples:
>> NETIF_F_ALL_CSUM
>> NETIF_F_ALL_TX_OFFLOADS
>> NETIF_F_V6_CSUM
>> NETIF_F_SOFT_FEATURES
>>
>> etc
>>
>> Creating many accessors for each will need a lot
>> of code duplication ...
>
> Yet this is something you must resolve in order to change the feature
> bit implementation.
>
> Whether this issue is difficult or not to address, it has to be done
> either way.
>

I agree that the cleanup is not strictly necessary for the feature
extension as such, but this patch, along with the other one I have
posted, is the beginning of that work. It's definitely not complete, and
also not as simple as it sounds, because of the constants that are
defined as "or-ed" flag values (listed above). I think it would be nice
to get this done in as little code as possible, but I think that should
be the constraint.

In these two patches I have created a separate header file,
"netdev_features.h", where everything related to "features" should
reside, including all these accessor macros / functions.
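
To make the discussion concrete, here is a minimal userspace sketch of one way bit-number enums, composite masks and generic accessors can fit together. It is not the code from the posted patches: every name (feat_bit_sketch, FEAT_MASK, feat_test_any and so on) is hypothetical, and the choice of bits in the composite masks is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions live in an enum; masks are derived from them. */
enum feat_bit_sketch {
    FEAT_SG_BIT,
    FEAT_IP_CSUM_BIT,
    FEAT_HW_CSUM_BIT,
    FEAT_IPV6_CSUM_BIT,
    FEAT_BIT_COUNT        /* must stay last */
};

typedef uint64_t feat_mask_t;

_Static_assert(FEAT_BIT_COUNT <= 64, "feature bits must fit the mask type");

#define FEAT_MASK(bit) ((feat_mask_t)1 << (bit))

/* Composite ("or-ed") values built from single-bit masks. */
#define FEAT_V6_CSUM_SKETCH \
    (FEAT_MASK(FEAT_HW_CSUM_BIT) | FEAT_MASK(FEAT_IPV6_CSUM_BIT))
#define FEAT_ALL_CSUM_SKETCH \
    (FEAT_MASK(FEAT_IP_CSUM_BIT) | FEAT_V6_CSUM_SKETCH)

/* Accessors take a mask, so one set serves single bits and composites. */
bool feat_test_any(feat_mask_t features, feat_mask_t mask)
{
    return (features & mask) != 0;
}

bool feat_test_all(feat_mask_t features, feat_mask_t mask)
{
    return (features & mask) == mask;
}

feat_mask_t feat_set(feat_mask_t features, feat_mask_t mask)
{
    return features | mask;
}

feat_mask_t feat_clear(feat_mask_t features, feat_mask_t mask)
{
    return features & ~mask;
}
```

Because the accessors operate on masks rather than bit numbers, a composite such as FEAT_ALL_CSUM_SKETCH needs no accessor of its own, which is one way to avoid the code duplication mentioned in the quoted text.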

--mahesh..

^ permalink raw reply

* Re: [net-next 37/40] rtnetlink: Compute and store minimum ifinfo dump size
From: David Miller @ 2011-06-08  7:12 UTC (permalink / raw)
  To: johannes; +Cc: jeffrey.t.kirsher, gregory.v.rose, netdev, gospo
In-Reply-To: <1307513397.3961.0.camel@jlt3.sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 08 Jun 2011 08:09:57 +0200

> Curious. Weren't dumps supposed to be split up into small chunks and
> then delivered? Where is this splitting going wrong, and could it be
> improved to split into smaller pieces?

You can only split at the individual object boundary.

And in these cases individual single network device instances are too
large to go in one SKB.
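
The constraint described above can be sketched as a simple packing loop. This is an illustration in userspace C, not rtnetlink code: count_dump_chunks() is a hypothetical helper that packs whole objects into fixed-size buffers and reports failure when a single object is larger than one buffer, which is the case that splitting cannot solve.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pack objects, in order, into buffers of buf_size bytes, splitting only
 * at object boundaries. Returns the number of buffers used, or -1 if any
 * single object does not fit into one buffer.
 */
int count_dump_chunks(const size_t *obj_sizes, size_t n_objs, size_t buf_size)
{
    int chunks = 0;
    size_t used = 0;

    assert(obj_sizes != NULL || n_objs == 0);

    for (size_t i = 0; i < n_objs; i++) {
        if (obj_sizes[i] > buf_size)
            return -1;              /* object too large for any buffer */
        if (chunks == 0 || used + obj_sizes[i] > buf_size) {
            chunks++;               /* start a new buffer */
            used = 0;
        }
        used += obj_sizes[i];
    }
    return chunks;
}
```

With a 4096-byte buffer, objects of 1000, 1000, 3000 and 500 bytes pack into two buffers, while a 5000-byte object makes the function return -1 no matter how the others are arranged.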

^ permalink raw reply

* Re: linux-next: build failure after merge of the final tree (net tree related)
From: David Miller @ 2011-06-08  7:16 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, adobriyan
In-Reply-To: <20110608155411.c3a2aa09.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 8 Jun 2011 15:54:11 +1000

> After merging the final tree, today's linux-next build (powerpc
> allyesconfig) failed like this:
> 
> drivers/net/ll_temac_main.c: In function 'temac_open':
> drivers/net/ll_temac_main.c:859:2: error: implicit declaration of function 'request_irq'
> drivers/net/ll_temac_main.c:870:2: error: implicit declaration of function 'free_irq'
> drivers/net/ll_temac_main.c: In function 'temac_poll_controller':
> drivers/net/ll_temac_main.c:903:2: error: implicit declaration of function 'disable_irq'
> drivers/net/ll_temac_main.c:909:2: error: implicit declaration of function 'enable_irq'

Oh well, I hit all the drivers I could with x86 and sparc64 builds.

> Probably caused by commit a6b7a407865a ("net: remove interrupt.h
> inclusion from netdevice.h").
> 
> I have added this patch for today:
> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Wed, 8 Jun 2011 15:49:33 +1000
> Subject: [PATCH] net: add needed interrupt.h

Applied, thanks Stephen.

^ permalink raw reply

* [PATCH] gianfar:localized filer table
From: Jiajun Wu @ 2011-06-08  7:46 UTC (permalink / raw)
  To: netdev, davem; +Cc: linuxppc-dev, Jiajun Wu

Each eTSEC device should have its own filer table. Move the global
ftp_rqfpr[] and ftp_rqfcr[] arrays into struct gfar_private.

Signed-off-by: Jiajun Wu <b06378@freescale.com>
---
 drivers/net/gianfar.c         |   29 ++++++++----------
 drivers/net/gianfar.h         |    8 +++--
 drivers/net/gianfar_ethtool.c |   64 +++++++++++++++++++++--------------------
 3 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index ff60b23..2dfcc80 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -10,7 +10,7 @@
  * Maintainer: Kumar Gala
  * Modifier: Sandeep Gopalpet <sandeep.kumar@freescale.com>
  *
- * Copyright 2002-2009 Freescale Semiconductor, Inc.
+ * Copyright 2002-2009, 2011 Freescale Semiconductor, Inc.
  * Copyright 2007 MontaVista Software, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
@@ -476,9 +476,6 @@ static const struct net_device_ops gfar_netdev_ops = {
 #endif
 };
 
-unsigned int ftp_rqfpr[MAX_FILER_IDX + 1];
-unsigned int ftp_rqfcr[MAX_FILER_IDX + 1];
-
 void lock_rx_qs(struct gfar_private *priv)
 {
 	int i = 0x0;
@@ -868,28 +865,28 @@ static u32 cluster_entry_per_class(struct gfar_private *priv, u32 rqfar,
 
 	rqfar--;
 	rqfcr = RQFCR_CLE | RQFCR_PID_MASK | RQFCR_CMP_EXACT;
-	ftp_rqfpr[rqfar] = rqfpr;
-	ftp_rqfcr[rqfar] = rqfcr;
+	priv->ftp_rqfpr[rqfar] = rqfpr;
+	priv->ftp_rqfcr[rqfar] = rqfcr;
 	gfar_write_filer(priv, rqfar, rqfcr, rqfpr);
 
 	rqfar--;
 	rqfcr = RQFCR_CMP_NOMATCH;
-	ftp_rqfpr[rqfar] = rqfpr;
-	ftp_rqfcr[rqfar] = rqfcr;
+	priv->ftp_rqfpr[rqfar] = rqfpr;
+	priv->ftp_rqfcr[rqfar] = rqfcr;
 	gfar_write_filer(priv, rqfar, rqfcr, rqfpr);
 
 	rqfar--;
 	rqfcr = RQFCR_CMP_EXACT | RQFCR_PID_PARSE | RQFCR_CLE | RQFCR_AND;
 	rqfpr = class;
-	ftp_rqfcr[rqfar] = rqfcr;
-	ftp_rqfpr[rqfar] = rqfpr;
+	priv->ftp_rqfcr[rqfar] = rqfcr;
+	priv->ftp_rqfpr[rqfar] = rqfpr;
 	gfar_write_filer(priv, rqfar, rqfcr, rqfpr);
 
 	rqfar--;
 	rqfcr = RQFCR_CMP_EXACT | RQFCR_PID_MASK | RQFCR_AND;
 	rqfpr = class;
-	ftp_rqfcr[rqfar] = rqfcr;
-	ftp_rqfpr[rqfar] = rqfpr;
+	priv->ftp_rqfcr[rqfar] = rqfcr;
+	priv->ftp_rqfpr[rqfar] = rqfpr;
 	gfar_write_filer(priv, rqfar, rqfcr, rqfpr);
 
 	return rqfar;
@@ -904,8 +901,8 @@ static void gfar_init_filer_table(struct gfar_private *priv)
 
 	/* Default rule */
 	rqfcr = RQFCR_CMP_MATCH;
-	ftp_rqfcr[rqfar] = rqfcr;
-	ftp_rqfpr[rqfar] = rqfpr;
+	priv->ftp_rqfcr[rqfar] = rqfcr;
+	priv->ftp_rqfpr[rqfar] = rqfpr;
 	gfar_write_filer(priv, rqfar, rqfcr, rqfpr);
 
 	rqfar = cluster_entry_per_class(priv, rqfar, RQFPR_IPV6);
@@ -921,8 +918,8 @@ static void gfar_init_filer_table(struct gfar_private *priv)
 	/* Rest are masked rules */
 	rqfcr = RQFCR_CMP_NOMATCH;
 	for (i = 0; i < rqfar; i++) {
-		ftp_rqfcr[i] = rqfcr;
-		ftp_rqfpr[i] = rqfpr;
+		priv->ftp_rqfcr[i] = rqfcr;
+		priv->ftp_rqfpr[i] = rqfpr;
 		gfar_write_filer(priv, i, rqfcr, rqfpr);
 	}
 }
diff --git a/drivers/net/gianfar.h b/drivers/net/gianfar.h
index fc86f51..ba36dc7 100644
--- a/drivers/net/gianfar.h
+++ b/drivers/net/gianfar.h
@@ -9,7 +9,7 @@
  * Maintainer: Kumar Gala
  * Modifier: Sandeep Gopalpet <sandeep.kumar@freescale.com>
  *
- * Copyright 2002-2009 Freescale Semiconductor, Inc.
+ * Copyright 2002-2009, 2011 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -1107,10 +1107,12 @@ struct gfar_private {
 	/* HW time stamping enabled flag */
 	int hwts_rx_en;
 	int hwts_tx_en;
+
+	/*Filer table*/
+	unsigned int ftp_rqfpr[MAX_FILER_IDX + 1];
+	unsigned int ftp_rqfcr[MAX_FILER_IDX + 1];
 };
 
-extern unsigned int ftp_rqfpr[MAX_FILER_IDX + 1];
-extern unsigned int ftp_rqfcr[MAX_FILER_IDX + 1];
 
 static inline int gfar_has_errata(struct gfar_private *priv,
 				  enum gfar_errata err)
diff --git a/drivers/net/gianfar_ethtool.c b/drivers/net/gianfar_ethtool.c
index 493d743..239e333 100644
--- a/drivers/net/gianfar_ethtool.c
+++ b/drivers/net/gianfar_ethtool.c
@@ -9,7 +9,7 @@
  *  Maintainer: Kumar Gala
  *  Modifier: Sandeep Gopalpet <sandeep.kumar@freescale.com>
  *
- *  Copyright 2003-2006, 2008-2009 Freescale Semiconductor, Inc.
+ *  Copyright 2003-2006, 2008-2009, 2011 Freescale Semiconductor, Inc.
  *
  *  This software may be used and distributed according to
  *  the terms of the GNU Public License, Version 2, incorporated herein
@@ -609,15 +609,15 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 	if (ethflow & RXH_L2DA) {
 		fcr = RQFCR_PID_DAH |RQFCR_CMP_NOMATCH |
 			RQFCR_HASH | RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 
 		fcr = RQFCR_PID_DAL | RQFCR_AND | RQFCR_CMP_NOMATCH |
 				RQFCR_HASH | RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -626,16 +626,16 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 		fcr = RQFCR_PID_VID | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 				RQFCR_AND | RQFCR_HASHTBL_0;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
 
 	if (ethflow & RXH_IP_SRC) {
 		fcr = RQFCR_PID_SIA | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 			RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -643,8 +643,8 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 	if (ethflow & (RXH_IP_DST)) {
 		fcr = RQFCR_PID_DIA | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 			RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -652,8 +652,8 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 	if (ethflow & RXH_L3_PROTO) {
 		fcr = RQFCR_PID_L4P | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 			RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -661,8 +661,8 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 	if (ethflow & RXH_L4_B_0_1) {
 		fcr = RQFCR_PID_SPT | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 			RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -670,8 +670,8 @@ static void ethflow_to_filer_rules (struct gfar_private *priv, u64 ethflow)
 	if (ethflow & RXH_L4_B_2_3) {
 		fcr = RQFCR_PID_DPT | RQFCR_CMP_NOMATCH | RQFCR_HASH |
 			RQFCR_AND | RQFCR_HASHTBL_0;
-		ftp_rqfpr[priv->cur_filer_idx] = fpr;
-		ftp_rqfcr[priv->cur_filer_idx] = fcr;
+		priv->ftp_rqfpr[priv->cur_filer_idx] = fpr;
+		priv->ftp_rqfcr[priv->cur_filer_idx] = fcr;
 		gfar_write_filer(priv, priv->cur_filer_idx, fcr, fpr);
 		priv->cur_filer_idx = priv->cur_filer_idx - 1;
 	}
@@ -705,12 +705,12 @@ static int gfar_ethflow_to_filer_table(struct gfar_private *priv, u64 ethflow, u
 	}
 
 	for (i = 0; i < MAX_FILER_IDX + 1; i++) {
-		local_rqfpr[j] = ftp_rqfpr[i];
-		local_rqfcr[j] = ftp_rqfcr[i];
+		local_rqfpr[j] = priv->ftp_rqfpr[i];
+		local_rqfcr[j] = priv->ftp_rqfcr[i];
 		j--;
-		if ((ftp_rqfcr[i] == (RQFCR_PID_PARSE |
+		if ((priv->ftp_rqfcr[i] == (RQFCR_PID_PARSE |
 			RQFCR_CLE |RQFCR_AND)) &&
-			(ftp_rqfpr[i] == cmp_rqfpr))
+			(priv->ftp_rqfpr[i] == cmp_rqfpr))
 			break;
 	}
 
@@ -724,20 +724,22 @@ static int gfar_ethflow_to_filer_table(struct gfar_private *priv, u64 ethflow, u
 	 * if it was already programmed, we need to overwrite these rules
 	 */
 	for (l = i+1; l < MAX_FILER_IDX; l++) {
-		if ((ftp_rqfcr[l] & RQFCR_CLE) &&
-			!(ftp_rqfcr[l] & RQFCR_AND)) {
-			ftp_rqfcr[l] = RQFCR_CLE | RQFCR_CMP_EXACT |
+		if ((priv->ftp_rqfcr[l] & RQFCR_CLE) &&
+			!(priv->ftp_rqfcr[l] & RQFCR_AND)) {
+			priv->ftp_rqfcr[l] = RQFCR_CLE | RQFCR_CMP_EXACT |
 				RQFCR_HASHTBL_0 | RQFCR_PID_MASK;
-			ftp_rqfpr[l] = FPR_FILER_MASK;
-			gfar_write_filer(priv, l, ftp_rqfcr[l], ftp_rqfpr[l]);
+			priv->ftp_rqfpr[l] = FPR_FILER_MASK;
+			gfar_write_filer(priv, l, priv->ftp_rqfcr[l],
+				priv->ftp_rqfpr[l]);
 			break;
 		}
 
-		if (!(ftp_rqfcr[l] & RQFCR_CLE) && (ftp_rqfcr[l] & RQFCR_AND))
+		if (!(priv->ftp_rqfcr[l] & RQFCR_CLE) &&
+			(priv->ftp_rqfcr[l] & RQFCR_AND))
 			continue;
 		else {
-			local_rqfpr[j] = ftp_rqfpr[l];
-			local_rqfcr[j] = ftp_rqfcr[l];
+			local_rqfpr[j] = priv->ftp_rqfpr[l];
+			local_rqfcr[j] = priv->ftp_rqfcr[l];
 			j--;
 		}
 	}
@@ -750,8 +752,8 @@ static int gfar_ethflow_to_filer_table(struct gfar_private *priv, u64 ethflow, u
 
 	/* Write back the popped out rules again */
 	for (k = j+1; k < MAX_FILER_IDX; k++) {
-		ftp_rqfpr[priv->cur_filer_idx] = local_rqfpr[k];
-		ftp_rqfcr[priv->cur_filer_idx] = local_rqfcr[k];
+		priv->ftp_rqfpr[priv->cur_filer_idx] = local_rqfpr[k];
+		priv->ftp_rqfcr[priv->cur_filer_idx] = local_rqfcr[k];
 		gfar_write_filer(priv, priv->cur_filer_idx,
 				local_rqfcr[k], local_rqfpr[k]);
 		if (!priv->cur_filer_idx)
-- 
1.5.6.5
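
The motivation for the patch above can be shown with a small userspace model. It is a sketch under assumptions: struct gfar_priv_sketch and sketch_set_rule() are hypothetical names, SKETCH_MAX_FILER_IDX is an assumed stand-in for MAX_FILER_IDX, and no hardware write is modelled. With the shadow tables stored per device, programming a rule on one device leaves the other device's copy untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed table size; stands in for the driver's MAX_FILER_IDX. */
#define SKETCH_MAX_FILER_IDX 0xFF

/* Per-device private data holding its own shadow filer table. */
struct gfar_priv_sketch {
    unsigned int ftp_rqfpr[SKETCH_MAX_FILER_IDX + 1];
    unsigned int ftp_rqfcr[SKETCH_MAX_FILER_IDX + 1];
};

/* Record one rule in the shadow table of the given device only. */
void sketch_set_rule(struct gfar_priv_sketch *priv, size_t idx,
                     unsigned int fcr, unsigned int fpr)
{
    assert(priv != NULL && idx <= SKETCH_MAX_FILER_IDX);
    priv->ftp_rqfcr[idx] = fcr;
    priv->ftp_rqfpr[idx] = fpr;
}
```

Had the two arrays remained file-scope globals, both devices in such a model would read and write the same entries.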



^ permalink raw reply related

* Re: Bug#629604: linux-image-2.6.38-2-686: sky2 eth0 rx length errors (~5/second)
From: Ben Hutchings @ 2011-06-08 13:45 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Kate Gordon, 629604, netdev
In-Reply-To: <1307503575.22348.495.camel@localhost>

[-- Attachment #1: Type: text/plain, Size: 466 bytes --]

-------- Forwarded Message --------
From: Freedom Tea <freedomtea@gmail.com>
Reply-to: Freedom Tea <freedomtea@gmail.com>, 629604@bugs.debian.org
To: 629604@bugs.debian.org
Subject: Bug#629604: linux-image-2.6.38-2-686: sky2 eth0 rx length errors (~5/second)
Date: Wed, 8 Jun 2011 13:38:52 +1000

Sorry, yes I was on 2.6.32.  The 5 came from "2.6.32-5-686" (seen many times in grub).

kernel: [    1.084097] sky2 0000:02:00.0: Yukon-2 EC chip revision 2


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply

* [ANNOUNCE]: Release of iptables-1.4.11.1
From: Patrick McHardy @ 2011-06-08 13:54 UTC (permalink / raw)
  To: Netfilter Development Mailinglist, Linux Netdev List,
	netfilter-announce, 'netfilter@vger.kernel.org'

[-- Attachment #1: Type: text/plain, Size: 892 bytes --]

The netfilter coreteam presents:

    iptables version 1.4.11.1

a minor release containing bugfixes for problems discovered in
the 1.4.11 release as well as some minor documentation updates.

Due to the large number of changes in 1.4.11, some regressions
and a few minor bugs made it into the release. This version
addresses all identified problems:

- broken inversion support in the owner match

- broken inversion in implicitly loaded protocol extensions, most
  importantly causing "-p tcp ! --syn" to not negate --syn

- symlink installation fixes

- xml translation missing in IPv6 only builds

See the attached changelogs for the full list of changes.

Version 1.4.11.1 can be obtained from:

http://www.netfilter.org/projects/iptables/downloads.html
ftp://ftp.netfilter.org/pub/iptables/
git://git.netfilter.org/iptables.git

On behalf of the Netfilter Core Team.
Happy firewalling!

[-- Attachment #2: changes-iptables-1.4.11.1.txt --]
[-- Type: text/plain, Size: 1089 bytes --]

Elie De Brauwer (1):
      doc: fix trivial typo in libipt_SNAT

Jan Engelhardt (13):
      libxt_owner: restore inversion support
      build: remove dead code parts
      build: fix installation of symlinks
      build: fix absence of xml translator in IPv6-only builds
      doc: update GPL license text
      doc: iptables-xml should be in manpage section 1
      build: move basic preprocessor flags to regular_CPPFLAGS
      build: move kinclude's preprocessor flags to kinclude_CPPFLAGS
      src: move all libiptc pieces into its directory
      src: move all iptables pieces into a separate directory
      tests: add some sample rulesets to test save-restore cycle
      option: fix ignored negation before implicit extension loading
      build: re-add missing CPPFLAGS for libiptc

Maciej Żenczykowski (1):
      xtables-multi: fix absence of xml translator in IPv6-only builds

Mike Frysinger (1):
      build: move remaining preprocessor flags to CPPFLAGS

Patrick McHardy (1):
      Bump version to 1.4.11.1

Vlad Dogaru (1):
      doc: fix MASQUERADE section of man page


^ permalink raw reply

