Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 11/11] bridge: Dump vlan information from a bridge port
From: Vlad Yasevich @ 2012-12-18 17:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem, mst, john.r.fastabend
In-Reply-To: <20121218090332.2f18f393@nehalam.linuxnetplumber.net>

On 12/18/2012 12:03 PM, Stephen Hemminger wrote:
> On Wed, 12 Dec 2012 15:01:17 -0500
> Vlad Yasevich <vyasevic@redhat.com> wrote:
>
>> Using the RTM_GETLINK dump the vlan filter list of a given
>> bridge port.  The information depends on setting the filter
>> flag similar to how nic VF info is dumped.
>>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
>
> I will put these in the bridge tree with other patches that
> are being staged for net-next.
>

Hold on a bit.  I was running some more tests and found some
scenarios that didn't work right.

I am preparing v2 of the patches.

Thanks
-vlad

^ permalink raw reply

* Re: [PATCH 04/11] bridge: Cache vlan in the cb for faster egress lookup.
From: Vlad Yasevich @ 2012-12-18 17:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem, mst, john.r.fastabend
In-Reply-To: <20121218090448.69c21daa@nehalam.linuxnetplumber.net>

On 12/18/2012 12:04 PM, Stephen Hemminger wrote:
> On Wed, 12 Dec 2012 15:01:10 -0500
> Vlad Yasevich <vyasevic@redhat.com> wrote:
>
>> On input, cache the pointer to the bridge vlan info, so that
>> on egress, we have can simply look at the port bitmap instead
>> of traversing a vlan list.
>>
>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
>
> This isn't going to be safe. Once packet is passed up, the cb[]
> can get overwritten by other things.
>

Right, but only care about it while in bridging code.  We don't look
at it anywhere else...

Or are you saying that cb is not guaranteed to be preserved between 
br_handle_frame_finish and br_forward?

-vlad

^ permalink raw reply

* Re: [PATCH] build: unbreak linkage of m_xt.so
From: Stephen Hemminger @ 2012-12-18 17:21 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: stephen.hemminger, vapier, netdev, jhs, urykhy, shemonc, pablo,
	netfilter-devel
In-Reply-To: <1355617968-26138-1-git-send-email-jengelh@inai.de>

On Sun, 16 Dec 2012 01:32:48 +0100
Jan Engelhardt <jengelh@inai.de> wrote:

> Commit v3.7.0~10 caused the variable new PKG_CONFIG variable never
> to be present at the time of calling make, leading to tc/m_xt.so
> not linked with -lxtables (result from pkg-config xtables --libs),
> that in turn leading to
> 
> tc: symbol lookup error: /usr/lib64/tc//m_xt.so: undefined symbol:
> xtables_init_all
> 
> Fixing that.
> 
> Signed-off-by: Jan Engelhardt <jengelh@inai.de>
> ---
>  configure |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/configure b/configure
> index 9912114..573ee55 100755
> --- a/configure
> +++ b/configure
> @@ -4,7 +4,6 @@
>  INCLUDE=${1:-"$PWD/include"}
>  : ${PKG_CONFIG:=pkg-config}
>  : ${CC=gcc}
> -echo "PKG_CONFIG:=${PKG_CONFIG}" >>Config
>  
>  # Make a temp directory in build tree.
>  TMPDIR=$(mktemp -d config.XXXXXX)
> @@ -224,6 +223,7 @@ rm -f $TMPDIR/ipsettest.c $TMPDIR/ipsettest
>  }
>  
>  echo "# Generated config based on" $INCLUDE >Config
> +echo "PKG_CONFIG:=${PKG_CONFIG}" >>Config
>  
>  echo "TC schedulers"
>  

Ok, manually did the diff (conflicted with other previous changes).

^ permalink raw reply

* Re: network namespace and DNS lookups
From: Dan Williams @ 2012-12-18 17:22 UTC (permalink / raw)
  To: Ravi Aysola; +Cc: netdev
In-Reply-To: <CAFaHj6HT0c59us_7F9Uh7BYh_YXGRt30=yLOrV=hcrG29YO-qA@mail.gmail.com>

On Tue, 2012-12-18 at 10:49 -0500, Ravi Aysola wrote:
> I think I sent my earlier email a bit prematurely.  I do have
> /etc/netns/<namespace-name>/resolv.conf
> files under each of my namespaces. Now the question is, how does a
> user space process
> (say bind)  look at a namespace specific resolv.conf instead of
> default one?  Have any of
> these standard applications been modified to work with namespace
> specific config files?

Wouldn't that be the glibc resolver's domain?  DNS lookups aren't done
by the kernel, but by glibc in userspace.  And glibc is also what reads
resolv.conf, so most things DNS-namespace related would need to happen
there.

Dan

> thanks again
> ravi/
> 
> On Tue, Dec 18, 2012 at 10:09 AM, Ravi Aysola <ravi.mlists@gmail.com> wrote:
> > Has there been any work in any of the recent kernels to limit the DNS lookup
> > to a particular network namespace?  Do we have any facility to specify the
> > DNS resolvers on network namespace basis (such as /etc/ns/resolv.conf)?
> >
> > thank you
> > ravi/
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2] netlink: align attributes on 64-bits
From: Thomas Graf @ 2012-12-18 17:11 UTC (permalink / raw)
  To: David Laight; +Cc: nicolas.dichtel, bhutchings, netdev, davem
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70F5@saturn3.aculab.com>

On 12/18/12 at 04:50pm, David Laight wrote:
> > 2/ Suppose that the attribute is:
> > 
> >    struct foo {
> >    	__u64 bar1;
> >    	__u32 bar2;
> >    }
> >    => sizeof(struct foo) = 12 (= payload)
> 
> That is only true if the host architecture aligns 64bit items
> on 32 it boundaries (as i386 does).
> Otherwise there are 4 bytes of padding at the end and the
> size is 16.
> 
> Actually it is worse than that.
> Consider the structure:
> 	struct bar {
> 		__u32 foo1;
> 		__u64 foo2;
> 	}
> On i386 it will have size 12 and foo2 will be at offset 4.
> On sparc32 (and most 64bit) it will have size 16 with foo2
> at offset 8 (and 4 bytes of pad after foo1).

This is a known problem and I can't think of anything
that can be done about it except for memcpy()ing the
data before accessing it.

If you have ideas, I'm more that willing to listen :)

> Do these messages move between systems?
> If they do then any 64bit items need an explicit alignment
> eg tag with __attribute__((aligned(8))) (or aligned(4)).

They don't. Netlink has and will be host bound. It also
uses host byte order for that reason.

^ permalink raw reply

* Re: [PATCH v2] netlink: align attributes on 64-bits
From: Thomas Graf @ 2012-12-18 17:08 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: bhutchings, netdev, davem, David.Laight
In-Reply-To: <50D09897.8030508@6wind.com>

On 12/18/12 at 05:23pm, Nicolas Dichtel wrote:
> Le 18/12/2012 13:57, Thomas Graf a écrit :
> >-static inline int nla_padlen(int payload)
> >-{
> >-	return nla_total_size(payload) - nla_attr_size(payload);
> >+	if (!IS_ALIGNED(len, NLA_ATTR_ALIGN))
> >+		len = ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN);
> Two comments:
> 1/ should it be ALIGN(len, NLA_ATTR_ALIGN)? If we want to add a __u64:
>    => nla_attr_size(sizeof(__u64)) = 12
>    => NLA_ALIGN(nla_attr_size(sizeof(__u64))) => 12 (= len)
>    => ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN) = 0 but it should be 4

We can't add 1-3 bytes of padding, therefore we need to add
NLA_HDRLEN to len before aligning it to enforce a minimal
padding. We can't hit it right now because 4 byte alignment
of the previous attribute is a given but if we ever change
the alignment it could become an issue and the above should
be bullet proof.

Your example would come out like this:
  nla_attr_size(8) = 12
  ALIGN(12 + 4, 8) = 16

> 2/ Suppose that the attribute is:
> 
>   struct foo {
>   	__u64 bar1;
>   	__u32 bar2;
>   }
>   => sizeof(struct foo) = 12 (= payload)
>   => nla_attr_size(payload) = 16
>   => NLA_ALIGN(nla_attr_size(payload)) = 16 (= len)
>   => IS_ALIGNED(len, NLA_ATTR_ALIGN) = true
>   => extra room is not reserved
>   But it's not guaranteed that bar1 is aligned on 8 bytes, only on 4 bytes.

That's correct, that's why I have added the additional
NLA_ATTR_ALIGN of room in nlmsg_new(). It will account
for the one time padding that is needed before we add
the very first attribute.

If all attributes after that have a size aligned to 8
bytes no padding is needed. Padding will only be needed
again if a struct is missized in which case we reserve
room with the above. Correct?

> >+	offset = (size_t) skb_tail_pointer(skb);
> >+	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN)) {
> With the previous struct foo, this test may be true even if we don't
> have reserved extra room. This test depends on previous attribute.
> I think the exact size of the netlink message depends on the order
> of attributes, not only on the attribute itself.
> What about taking the assumption that the start will never be
> aligned and always allocating extra room: ALIGN(NLA_ALIGNTO,
> NLA_ATTR_ALIGN) (= 4)?

See my explanation above. I think this works. The order does not
matter, the sum of all padding required will always be the same.

> >+static bool nla_insufficient_space(struct sk_buff *skb, int attrlen)
> >+{
> >+	size_t needed = nla_pre_padlen(skb) + nla_total_size(attrlen);
> If nla_total_size() was right, nla_pre_padlen(skb) should already be
> included. Am I wrong?

No, nla_pre_padlen() contains the number of bytes needed to align
skb_tail_pointer() to an alignment of 8. If that is > 0 but the
attribute to follow is already aligned.

The tricky part here is that accounting for padding in
nla_total_size() only works for the sum of all attributes.
It does not account for the specific padding required for the
previous attribute.

Therefore the above check. The above could be changed to
nla_attr_size() theoretically as we don't need space for the
final padding eventually but we checked for space before so I
kept it that way.

I realize it's slightly confusign and needs better documentation
and please double check my thinking :-)

^ permalink raw reply

* Re: [PATCH 04/11] bridge: Cache vlan in the cb for faster egress lookup.
From: Stephen Hemminger @ 2012-12-18 17:04 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, davem, mst, john.r.fastabend
In-Reply-To: <1355342477-4971-5-git-send-email-vyasevic@redhat.com>

On Wed, 12 Dec 2012 15:01:10 -0500
Vlad Yasevich <vyasevic@redhat.com> wrote:

> On input, cache the pointer to the bridge vlan info, so that
> on egress, we have can simply look at the port bitmap instead
> of traversing a vlan list.
> 
> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>

This isn't going to be safe. Once packet is passed up, the cb[]
can get overwritten by other things.

^ permalink raw reply

* Re: [PATCH 11/11] bridge: Dump vlan information from a bridge port
From: Stephen Hemminger @ 2012-12-18 17:03 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: netdev, davem, mst, john.r.fastabend
In-Reply-To: <1355342477-4971-12-git-send-email-vyasevic@redhat.com>

On Wed, 12 Dec 2012 15:01:17 -0500
Vlad Yasevich <vyasevic@redhat.com> wrote:

> Using the RTM_GETLINK dump the vlan filter list of a given
> bridge port.  The information depends on setting the filter
> flag similar to how nic VF info is dumped.
> 
> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>

I will put these in the bridge tree with other patches that
are being staged for net-next.

^ permalink raw reply

* RE: [PATCH v2] netlink: align attributes on 64-bits
From: David Laight @ 2012-12-18 16:50 UTC (permalink / raw)
  To: nicolas.dichtel, Thomas Graf; +Cc: bhutchings, netdev, davem
In-Reply-To: <50D09897.8030508@6wind.com>

> 2/ Suppose that the attribute is:
> 
>    struct foo {
>    	__u64 bar1;
>    	__u32 bar2;
>    }
>    => sizeof(struct foo) = 12 (= payload)

That is only true if the host architecture aligns 64bit items
on 32 it boundaries (as i386 does).
Otherwise there are 4 bytes of padding at the end and the
size is 16.

Actually it is worse than that.
Consider the structure:
	struct bar {
		__u32 foo1;
		__u64 foo2;
	}
On i386 it will have size 12 and foo2 will be at offset 4.
On sparc32 (and most 64bit) it will have size 16 with foo2
at offset 8 (and 4 bytes of pad after foo1).

Do these messages move between systems?
If they do then any 64bit items need an explicit alignment
eg tag with __attribute__((aligned(8))) (or aligned(4)).

	David

^ permalink raw reply

* Re: inaccurate packet scheduling
From: Eric Dumazet @ 2012-12-18 16:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jiri Pirko, jhs, davem, edumazet, tgraf, netdev
In-Reply-To: <e2f71dc7-bee5-41ca-bd64-f4569fa953da@tahiti.vyatta.com>

On Tue, 2012-12-18 at 08:26 -0800, Stephen Hemminger wrote:

> Check kernel log for messages about clock. It could be that on the
> machines with issues TSC is not usable for kernel clock.
> Also turn off TSO since it screws up any form of rate control.

Yes, but we should fix it eventually. I'll take a look.

^ permalink raw reply

* RE: TCP delayed ACK heuristic
From: David Laight @ 2012-12-18 16:39 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger,
	Rick Jones, Thomas Graf
In-Reply-To: <2088500005.27728019.1355843484620.JavaMail.root@redhat.com>

> David's point is that we can do some heuristics for TCP
> delayed ACK, so the question is that what kind of heuristics
> can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACK.

There are problems with only implementing the acks
specified by RFC1122.

I've seen problems when the sending side is doing (I think)
'slow start' with Nagle disabled.
The sender would only send 4 segments before waiting for an
ACK - even when it had more than a full sized segment waiting.
Sender was Linux 2.6.something (probably low 20s).
I changed the application flow to send data in the reverse
direction to avoid the problem.
That was on a ~0 delay local connection - which means that
there is almost never outstanding data, and the 'slow start'
happened almost all the time.
Nagle is completely the wrong algorithm for the data flow.

	David

^ permalink raw reply

* Re: TCP delayed ACK heuristic
From: Eric Dumazet @ 2012-12-18 16:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones,
	Thomas Graf
In-Reply-To: <2088500005.27728019.1355843484620.JavaMail.root@redhat.com>

On Tue, 2012-12-18 at 10:11 -0500, Cong Wang wrote:
> Hello, TCP experts!
> 
> Some time ago, Ben sent a patch [1] to add some knobs for
> tuning TCP delayed ACK, but rejected by David.
> 
> David's point is that we can do some heuristics for TCP
> delayed ACK, so the question is that what kind of heuristics
> can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACK.
> 
> For the timeout of a delayed ACK, my idea is guessing how many
> packet we suppose to receive is the TCP stream is fully utilized,
> something like below:
> 
> +static inline u32 tcp_expect_packets(struct sock *sk)
> +{
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       int rtt = tp->srtt >> 3;
> +       u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime;
> +
> +       return idle * 2 / rtt;
> +}
> ...
> +       ato -= tcp_expect_packets(sk) * delta;
> 
> 
> The more we expect, the less we should delay. However this is
> not accurate due to congestion control.
> 
> Meanwhile, we can also check how many packets are pending in TCP
> sending queue, the more we pend, the more we can piggyback with
> a single ACK, but not beyond how much we are able to send at
> that time.
> 
> Comments? Ideas?
> 

ACKS might also be delayed because of bidirectional traffic, and is more
controlled by the application response time. TCP stack can not easily
estimate it.

If you focus on bulk receive, LRO/GRO should already lower number of
ACKS to an acceptable level and without major disruption.

Stretch acks are not only the receiver concern, there are issues for the
sender that you cannot always control/change.

I recommend reading RFC2525 2.13

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-18 16:35 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355847758.14620.265.camel@zakaz.uk.xensource.com>

On Tue, 2012-12-18 at 16:22 +0000, Ian Campbell wrote:
> On Tue, 2012-12-18 at 16:13 +0000, Eric Dumazet wrote:
> > On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:
> > 
> > > So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> > > 
> > 
> > I dont know what are the real frag sizes in your case.
> 
> I think it's a page, see xennet_alloc_rx_buffers and the alloc_page
> therein.
> 

If they are order-0 pages, then PAGE_SIZE * nr_frags  is OK.


> > Some drivers allocate a full page for an ethernet frame, others use half
> > of a page, it really depends.
> > 

> > As the frag ABI doesnt contain real size, its ok in this case to account
> > the actual frag size.
> > 
> > (skb->data_len in your driver)
> 
> I guess I'm a bit confused by what truesize means again then ;-),
> because in that case the original patch is correct although it would
> have been less confusing to do:
> 	skb->truesize += skb->data_len; 
> in xennet_poll() and then do the subtraction of
> NETFRONT_SKB_CB(skb)->pull_to in handle_incoming_queue() where we
> actually do the pull up.
> 
> Unless __pskb_pull_tail does that adjustment for us, but if it does I
> can't see where.

Thats because skb frags only contain :

a page pointer.
An offset
A size. (Exact number or used bytes in this frag)

And not the 'originally allocated size. It could be 256, 768, 2048,
4096, 65536 bytes, nobody but the driver really knows.

So when we pull X bytes from a fragment to skb->head, there is no way to
remember what was the original size of the fragment.

Only the driver allocating the frag knows its truesize.

Once skb is given to the stack, we lose this information, and rely on
skb->truesize being an accurate estimation.

^ permalink raw reply

* Re: [PATCH v2] netlink: align attributes on 64-bits
From: Nicolas Dichtel @ 2012-12-18 16:23 UTC (permalink / raw)
  To: Thomas Graf; +Cc: bhutchings, netdev, davem, David.Laight
In-Reply-To: <20121218125735.GG27746@casper.infradead.org>

Le 18/12/2012 13:57, Thomas Graf a écrit :
> On 12/17/12 at 05:49pm, Nicolas Dichtel wrote:
>> @@ -492,6 +492,15 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>>    */
>>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>>   {
>> +	/* Because attributes may be aligned on 64-bits boundary with fake
>> +	 * attribute (type 0, size 4 (attributes are 32-bits align by default)),
>> +	 * an exact payload size cannot be calculated. Hence, we need to reserve
>> +	 * more space for these attributes.
>> +	 * 128 is arbitrary: it allows to align up to 32 attributes.
>> +	 */
>> +	if (payload < NLMSG_DEFAULT_SIZE)
>> +		payload = min(payload + 128, (size_t)NLMSG_DEFAULT_SIZE);
>
> This is doomed to fail eventually. A netlink message may carry
> hundreds of attributes eventually. See my suggestion below.
>
>> diff --git a/lib/nlattr.c b/lib/nlattr.c
>> index 18eca78..7440a80 100644
>> --- a/lib/nlattr.c
>> +++ b/lib/nlattr.c
>> @@ -450,9 +450,18 @@ EXPORT_SYMBOL(__nla_put_nohdr);
>>    */
>>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
>>   {
>> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
>> +	int align = IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) ? 0 : 4;
>
> This does not look right. In order for the attribute data to be
> aligned properly you would need to check skb_tail_pointer(skb) +
> NLA_HDRLEN for proper alignment or you end up aligning the
> attribute header.
Good point.

>
> How about we change nla_total_size() to return the size with
> needed padding taken into account. That should fix the message
> size caluclation problem and we only need to reserve room for
> the initial padding to align the very first attribute.
>
> Below is an untested patch that does this. What do you think?
I still have some doubts about the size calculation (see bellow).
For the rest of the patch, it seems ok (except some minor point). I will test it.

>
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 9690b0f..7ce8e76 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -114,7 +114,6 @@
>    * Attribute Length Calculations:
>    *   nla_attr_size(payload)		length of attribute w/o padding
>    *   nla_total_size(payload)		length of attribute w/ padding
> - *   nla_padlen(payload)		length of padding
>    *
>    * Attribute Payload Access:
>    *   nla_data(nla)			head of attribute payload
> @@ -492,6 +491,14 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>    */
>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>   {
> +	/* If an exact size if specified, reserve some additional space to
> +	 * align the first attribute, all subsequent attributes should have
> +	 * padding accounted for.
> +	 */
> +	if (payload != NLMSG_DEFAULT_SIZE)
> +		payload = min_t(size_t, payload + NLA_ATTR_ALIGN,
> +				NLMSG_DEFAULT_SIZE);
> +
>   	return alloc_skb(nlmsg_total_size(payload), flags);
>   }
>
> @@ -653,16 +660,12 @@ static inline int nla_attr_size(int payload)
>    */
>   static inline int nla_total_size(int payload)
>   {
> -	return NLA_ALIGN(nla_attr_size(payload));
> -}
> +	size_t len = NLA_ALIGN(nla_attr_size(payload));
>
> -/**
> - * nla_padlen - length of padding at the tail of attribute
> - * @payload: length of payload
> - */
> -static inline int nla_padlen(int payload)
> -{
> -	return nla_total_size(payload) - nla_attr_size(payload);
> +	if (!IS_ALIGNED(len, NLA_ATTR_ALIGN))
> +		len = ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN);
Two comments:
1/ should it be ALIGN(len, NLA_ATTR_ALIGN)? If we want to add a __u64:
    => nla_attr_size(sizeof(__u64)) = 12
    => NLA_ALIGN(nla_attr_size(sizeof(__u64))) => 12 (= len)
    => ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN) = 0 but it should be 4

2/ Suppose that the attribute is:

   struct foo {
   	__u64 bar1;
   	__u32 bar2;
   }
   => sizeof(struct foo) = 12 (= payload)
   => nla_attr_size(payload) = 16
   => NLA_ALIGN(nla_attr_size(payload)) = 16 (= len)
   => IS_ALIGNED(len, NLA_ATTR_ALIGN) = true
   => extra room is not reserved
   But it's not guaranteed that bar1 is aligned on 8 bytes, only on 4 bytes.

> +
> +	return len;
>   }
>
>   /**
> diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
> index 78d5b8a..1856729 100644
> --- a/include/uapi/linux/netlink.h
> +++ b/include/uapi/linux/netlink.h
> @@ -149,5 +149,10 @@ struct nlattr {
>   #define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
>   #define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
>
> +/* Padding attribute type, added automatically to align attributes,
> + * must be ignored by readers. */
> +#define NLA_PADDING		0
> +#define NLA_ATTR_ALIGN		8
> +
>
>   #endif /* _UAPI__LINUX_NETLINK_H */
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index 18eca78..b09473c 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -322,18 +322,36 @@ int nla_strcmp(const struct nlattr *nla, const char *str)
>    * Adds a netlink attribute header to a socket buffer and reserves
>    * room for the payload but does not copy it.
>    *
> + * May add a padding attribute of type NLA_PADDING before the
> + * real attribute to ensure proper alignment.
> + *
>    * The caller is responsible to ensure that the skb provides enough
>    * tailroom for the attribute header and payload.
>    */
>   struct nlattr *__nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
>   {
>   	struct nlattr *nla;
> +	size_t offset;
> +
> +	offset = (size_t) skb_tail_pointer(skb);
> +	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN)) {
With the previous struct foo, this test may be true even if we don't have 
reserved extra room. This test depends on previous attribute.
I think the exact size of the netlink message depends on the order of 
attributes, not only on the attribute itself.
What about taking the assumption that the start will never be aligned and always 
allocating extra room: ALIGN(NLA_ALIGNTO, NLA_ATTR_ALIGN) (= 4)?

> +		struct nlattr *pad;
> +		size_t padlen;
> +
> +		padlen = nla_total_size(offset) - offset - NLA_HDRLEN;
> +		pad = (struct nlattr *) skb_put(skb, nla_attr_size(padlen));
> +		pad->nla_type = 0;
> +		pad->nla_len = nla_attr_size(padlen);
> +
> +		memset((unsigned char *) pad + NLA_HDRLEN, 0, padlen);
> +	}
>
> -	nla = (struct nlattr *) skb_put(skb, nla_total_size(attrlen));
> +	nla = (struct nlattr *) skb_put(skb, NLA_ALIGN(nla_attr_size(attrlen)));
>   	nla->nla_type = attrtype;
>   	nla->nla_len = nla_attr_size(attrlen);
>
> -	memset((unsigned char *) nla + nla->nla_len, 0, nla_padlen(attrlen));
> +	memset((unsigned char *) nla + nla->nla_len, 0,
> +	       NLA_ALIGN(nla->nla_len) - nla->nla_len);
>
>   	return nla;
>   }
> @@ -360,6 +378,23 @@ void *__nla_reserve_nohdr(struct sk_buff *skb, int attrlen)
>   }
>   EXPORT_SYMBOL(__nla_reserve_nohdr);
>
> +static size_t nla_pre_padlen(struct sk_buff *skb)
> +{
> +	size_t offset = (size_t) skb_tail_pointer(skb);
> +
> +	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN))
> +		return nla_total_size(offset) - offset - NLA_HDRLEN;
> +
> +	return 0;
> +}
> +
> +static bool nla_insufficient_space(struct sk_buff *skb, int attrlen)
> +{
> +	size_t needed = nla_pre_padlen(skb) + nla_total_size(attrlen);
If nla_total_size() was right, nla_pre_padlen(skb) should already be included. 
Am I wrong?

> +
> +	return (skb_tailroom(skb) < needed);
> +}
> +
>   /**
>    * nla_reserve - reserve room for attribute on the skb
>    * @skb: socket buffer to reserve room on
> @@ -374,7 +409,7 @@ EXPORT_SYMBOL(__nla_reserve_nohdr);
>    */
>   struct nlattr *nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
>   {
> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
> +	if (unlikely(nla_insufficient_space(skb, attrlen)))
>   		return NULL;
>
>   	return __nla_reserve(skb, attrtype, attrlen);
> @@ -450,7 +485,7 @@ EXPORT_SYMBOL(__nla_put_nohdr);
>    */
>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
>   {
> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
> +	if (unlikely(nla_insufficient_space(skb, attrlen)))
>   		return -EMSGSIZE;
>
>   	__nla_put(skb, attrtype, attrlen, data);
>

^ permalink raw reply

* Re: inaccurate packet scheduling
From: Stephen Hemminger @ 2012-12-18 16:26 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: jhs, davem, edumazet, tgraf, netdev
In-Reply-To: <20121218150407.GB1533@minipsycho.orion>



----- Original Message -----
> Hi all.
> 
> Run one of the following 2 scripts on machine A:
> 
> #!/bin/bash
> tc qdisc del dev eth0 root
> sleep 1
> tc -batch << EOF
> qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0
> qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 50
> qdisc add dev eth0 parent 1:2 handle 20 tbf latency 100ms rate 4mbit
> burst 2m
> filter add dev eth0 parent 1: protocol ip u32 match ip dst
> $machineB_ip flowid 1:2
> EOF
> 
> #!/bin/bash
> tc qdisc del dev eth0 root
> sleep 1
> tc -batch << EOF
> qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0
> qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 20
> qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 20
> filter add dev eth0 parent 1: protocol ip pref 10 \
> u32 match ip dst $machineB_ip \
> flowid 1:2 \
> police rate 4Mbit burst 2m conform-exceed drop
> EOF
> 
> And run:
> [machineB ~]# iperf -s
> [machineA ~]# iperf -c machineB_ip -t 60
> 
> Expected results are: ~3.8-4.2 Mbits/s
> But actual results are: ~130-170 Kbits/s with tbf, ~70-300 Kbits/s
> with policy rate
> 
> [machineA ~]# tc -s qdisc list dev eth0
> qdisc prio 1: root refcnt 9 bands 2 priomap  0 0 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0
>  Sent 1512384 bytes 1032 pkt (dropped 729, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo 10: parent 1:1 limit 50p
>  Sent 4560 bytes 32 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc tbf 20: parent 1:2 rate 4000Kbit burst 2Mb lat 100.0ms
>  Sent 1507824 bytes 1000 pkt (dropped 729, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> 
> Tested with kernel pulled from linus's git today. This happens with
> older
> kernels as well (I tried 2.6.32-based rhel6 kernels).
> 
> This happens to me on following machines:
> HP DL360G8 (x86_64) http://people.redhat.com/jpirko/aThoo2Ei/dl380g8/
> HP DL360G3 (i686)
> IBM JS22 (ppc64) http://people.redhat.com/jpirko/aThoo2Ei/ibmjs22/
> 
> On following machines, I do not observe this issue:
> qemu kvm (x86_64)
> IBM Zseries (s390x) http://people.redhat.com/jpirko/aThoo2Ei/ibmz/
> 
> Please ask in case you need me to provide any other details.
> 
> Thanks.

Check kernel log for messages about clock. It could be that on the
machines with issues TSC is not usable for kernel clock.
Also turn off TSO since it screws up any form of rate control.

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Ian Campbell @ 2012-12-18 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355847180.9380.21.camel@edumazet-glaptop>

On Tue, 2012-12-18 at 16:13 +0000, Eric Dumazet wrote:
> On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:
> 
> > So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> > 
> 
> I dont know what are the real frag sizes in your case.

I think it's a page, see xennet_alloc_rx_buffers and the alloc_page
therein.

> Some drivers allocate a full page for an ethernet frame, others use half
> of a page, it really depends.
> 
> As the frag ABI doesnt contain real size, its ok in this case to account
> the actual frag size.
> 
> (skb->data_len in your driver)

I guess I'm a bit confused by what truesize means again then ;-),
because in that case the original patch is correct although it would
have been less confusing to do:
	skb->truesize += skb->data_len; 
in xennet_poll() and then do the subtraction of
NETFRONT_SKB_CB(skb)->pull_to in handle_incoming_queue() where we
actually do the pull up.

Unless __pskb_pull_tail does that adjustment for us, but if it does I
can't see where.

Ian.

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-18 16:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355844398.14620.254.camel@zakaz.uk.xensource.com>

On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:

> So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> 

I dont know what are the real frag sizes in your case.

Some drivers allocate a full page for an ethernet frame, others use half
of a page, it really depends.

As the frag ABI doesnt contain real size, its ok in this case to account
the actual frag size.

(skb->data_len in your driver)

^ permalink raw reply

* Re: [PATCH V4 0/5] Add LE hash collision bug fix for active and passive offloaded connections
From: Steve Wise @ 2012-12-18 16:11 UTC (permalink / raw)
  To: Vipul Pandya
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, abhishek-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <1355131856-29142-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

On 12/10/2012 3:30 AM, Vipul Pandya wrote:
> This patch series fixes the LE hash collision issue in cxgb4 and RDMA/cxgb4
> drivers in kernel.org.
>
> If the hash functionality is enabled in T4 then tuple information of active and
> passive offloaded connections are stored in DDR3 memory. LE (Lookup Engine)
> implements the interface to search this tuple entries using hash algorithm. To
> avoid LE hash collision, firmware provides new mechanisms to setup active and
> passive open connections.
>
> In case of active open connection, firmware detects LE hash collision situation
> and notifies driver. Driver uses fw_ofld_connection work request to offload
> that connection. Its tuple information will be stored in TCAM memory array.
>
> In case of passive open connection, server filter region is created in TCAM.
> This region stores the filter which will redirect the incoming SYN packet to
> offload queues. After this driver tries to establish the connection using
> firmware work request.
>
> This patch series also adds framework for managing filters and to use T4's
> filter capabilities.
>
> Following testing has been carried out successfully on this patch series.
> 1. 1000 active open connections created and saw that tcam_full counter is
> getting incremented through debugfs file.
> 2. Multiple passive open connections were created and ran the same test
> repetatively to verify server filter is getting created and deleted properly.
>
> The patch-series is based on Roland's infiniband tree (for-next branch),
> and involves changes on cxgb4 and RDMA/cxgb4 drivers. Both linux-rdma and netdev
> are included in this post for review.
>
> We have some more bug fixes in RDMA/cxgb4 after this patch series. So, we would
> like to request this series to get mereged through Roland's infiniband for-next
> tree.
>
> V2: Changed commenting style from kernel-doc format('/**') to normal
> V3: Resending the whole patch series again with the missing patches in V2
> V4: Changed comments for networking from '/*' to '/* Comment' format.
>
> Thanks,
> Vipul Pandya
>
> Vipul Pandya (5):
>    cxgb4: Add T4 filter support
>    cxgb4: Add LE hash collision bug fix path in LLD driver
>    RDMA/cxgb4: Fix LE hash collision bug for active open connection
>    RDMA/cxgb4: Fix LE hash collision bug for passive open connection
>    RDMA/cxgb4: Fix bug for active and passive LE hash collision path
>
>   drivers/infiniband/hw/cxgb4/cm.c                |  789 +++++++++++++++++++---
>   drivers/infiniband/hw/cxgb4/device.c            |  210 ++++++-
>   drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   33 +
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  146 +++++
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  473 +++++++++++++-
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   23 +-
>   drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +
>   drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
>   drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 +-
>   drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   66 ++
>   drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   37 ++
>   drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  388 +++++++++++
>   12 files changed, 2100 insertions(+), 125 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH V4 0/5] Add LE hash collision bug fix for active and passive offloaded connections
From: Vipul Pandya @ 2012-12-18 16:07 UTC (permalink / raw)
  To: roland@purestorage.com, davem@davemloft.net
  Cc: Vipul Pandya, linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Divy Le Ray, Dimitrios Michailidis, Kumar A S, Steve Wise,
	Abhishek Agrawal
In-Reply-To: <1355131856-29142-1-git-send-email-vipul@chelsio.com>

Hi Roland and David,

Any feedback on this?

Thanks,
Vipul

On 10-12-2012 15:00, Vipul Pandya wrote:
> This patch series fixes the LE hash collision issue in cxgb4 and RDMA/cxgb4
> drivers in kernel.org.
> 
> If the hash functionality is enabled in T4 then tuple information of active and
> passive offloaded connections are stored in DDR3 memory. LE (Lookup Engine)
> implements the interface to search this tuple entries using hash algorithm. To
> avoid LE hash collision, firmware provides new mechanisms to setup active and
> passive open connections.
> 
> In case of active open connection, firmware detects LE hash collision situation
> and notifies driver. Driver uses fw_ofld_connection work request to offload
> that connection. Its tuple information will be stored in TCAM memory array. 
> 
> In case of passive open connection, server filter region is created in TCAM.
> This region stores the filter which will redirect the incoming SYN packet to
> offload queues. After this driver tries to establish the connection using
> firmware work request.
> 
> This patch series also adds framework for managing filters and to use T4's
> filter capabilities.
> 
> Following testing has been carried out successfully on this patch series.
> 1. 1000 active open connections created and saw that tcam_full counter is
> getting incremented through debugfs file.
> 2. Multiple passive open connections were created and ran the same test
> repetatively to verify server filter is getting created and deleted properly.
> 
> The patch-series is based on Roland's infiniband tree (for-next branch),
> and involves changes on cxgb4 and RDMA/cxgb4 drivers. Both linux-rdma and netdev
> are included in this post for review.
> 
> We have some more bug fixes in RDMA/cxgb4 after this patch series. So, we would
> like to request this series to get mereged through Roland's infiniband for-next
> tree.
> 
> V2: Changed commenting style from kernel-doc format('/**') to normal
> V3: Resending the whole patch series again with the missing patches in V2
> V4: Changed comments for networking from '/*' to '/* Comment' format.
> 
> Thanks,
> Vipul Pandya
> 
> Vipul Pandya (5):
>   cxgb4: Add T4 filter support
>   cxgb4: Add LE hash collision bug fix path in LLD driver
>   RDMA/cxgb4: Fix LE hash collision bug for active open connection
>   RDMA/cxgb4: Fix LE hash collision bug for passive open connection
>   RDMA/cxgb4: Fix bug for active and passive LE hash collision path
> 
>  drivers/infiniband/hw/cxgb4/cm.c                |  789 +++++++++++++++++++---
>  drivers/infiniband/hw/cxgb4/device.c            |  210 ++++++-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   33 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  146 +++++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  473 +++++++++++++-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   23 +-
>  drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +
>  drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 +-
>  drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   66 ++
>  drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   37 ++
>  drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  388 +++++++++++
>  12 files changed, 2100 insertions(+), 125 deletions(-)
> 

^ permalink raw reply

* Re: network namespace and DNS lookups
From: Ravi Aysola @ 2012-12-18 15:49 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAFaHj6GVgDf2QXUWyaKYnvOqLJRjG+kA32eEk5psPkmc-RPY1Q@mail.gmail.com>

I think I sent my earlier email a bit prematurely.  I do have
/etc/netns/<namespace-name>/resolv.conf
files under each of my namespaces. Now the question is, how does a
user space process
(say bind)  look at a namespace specific resolv.conf instead of
default one?  Have any of
these standard applications been modified to work with namespace
specific config files?

thanks again
ravi/

On Tue, Dec 18, 2012 at 10:09 AM, Ravi Aysola <ravi.mlists@gmail.com> wrote:
> Has there been any work in any of the recent kernels to limit the DNS lookup
> to a particular network namespace?  Do we have any facility to specify the
> DNS resolvers on network namespace basis (such as /etc/ns/resolv.conf)?
>
> thank you
> ravi/

^ permalink raw reply

* inaccurate packet scheduling
From: Jiri Pirko @ 2012-12-18 15:04 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem, edumazet, tgraf, shemminger

Hi all.

Run one of the following 2 scripts on machine A:

#!/bin/bash
tc qdisc del dev eth0 root
sleep 1
tc -batch << EOF
qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 50
qdisc add dev eth0 parent 1:2 handle 20 tbf latency 100ms rate 4mbit burst 2m
filter add dev eth0 parent 1: protocol ip u32 match ip dst $machineB_ip flowid 1:2
EOF

#!/bin/bash
tc qdisc del dev eth0 root
sleep 1
tc -batch << EOF
qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 20
qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 20
filter add dev eth0 parent 1: protocol ip pref 10 \
u32 match ip dst $machineB_ip \
flowid 1:2 \
police rate 4Mbit burst 2m conform-exceed drop
EOF

And run:
[machineB ~]# iperf -s
[machineA ~]# iperf -c machineB_ip -t 60

Expected results are: ~3.8-4.2 Mbits/s
But actual results are: ~130-170 Kbits/s with tbf, ~70-300 Kbits/s with policy rate

[machineA ~]# tc -s qdisc list dev eth0
qdisc prio 1: root refcnt 9 bands 2 priomap  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 Sent 1512384 bytes 1032 pkt (dropped 729, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc pfifo 10: parent 1:1 limit 50p
 Sent 4560 bytes 32 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc tbf 20: parent 1:2 rate 4000Kbit burst 2Mb lat 100.0ms 
 Sent 1507824 bytes 1000 pkt (dropped 729, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 


Tested with kernel pulled from linus's git today. This happens with older
kernels as well (I tried 2.6.32-based rhel6 kernels).

This happens to me on following machines:
HP DL360G8 (x86_64) http://people.redhat.com/jpirko/aThoo2Ei/dl380g8/
HP DL360G3 (i686) 
IBM JS22 (ppc64) http://people.redhat.com/jpirko/aThoo2Ei/ibmjs22/

On following machines, I do not observe this issue:
qemu kvm (x86_64)
IBM Zseries (s390x) http://people.redhat.com/jpirko/aThoo2Ei/ibmz/

Please ask in case you need me to provide any other details.

Thanks.

Jiri

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Ian Campbell @ 2012-12-18 15:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355843525.9380.18.camel@edumazet-glaptop>

On Tue, 2012-12-18 at 15:12 +0000, Eric Dumazet wrote:
> On Tue, 2012-12-18 at 13:51 +0000, Ian Campbell wrote:
> > Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
> > than that. We have already accounted for this in
> > NETFRONT_SKB_CB(skb)->pull_to so use that instead.
> > 
> > Fixes WARN_ON from skb_try_coalesce.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Sander Eikelenboom <linux@eikelenboom.it>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: annie li <annie.li@oracle.com>
> > Cc: xen-devel@lists.xensource.com
> > Cc: netdev@vger.kernel.org
> > Cc: stable@kernel.org # 3.7.x only
> > ---
> >  drivers/net/xen-netfront.c |   15 +++++----------
> >  1 files changed, 5 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index caa0110..b06ef81 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -971,17 +971,12 @@ err:
> >  		 * overheads. Here, we add the size of the data pulled
> >  		 * in xennet_fill_frags().
> >  		 *
> > -		 * We also adjust for any unused space in the main
> > -		 * data area by subtracting (RX_COPY_THRESHOLD -
> > -		 * len). This is especially important with drivers
> > -		 * which split incoming packets into header and data,
> > -		 * using only 66 bytes of the main data area (see the
> > -		 * e1000 driver for example.)  On such systems,
> > -		 * without this last adjustement, our achievable
> > -		 * receive throughout using the standard receive
> > -		 * buffer size was cut by 25%(!!!).
> > +		 * We also adjust for the __pskb_pull_tail done in
> > +		 * handle_incoming_queue which pulls data from the
> > +		 * frags into the head area, which is already
> > +		 * accounted in RX_COPY_THRESHOLD.
> >  		 */
> > -		skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> > +		skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
> >  		skb->len += skb->data_len;
> >  
> >  		if (rx->flags & XEN_NETRXF_csum_blank)
> 
> 
> But skb truesize is not what you think.

Indeed, it seems I was completely backwards about what it means!

> You must account the exact memory used by this skb, not only the used
> part of it.
> 
> At the very minimum, it should be
> 
> skb->truesize += skb->data_len;
> 
> But it really should be the allocated size of the fragment.
> 
> If its a page, then its a page, even if you use one single byte in it.

So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?

Sander, can you try that change?

Ian.

^ permalink raw reply

* [PATCH v2 2/2] net: asix: handle packets crossing URB boundaries
From: Lucas Stach @ 2012-12-18 15:21 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Oliver Neukum, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1355844083-14285-1-git-send-email-dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>

ASIX AX88772B started to pack data even more tightly. Packets and the ASIX packet
header may now cross URB boundaries. To handle this we have to introduce
some state between individual calls to asix_rx_fixup().

Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
---
I've running this patch for some weeks already now and it gets rid of all
the commonly seen rx failures with AX88772B.

v2: don't forget to free driver_private
---
 drivers/net/usb/asix.h         | 11 ++++++
 drivers/net/usb/asix_common.c  | 81 +++++++++++++++++++++++++++++-------------
 drivers/net/usb/asix_devices.c | 17 +++++++++
 3 Dateien geändert, 85 Zeilen hinzugefügt(+), 24 Zeilen entfernt(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index 7afe8ac..4dfdbf6 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -167,6 +167,17 @@ struct asix_data {
 	u8 res;
 };
 
+struct asix_rx_fixup_info {
+	struct sk_buff *ax_skb;
+	u32 header;
+	u16 size;
+	bool split_head;
+};
+
+struct asix_private {
+	struct asix_rx_fixup_info rx_fixup_info;
+};
+
 /* ASIX specific flags */
 #define FLAG_EEPROM_MAC		(1UL << 0)  /* init device MAC from eeprom */
 
diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 50d1673..17f9801 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -53,44 +53,77 @@ void asix_write_cmd_async(struct usbnet *dev, u8 cmd, u16 value, u16 index,
 
 int asix_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 {
+	struct asix_private *dp = dev->driver_priv;
+	struct asix_rx_fixup_info *rx = &dp->rx_fixup_info;
 	int offset = 0;
 
-	while (offset + sizeof(u32) < skb->len) {
-		struct sk_buff *ax_skb;
-		u16 size;
-		u32 header = get_unaligned_le32(skb->data + offset);
-
-		offset += sizeof(u32);
-
-		/* get the packet length */
-		size = (u16) (header & 0x7ff);
-		if (size != ((~header >> 16) & 0x07ff)) {
-			netdev_err(dev->net, "asix_rx_fixup() Bad Header Length\n");
-			return 0;
+	while (offset + sizeof(u16) <= skb->len) {
+		u16 remaining = 0;
+		unsigned char *data;
+
+		if (!rx->size) {
+			if ((skb->len - offset == sizeof(u16)) ||
+			    rx->split_head) {
+				if(!rx->split_head) {
+					rx->header = get_unaligned_le16(
+							skb->data + offset);
+					rx->split_head = true;
+					offset += sizeof(u16);
+					break;
+				} else {
+					rx->header |= (get_unaligned_le16(
+							skb->data + offset)
+							<< 16);
+					rx->split_head = false;
+					offset += sizeof(u16);
+				}
+			} else {
+				rx->header = get_unaligned_le32(skb->data +
+								offset);
+				offset += sizeof(u32);
+			}
+
+			/* get the packet length */
+			rx->size = (u16) (rx->header & 0x7ff);
+			if (rx->size != ((~rx->header >> 16) & 0x7ff)) {
+				netdev_err(dev->net, "asix_rx_fixup() Bad Header Length 0x%x, offset %d\n",
+					   rx->header, offset);
+				rx->size = 0;
+				return 0;
+			}
+			rx->ax_skb = netdev_alloc_skb_ip_align(dev->net,
+							       rx->size);
+			if (!rx->ax_skb)
+				return 0;
 		}
 
-		if ((size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) ||
-		    (size + offset > skb->len)) {
+		if (rx->size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) {
 			netdev_err(dev->net, "asix_rx_fixup() Bad RX Length %d\n",
-				   size);
+				   rx->size);
+			kfree_skb(rx->ax_skb);
 			return 0;
 		}
-		ax_skb = netdev_alloc_skb_ip_align(dev->net, size);
-		if (!ax_skb)
-			return 0;
 
-		skb_put(ax_skb, size);
-		memcpy(ax_skb->data, skb->data + offset, size);
-		usbnet_skb_return(dev, ax_skb);
+		if (rx->size > skb->len - offset) {
+			remaining = rx->size - (skb->len - offset);
+			rx->size = skb->len - offset;
+		}
+
+		data = skb_put(rx->ax_skb, rx->size);
+		memcpy(data, skb->data + offset, rx->size);
+		if (!remaining)
+			usbnet_skb_return(dev, rx->ax_skb);
 
-		offset += (size + 1) & 0xfffe;
+		offset += (rx->size + 1) & 0xfffe;
+		rx->size = remaining;
 	}
 
 	if (skb->len != offset) {
-		netdev_err(dev->net, "asix_rx_fixup() Bad SKB Length %d\n",
-			   skb->len);
+		netdev_err(dev->net, "asix_rx_fixup() Bad SKB Length %d, %d\n",
+			   skb->len, offset);
 		return 0;
 	}
+
 	return 1;
 }
 
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 0ecc3bc..c651243 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -495,9 +495,19 @@ static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
 		dev->rx_urb_size = 2048;
 	}
 
+	dev->driver_priv = kzalloc(sizeof(struct asix_private), GFP_KERNEL);
+	if (!dev->driver_priv)
+		return -ENOMEM;
+
 	return 0;
 }
 
+void ax88772_unbind(struct usbnet *dev, struct usb_interface *intf)
+{
+	if (dev->driver_priv)
+		kfree(dev->driver_priv);
+}
+
 static const struct ethtool_ops ax88178_ethtool_ops = {
 	.get_drvinfo		= asix_get_drvinfo,
 	.get_link		= asix_get_link,
@@ -829,6 +839,10 @@ static int ax88178_bind(struct usbnet *dev, struct usb_interface *intf)
 		dev->rx_urb_size = 2048;
 	}
 
+	dev->driver_priv = kzalloc(sizeof(struct asix_private), GFP_KERNEL);
+	if (!dev->driver_priv)
+			return -ENOMEM;
+
 	return 0;
 }
 
@@ -875,6 +889,7 @@ static const struct driver_info hawking_uf200_info = {
 static const struct driver_info ax88772_info = {
 	.description = "ASIX AX88772 USB 2.0 Ethernet",
 	.bind = ax88772_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88772_link_reset,
 	.reset = ax88772_reset,
@@ -886,6 +901,7 @@ static const struct driver_info ax88772_info = {
 static const struct driver_info ax88772b_info = {
 	.description = "ASIX AX88772B USB 2.0 Ethernet",
 	.bind = ax88772_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88772_link_reset,
 	.reset = ax88772_reset,
@@ -899,6 +915,7 @@ static const struct driver_info ax88772b_info = {
 static const struct driver_info ax88178_info = {
 	.description = "ASIX AX88178 USB 2.0 Ethernet",
 	.bind = ax88178_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88178_link_reset,
 	.reset = ax88178_reset,
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 1/2] net: asix: init ASIX AX88772B MAC from EEPROM
From: Lucas Stach @ 2012-12-18 15:21 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Oliver Neukum, linux-usb-u79uwXL29TY76Z2rM5mHXA

The device comes up with a MAC address of all zeros. We need to read the
initial device MAC from EEPROM so it can be set properly later.

Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
---
A similar fix was added into U-Boot:
http://patchwork.ozlabs.org/patch/179409/

v2: pass flag in the data field instead of clobbering the global flags
    space
---
 drivers/net/usb/asix.h         |  3 +++
 drivers/net/usb/asix_devices.c | 30 +++++++++++++++++++++++++++---
 2 Dateien geändert, 30 Zeilen hinzugefügt(+), 3 Zeilen entfernt(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index e889631..7afe8ac 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -167,6 +167,9 @@ struct asix_data {
 	u8 res;
 };
 
+/* ASIX specific flags */
+#define FLAG_EEPROM_MAC		(1UL << 0)  /* init device MAC from eeprom */
+
 int asix_read_cmd(struct usbnet *dev, u8 cmd, u16 value, u16 index,
 		  u16 size, void *data);
 
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 7a6e758..0ecc3bc 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -422,14 +422,25 @@ static const struct net_device_ops ax88772_netdev_ops = {
 
 static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
 {
-	int ret, embd_phy;
+	int ret, embd_phy, i;
 	u8 buf[ETH_ALEN];
 	u32 phyid;
 
 	usbnet_get_endpoints(dev,intf);
 
 	/* Get the MAC address */
-	ret = asix_read_cmd(dev, AX_CMD_READ_NODE_ID, 0, 0, ETH_ALEN, buf);
+	if (dev->driver_info->data & FLAG_EEPROM_MAC) {
+		for (i = 0; i < (ETH_ALEN >> 1); i++) {
+			ret = asix_read_cmd(dev, AX_CMD_READ_EEPROM, 0x04 + i,
+					0, 2, buf + i * 2);
+			if (ret < 0)
+				break;
+		}
+	} else {
+		ret = asix_read_cmd(dev, AX_CMD_READ_NODE_ID,
+				0, 0, ETH_ALEN, buf);
+	}
+
 	if (ret < 0) {
 		netdev_dbg(dev->net, "Failed to read MAC address: %d\n", ret);
 		return ret;
@@ -872,6 +883,19 @@ static const struct driver_info ax88772_info = {
 	.tx_fixup = asix_tx_fixup,
 };
 
+static const struct driver_info ax88772b_info = {
+	.description = "ASIX AX88772B USB 2.0 Ethernet",
+	.bind = ax88772_bind,
+	.status = asix_status,
+	.link_reset = ax88772_link_reset,
+	.reset = ax88772_reset,
+	.flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_LINK_INTR |
+	         FLAG_MULTI_PACKET,
+	.rx_fixup = asix_rx_fixup,
+	.tx_fixup = asix_tx_fixup,
+	.data = FLAG_EEPROM_MAC,
+};
+
 static const struct driver_info ax88178_info = {
 	.description = "ASIX AX88178 USB 2.0 Ethernet",
 	.bind = ax88178_bind,
@@ -953,7 +977,7 @@ static const struct usb_device_id	products [] = {
 }, {
 	// ASIX AX88772B 10/100
 	USB_DEVICE (0x0b95, 0x772b),
-	.driver_info = (unsigned long) &ax88772_info,
+	.driver_info = (unsigned long) &ax88772b_info,
 }, {
 	// ASIX AX88772 10/100
 	USB_DEVICE (0x0b95, 0x7720),
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* network namespace and DNS lookups
From: Ravi Aysola @ 2012-12-18 15:09 UTC (permalink / raw)
  To: netdev

Has there been any work in any of the recent kernels to limit the DNS lookup
to a particular network namespace?  Do we have any facility to specify the
DNS resolvers on network namespace basis (such as /etc/ns/resolv.conf)?

thank you
ravi/

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox