Netdev List


* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Benjamin LaHaise @ 2014-12-15 18:03 UTC
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Jiri Pirko, sfeldma, tgraf, john.fastabend,
	stephen, linville, vyasevic, netdev, davem, shm, gospo
In-Reply-To: <548F1852.7090507@cumulusnetworks.com>

On Mon, Dec 15, 2014 at 09:20:18AM -0800, Roopa Prabhu wrote:
> On 12/15/14, 7:26 AM, Jamal Hadi Salim wrote:
> >
> >Sorry - I didn't quite follow the discussion, but I can see the value
> >of propagating things from parent to children netdevs as part of the
> >generic approach. And in that spirit:
> >
> >Ben's patches (and I am sure the cumulus folk do this) expose ports.
> >i.e. you boot up the hardware and you see ports. You can then put these
> >ports in a bridge and you can offload fdbs and do other parametrization
> >to the ASIC. IOW, this only becomes a bridge because you created one
> >in the kernel and attached bridge ports to it.
> >
> >Let's say I didn't want a bridge. I want instead to take these exposed
> >ports and create a bond (and maybe play with LACP). How does this
> >propagation from parent->child->child work then? I think the idea
> >of just bonding and not exposing it as a switch is a reasonable use
> >case.
> 
> We have not come to pure bonding and lacp yet (but I have mentioned it
> in many contexts before).
> The use case you mention is offloading bond attributes. This will be
> addressed as part of ongoing switchdev work for all other offloads
> (bonds, vxlans etc).
> Right now we are only talking about bridge port attribute offload
> (learn/flood/port state etc). This could still be a bridge port
> attribute on a bond when the bond is a bridge port.

This raises the question: do we track which attributes are configured 
onto a port in the switchdev code (in an attempt to maintain the state 
on behalf of the underlying device), or do we simply pass them in as 
attributes of the config request?  With stacking, I can see the need 
for different layers to add different attributes to the config for a 
given switch port.  Things like bonding would need to make note of the 
bond interface with a common identifier so that the switch can figure 
out to put the different ports into the same group.

The rtl8366s has support for one 802.3ad group, so it looks like I will 
need to tackle this.

		-ben

> >Also how does it work when I start doing L3 and the bond's port doesn't
> >support L3? Is it time to revive the thing we called TheThing in Du?
> Yes, exactly. This is what I had indicated in my initial emails on this
> thread when the stacked devices topic came up.
> Since there was reluctance to introduce a switch device (TheThing), my
> current patch tries to do that without a switch device.
> Since this is still L2, and we are dealing with bridge port attributes,
> my current patch traverses the stacked netdevs to call
> ndo_bridge_setlink on the switch port.
> 
> When it comes to L3, we can follow the same approach, but as discussed
> in Du, there are other reasons why a switch device becomes necessary.
> I can submit an alternate series to cover the switch device approach for
> L2 as well.
> 
> Thanks,
> Roopa
> 

-- 
"Thought is the essence of where you are now."


* Re: Bug: mv643xxx fails with highmem
From: Russell King - ARM Linux @ 2014-12-15 18:04 UTC
  To: fugang.duan@freescale.com
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <BLUPR03MB37339496B52755D0F32B002F5600@BLUPR03MB373.namprd03.prod.outlook.com>

On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> I will submit one patch to fix the issue.

There are more bugs in the FEC driver... here are the relevant bits:

static void
fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
{
        bdp = txq->dirty_tx;

        bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);

        while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
                /* current queue is empty */
                if (bdp == txq->cur_tx)
                        break;

                skb = txq->tx_skbuff[index];
                txq->tx_skbuff[index] = NULL;
                if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
                        dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
                                        bdp->cbd_datlen, DMA_TO_DEVICE);
                bdp->cbd_bufaddr = 0;
                if (!skb) {
                        bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
                        continue;
                }
...
                txq->dirty_tx = bdp;
                bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
        }

Consider the following code path:
- we enter this function
- get the dirty_tx pointer
- move to the next descriptor (which we'll call descriptor A)
- next descriptor indicates that TX_READY = 0
- bdp != txq->cur_tx
- we unmap if needed
- we set bdp->cbd_bufaddr = 0
- assume skb is NULL, so we move to the next descriptor (we'll call this B)
- next descriptor _may_ have TX_READY = 1
- we break out of the loop, and return

Some time later, we re-enter:
- get the dirty_tx pointer
- move to the next descriptor (which is descriptor A above)
- next descriptor indicates that TX_READY = 0
- bdp != txq->cur_tx
- we call dma_unmap_single(..., bdp->cbd_bufaddr, ...) with the
  cbd_bufaddr which we previously zeroed
  - the DMA API debugging complains that FEC is unmapping memory which it
    doesn't own
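
The two passes above can be replayed in a small user-space model. This is
only a sketch under stated assumptions: the model_* names, the four-entry
ring and the addresses are made up for illustration, the IS_TSO_HEADER()
test is left out, and model_unmap() merely counts calls made with address 0
instead of invoking the DMA API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_RING_SIZE 4

/* Hypothetical stand-in for a TX buffer descriptor. */
struct model_desc {
	bool tx_ready;    /* models BD_ENET_TX_READY */
	uint32_t bufaddr; /* models cbd_bufaddr */
};

struct model_txq {
	struct model_desc ring[MODEL_RING_SIZE];
	void *skb[MODEL_RING_SIZE];
	size_t dirty_tx;
	size_t cur_tx;
	unsigned int bad_unmaps; /* unmap calls made with address 0 */
};

static size_t model_next(size_t i)
{
	return (i + 1) % MODEL_RING_SIZE;
}

static void model_unmap(struct model_txq *q, uint32_t addr)
{
	if (addr == 0)
		q->bad_unmaps++;
}

/* Same control flow as the quoted loop, reduced to the relevant steps. */
static void model_tx_clean(struct model_txq *q)
{
	size_t i = model_next(q->dirty_tx);

	while (!q->ring[i].tx_ready) {
		if (i == q->cur_tx)
			break;

		void *skb = q->skb[i];

		q->skb[i] = NULL;
		model_unmap(q, q->ring[i].bufaddr);
		q->ring[i].bufaddr = 0;
		if (!skb) {
			/* dirty_tx is NOT advanced on this path */
			i = model_next(i);
			continue;
		}
		q->dirty_tx = i;
		i = model_next(i);
	}
}

/* Descriptor A (index 1) is done and has no skb; B (index 2) is still ready. */
unsigned int model_bad_unmaps_after(unsigned int passes)
{
	struct model_txq q;

	memset(&q, 0, sizeof(q));
	q.ring[1].bufaddr = 0x1000;
	q.ring[2].tx_ready = true;
	q.ring[2].bufaddr = 0x2000;
	q.skb[2] = &q; /* any non-NULL placeholder */
	q.dirty_tx = 0;
	q.cur_tx = 3;

	while (passes--)
		model_tx_clean(&q);
	return q.bad_unmaps;
}
```

Calling model_bad_unmaps_after(1) models the first entry and
model_bad_unmaps_after(2) the re-entry; in this model the second pass is the
one that unmaps the zeroed address of descriptor A.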

Unfortunately, this does appear to happen - from a paste from Jon
Nettleton from iMX6Q:

 32. [   45.033001] unmapping this address 0x0 size 66
 33. [   45.037470] ------------[ cut here ]------------
 34. [   45.042127] WARNING: CPU: 0 PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
 35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to free DMA memory it has not a]

(where the printk at line 32 is something that was added to debug this.)

The sad thing is that the remainder of my FEC patches did go a long way
to clean up these kinds of issues in the driver (and there are /many/ of
them), but unfortunately other conflicting changes got merged before I
could finish rebasing them, so I decided to move on to other things and
discard the remainder of my patch set.  Marek showed some interest in
taking the patch set over, but I've not heard anything more - and I'm
not about to resurrect my efforts only to get into the same situation
where I'm carrying 50-odd patches which I can't merge back into mainline
without spending weeks endlessly rebasing them.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.


* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
From: Wolfgang Walter @ 2014-12-15 18:04 UTC
  To: Eric Dumazet
  Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
	Steffen Klassert
In-Reply-To: <1418429740.13491.27.camel@edumazet-glaptop2.roam.corp.google.com>

Hello Eric!

Am Freitag, 12. Dezember 2014, 16:15:40 schrieb Eric Dumazet:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
> 
> Sounds a bug in itself :(

I think the "gso disabled for interface" case is broken because:

tcp_sendmsg() sets tcp_gso_segs to zero

tcp_sendmsg() calls tcp_push_one() or __tcp_push_pending_frames()

tcp_push_one() and  __tcp_push_pending_frames() call tcp_write_xmit()

tcp_write_xmit() calls tcp_init_tso_segs()

tcp_init_tso_segs() calls tcp_set_skb_tso_segs() because !tso_segs is true

tcp_set_skb_tso_segs() creates a gso-packet if packet size is larger than
mss_now.


I think tcp_set_skb_tso_segs() assumes that its caller,
tcp_init_tso_segs(), checks if the socket supports gso.

I didn't check the other callers of tcp_init_tso_segs().
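
The suspected problem can be illustrated with a simplified model. This is a
sketch, not the kernel code: the model_* names and the can_gso parameter are
assumptions made for illustration, and the "checked" variant only shows what
a capability test at this point would change, not what any actual fix does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of interest in an skb. */
struct model_skb {
	unsigned int len;
	unsigned int gso_segs;
	unsigned int gso_size;
};

/* Segment count derived from length and MSS only, as described above. */
void model_set_tso_segs(struct model_skb *skb, unsigned int mss_now)
{
	if (skb->len <= mss_now) {
		skb->gso_segs = 1;
		skb->gso_size = 0;
	} else {
		/* round up: number of MSS-sized pieces */
		skb->gso_segs = (skb->len + mss_now - 1) / mss_now;
		skb->gso_size = mss_now;
	}
}

/* Illustrative variant that also consults a GSO capability flag. */
void model_set_tso_segs_checked(struct model_skb *skb, unsigned int mss_now,
				bool can_gso)
{
	if (!can_gso) {
		skb->gso_segs = 1;
		skb->gso_size = 0;
		return;
	}
	model_set_tso_segs(skb, mss_now);
}
```

With len = 3000 and mss_now = 1448 the unchecked variant marks the packet
as three segments regardless of whether the interface can do gso, which is
the behaviour the message suspects.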


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


* Re: [PATCH] ioc3: fix incorrect use of htons/ntohs
From: Ralf Baechle @ 2014-12-15 18:14 UTC
  To: Ben Hutchings; +Cc: Lino Sanfilippo, linux-mips, netdev, linux-kernel
In-Reply-To: <1417406976.7215.126.camel@decadent.org.uk>

On Mon, Dec 01, 2014 at 04:09:36AM +0000, Ben Hutchings wrote:

> >  	/* Same as tx - compute csum of pseudo header  */
> >  	csum = hwsum +
> > -	       (ih->tot_len - (ih->ihl << 2)) +
> > -	       htons((uint16_t)ih->protocol) +
> > +	       (ih->tot_len - (ih->ihl << 2)) + ih->protocol +
> >  	       (ih->saddr >> 16) + (ih->saddr & 0xffff) +
> >  	       (ih->daddr >> 16) + (ih->daddr & 0xffff);
> >
> 
> The pseudo-header is specified as:
> 
>                      +--------+--------+--------+--------+
>                      |           Source Address          |
>                      +--------+--------+--------+--------+
>                      |         Destination Address       |
>                      +--------+--------+--------+--------+
>                      |  zero  |  PTCL  |    TCP Length   |
>                      +--------+--------+--------+--------+
> 
> The current code zero-extends the protocol number to produce the 5th
> 16-bit word of the pseudo-header, then uses htons() to put it in
> big-endian order, consistent with the other fields.  (Yes, it's doing
> addition on big-endian words; this works even on little-endian machines
> due to the way the checksum is specified.)
> 
> The driver should not be doing this at all, though.  It should set
> skb->csum = hwsum; skb->ip_summed = CHECKSUM_COMPLETE; and let the
> network stack adjust the hardware checksum.

Really?  The IOC3 isn't exactly the smartest NIC around; it does add up
everything and the kitchen sink, that is ethernet headers, IP headers and
on RX the frame's trailing CRC.  All that needs to be subtracted in software
which is what this does.  I think other NICs are all smarter and don't
need this particular piece of magic.

I agree with your other comment wrt. htons().
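
The byte-order remark in the quoted text can be checked with a small
user-space sketch. It is not driver code: the function names are made up
for illustration, and it only demonstrates the general property of the
ones' complement sum that adding 16-bit words in the "wrong" byte order
yields the byte-swapped result.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back in until the sum fits in 16 bits. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum with each 16-bit word read as big-endian. */
uint16_t csum_words_be(const uint8_t *p, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += (uint32_t)p[2 * i] << 8 | p[2 * i + 1];
	return csum_fold16(sum);
}

/* The same sum with each 16-bit word read as little-endian. */
uint16_t csum_words_le(const uint8_t *p, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += (uint32_t)p[2 * i + 1] << 8 | p[2 * i];
	return csum_fold16(sum);
}

uint16_t csum_swap16(uint16_t v)
{
	return (uint16_t)(v << 8 | v >> 8);
}
```

For any input, csum_words_le() equals csum_swap16(csum_words_be()), so a
machine can add the words as they sit in memory and still end up with the
correct bytes in the checksum field.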

  Ralf


* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-15 18:36 UTC
  To: John Fastabend, netdev@vger.kernel.org
  Cc: Jamal Hadi Salim, Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <548F20FF.9080508@gmail.com>



> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@gmail.com]
> Sent: Monday, December 15, 2014 7:57 PM
> To: Arad, Ronen
> Cc: Jamal Hadi Salim; Roopa Prabhu; Jiri Pirko; netdev@vger.kernel.org;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/2014 09:25 AM, Arad, Ronen wrote:
> >
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of Jamal Hadi Salim
> >> Sent: Monday, December 15, 2014 5:26 PM
> >> To: Roopa Prabhu; Jiri Pirko
> >> Cc: sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> john.fastabend@gmail.com; stephen@networkplumber.org;
> >> linville@tuxdriver.com; vyasevic@redhat.com; netdev@vger.kernel.org;
> >> davem@davemloft.net; shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >>
> >> Sorry - I didn't quite follow the discussion, but I can see the value
> >> of propagating things from parent to children netdevs as part of the
> >> generic approach. And in that spirit:
> >>
> >> Ben's patches (and I am sure the cumulus folk do this) expose ports.
> >> i.e. you boot up the hardware and you see ports. You can then put
> >> these ports in a bridge and you can offload fdbs and do other
> >> parametrization to the ASIC. IOW, this only becomes a bridge because
> >> you created one in the kernel and attached bridge ports to it.
> >>
> >> Let's say I didn't want a bridge. I want instead to take these exposed
> >> ports and create a bond (and maybe play with LACP). How does this
> >> propagation from parent->child->child work then? I think the idea of
> >> just bonding and not exposing it as a switch is a reasonable use case.
> >
> 
> > Are you saying that the software should reflect the same functionality
> > the HW provides?
> > In other words is creating a bridge device mandatory for supporting
> > standard VLAN-bridging (as in IEEE 802.1Q) in the HW?
> 
> No it shouldn't be mandatory. And I don't think it is at the moment.
> Users are free to manage the FDB tables directly via netlink or configure the
> software bridge to sync them. This seems like a good model to follow to me
> and we should try to get as many features as it makes sense to follow this
> model.
> 
> > VLAN-bridging including port VLAN membership, VLAN filtering, PVID,
> > Egress un-tagging could be supported without an explicit bridge device
> > when port devices implement bridge ndos (ndo_bridge_{set,del,get}link).
> > What is lost is the ability to have common handling of VLAN-aware FDB
> > in the bridge module.
> 
> not sure what is lost here? It seems the SW bridge does (or at least
> could) support the same vlan capabilities. And the bridge could push these
> into hardware when Roopa's offload bit is set. Or if users want to manage
> everything outside bridge calling the ndo_bridge_ ops directly works as well.
> By the way I believe the software linux bridge supports most of those
> features you listed today. If we missed something we can add it.
> 
> > Do we expect switch port devices to support L2 functionality both ways
> > (with and without an explicit bridge device)?
> 
> My opinion yes. But in fact the driver shouldn't care what is driving it. The
> paths should be the same for direct user manipulation via netlink and
> SELF|MASTER bit, bridge module, or any other in-kernel sub-system.
> 
The behavior of a driver could depend on the presence of a bridge and on
features such as FDB LEARNING and LEARNING_SYNC.
A switch port driver which is not enslaved to a bridge might need to
implement a VLAN-aware FDB within the driver and report its content to
user-space using ndo_fdb_dump.
A switch port driver which is enslaved to a bridge could make do with only
passing static FDB configuration through to the HW when LEARNING_SYNC is
configured. FDB reporting to user-space and soft aging are left to the
bridge module FDB.
Such a driver, without LEARNING_SYNC, could still avoid maintaining an
in-driver FDB as long as it could dump the HW FDB on demand.
LEARNING_SYNC also requires periodic updates of freshness information from
the driver to the bridge module.

> > Will the decision about using a bridge device or avoiding it be left
> > to the end-user?
> 
> It's a user policy decision. Again the offload bit gets us this in a reasonably
> configurable way IMO.
> 
> > (This requires switch port drivers to be able to work and provide
> > similar functionality in both setups).
> 
> Right, but if the drivers "care" who is calling their ndo ops something is
> seriously broken. For the driver it should not need to know anything about
> the callers so it doesn't matter to the driver if its a netlink call from user
> space or an internal call from bridge.ko

LEARNING_SYNC only makes sense when a switch port driver is enslaved to a
bridge. The Rocker switch driver indeed monitors upper change notifications
and keeps track of master bridge presence. So bridge presence is not
transparent.
> 
> > I think that we need to outline the handling of L3 as it could
> > determine or at least impact some of the answers to my above questions.
> 
> L3 should follow the same model. Admittedly I've not worked through the
> L3 cases closely but I don't see why we can't apply the same model.
> And maybe this is where we need to introduce a container to hold some
> state as Jamal says. The easiest way to see this will be to look at some
> proposed code.
> 
> 
> > cheers,
> > ronen
> >
> >> Also how does it work when I start doing L3 and the bond's port
> >> doesn't support L3? Is it time to revive the thing we called TheThing in Du?
> >>
> >> cheers,
> >> jamal
> >>
> >> On 12/14/14 14:41, Roopa Prabhu wrote:
> >>> On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> >>
> >> [..chopped off for brevity and saving electrons..]
> >>
> >> cheers,
> >> jamal
> 
> 
> --
> John Fastabend         Intel Corporation


* [PATCH 3.13.y-ckt 60/96] net/ping: handle protocol mismatching scenario
From: Kamal Mostafa @ 2014-12-15 19:26 UTC
  To: linux-kernel, stable, kernel-team
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, Jane Zhou, Yiwei Zhao,
	Kamal Mostafa
In-Reply-To: <1418671616-25482-1-git-send-email-kamal@canonical.com>

3.13.11-ckt13 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jane Zhou <a17711@motorola.com>

commit 91a0b603469069cdcce4d572b7525ffc9fd352a6 upstream.

ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
don't match. For example, if sk_buff's protocol is ETH_P_IPV6 but sock's
sk_family is AF_INET, and sk->sk_bound_dev_if is zero, a wrong sock
will be returned.
The fix is to "continue" the search; if nothing matches, return NULL.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Jane Zhou <a17711@motorola.com>
Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 net/ipv4/ping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 3ef2919..8e0f65c 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -213,6 +213,8 @@ static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
 					     &ipv6_hdr(skb)->daddr))
 				continue;
 #endif
+		} else {
+			continue;
 		}
 
 		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
-- 
1.9.1
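
The effect of the added "else" branch can be sketched in user space. This
is a simplified model under stated assumptions: the model_* types and the
with_fix switch are made up for illustration, the address comparisons are
left out, and a plain array stands in for the hash chain that ping_lookup()
walks.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_family { MODEL_AF_INET, MODEL_AF_INET6 };
enum model_proto { MODEL_ETH_P_IP, MODEL_ETH_P_IPV6 };

/* Hypothetical stand-in for the socket fields the lookup inspects. */
struct model_sock {
	enum model_family family;
	unsigned short ident;
	int bound_dev_if;
};

const struct model_sock *
model_ping_lookup(const struct model_sock *socks, size_t n,
		  enum model_proto proto, unsigned short ident, int dif,
		  bool with_fix)
{
	for (size_t i = 0; i < n; i++) {
		const struct model_sock *sk = &socks[i];

		if (sk->ident != ident)
			continue;

		if (proto == MODEL_ETH_P_IP && sk->family == MODEL_AF_INET) {
			/* IPv4 address comparison omitted in this model */
		} else if (proto == MODEL_ETH_P_IPV6 &&
			   sk->family == MODEL_AF_INET6) {
			/* IPv6 address comparison omitted in this model */
		} else if (with_fix) {
			/* protocol/family mismatch: keep searching */
			continue;
		}

		if (sk->bound_dev_if && sk->bound_dev_if != dif)
			continue;

		return sk;
	}
	return NULL;
}
```

With an AF_INET socket listed ahead of an AF_INET6 socket using the same
ident, an IPv6 lookup without the fix falls through to the device check and
returns the AF_INET socket; with the fix it skips that entry and returns
the AF_INET6 one.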


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Yuchung Cheng @ 2014-12-15 19:56 UTC
  To: Blake Matheny
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Martin Lau,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <D0B44739.74E8A%bmatheny@fb.com>

On Mon, Dec 15, 2014 at 8:08 AM, Blake Matheny <bmatheny@fb.com> wrote:
>
> We have an additional set of patches for web10g that builds on these
> tracepoints. It can be made to work either way, but I agree the idea of
> something like a sockopt would be really nice.

I'd like to compare these patches with tools that parse pcap files to
generate per-flow counters to collect RTTs, #dupacks, etc. What
additional values or insights do they provide to improve/debug TCP
performance? Maybe an example?

IMO these stats provide a general picture of how TCP works on a
specific network, but not enough to really nail specific bugs in the TCP
protocol or implementation. In that case, SNMP stats or sampled pcap
traces with offline analysis can achieve the same purpose.

>
>
> -Blake
>
> On 12/15/14, 8:03 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>
> >On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
> >
> >> I think patches 1 and 3 are good additions, since they establish
> >> few permanent points of instrumentation in tcp stack.
> >> Patches 4-5 look more like use cases of tracepoints established
> >> before. They may feel like simple additions and, no doubt,
> >> they are useful, but since they expose things via tracing
> >> infra they become part of api and cannot be changed later,
> >> when more stats would be needed.
> >> I think systemtap like scripting on top of patches 1 and 3
> >> should solve your use case ?
> >> Also, have you looked at recent eBPF work?
> >> Though it's not completely ready yet, soon it should
> >> be able to do the same stats collection as you have
> >> in 4/5 without adding permanent pieces to the kernel.
> >
> >So it looks like web10g like interfaces are very often requested by
> >various teams.
> >
> >And we have many different views on how to hack this. I am astonished by
> >number of hacks I saw about this stuff going on.
> >
> >What about a clean way, extending current TCP_INFO, which is both
> >available as a getsockopt() for socket owners and ss/iproute2
> >information for 'external entities'
> >
> >If we consider web10g info needed, then adding a ftrace/eBPF like
> >interface is simply yet another piece of code we need to maintain,
> >and the argument of 'this should cost nothing if not activated' is
> >nonsense since major players need to constantly monitor TCP metrics and
> >behavior.
> >
> >It seems both FaceBook and Google are working on a subset of web10g.
> >
> >I suggest we meet together and establish a common ground, preferably
> >after Christmas holidays.
> >
> >Thanks
> >
> >
>


* Re: [PATCH] ioc3: fix incorrect use of htons/ntohs
From: Ben Hutchings @ 2014-12-15 21:09 UTC
  To: Ralf Baechle; +Cc: Lino Sanfilippo, linux-mips, netdev, linux-kernel
In-Reply-To: <20141215181444.GD26674@linux-mips.org>

On Mon, 2014-12-15 at 19:14 +0100, Ralf Baechle wrote:
> On Mon, Dec 01, 2014 at 04:09:36AM +0000, Ben Hutchings wrote:
> 
> > >  	/* Same as tx - compute csum of pseudo header  */
> > >  	csum = hwsum +
> > > -	       (ih->tot_len - (ih->ihl << 2)) +
> > > -	       htons((uint16_t)ih->protocol) +
> > > +	       (ih->tot_len - (ih->ihl << 2)) + ih->protocol +
> > >  	       (ih->saddr >> 16) + (ih->saddr & 0xffff) +
> > >  	       (ih->daddr >> 16) + (ih->daddr & 0xffff);
> > >
> > 
> > The pseudo-header is specified as:
> > 
> >                      +--------+--------+--------+--------+
> >                      |           Source Address          |
> >                      +--------+--------+--------+--------+
> >                      |         Destination Address       |
> >                      +--------+--------+--------+--------+
> >                      |  zero  |  PTCL  |    TCP Length   |
> >                      +--------+--------+--------+--------+
> > 
> > The current code zero-extends the protocol number to produce the 5th
> > 16-bit word of the pseudo-header, then uses htons() to put it in
> > big-endian order, consistent with the other fields.  (Yes, it's doing
> > addition on big-endian words; this works even on little-endian machines
> > due to the way the checksum is specified.)
> > 
> > The driver should not be doing this at all, though.  It should set
> > skb->csum = hwsum; skb->ip_summed = CHECKSUM_COMPLETE; and let the
> > network stack adjust the hardware checksum.
> 
> Really?  The IOC3 isn't exactly the smartest NIC around; it does add up
> everything and the kitchen sink, that is ethernet headers, IP headers and
> on RX the frame's trailing CRC.

That is almost exactly what CHECKSUM_COMPLETE means on receive; only the
CRC would need to be subtracted.  Then the driver can validate
{TCP,UDP}/IPv{4,6} checksums without any header parsing.

> All that needs to be subtracted in software
> which is what this does.  I think other NICs are all smarter and don't
> need this particular piece of magic.

It may not be smart, but that allows it to cover more cases than most
smart network controllers!

On transmit, the driver should:
- Calculate the partial checksum of data up to offset csum_start
- Subtract this from the checksum at offset (csum_start + csum_offset)
It should set the NETIF_F_GEN_CSUM feature flag rather than
NETIF_F_IP_CSUM.  Then it will be able to generate {TCP,UDP}/IPv{4,6}
checksums.
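
The "subtract the partial checksum" step can be sketched in user space. This
is an illustration only, with made-up function names, and it assumes the
leading part has an even number of bytes (as a 14-byte Ethernet header
does). In ones' complement arithmetic, subtracting a value is the same as
adding its bitwise complement.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold carries back in until the sum fits in 16 bits. */
static uint16_t oc_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum over nwords big-endian 16-bit words. */
uint16_t oc_sum_be(const uint8_t *p, size_t nwords)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < nwords; i++)
		sum += (uint32_t)p[2 * i] << 8 | p[2 * i + 1];
	return oc_fold(sum);
}

/* Remove a partial sum from a total: total + ~part, then fold. */
uint16_t oc_sub(uint16_t total, uint16_t part)
{
	return oc_fold((uint32_t)total + (uint16_t)~part);
}
```

Given a sum over a whole buffer and a sum over its first words, oc_sub()
yields the sum over the remaining words, without re-reading them. Note that
0x0000 and 0xffff both represent zero in this arithmetic.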

Ben.

> I agree with your other comment wrt. htons().


-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.


* [PATCH 1/3] vringh: 64 bit features
From: Michael S. Tsirkin @ 2014-12-15 21:14 UTC
  To: linux-kernel; +Cc: netdev, kvm, virtualization
In-Reply-To: <1418678019-31629-1-git-send-email-mst@redhat.com>

Pass u64 everywhere.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/vringh.h | 4 ++--
 drivers/vhost/vringh.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index 749cde2..f696dd0 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -105,7 +105,7 @@ struct vringh_kiov {
 #define VRINGH_IOV_ALLOCATED 0x8000000
 
 /* Helpers for userspace vrings. */
-int vringh_init_user(struct vringh *vrh, u32 features,
+int vringh_init_user(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc __user *desc,
 		     struct vring_avail __user *avail,
@@ -167,7 +167,7 @@ bool vringh_notify_enable_user(struct vringh *vrh);
 void vringh_notify_disable_user(struct vringh *vrh);
 
 /* Helpers for kernelspace vrings. */
-int vringh_init_kern(struct vringh *vrh, u32 features,
+int vringh_init_kern(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc *desc,
 		     struct vring_avail *avail,
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 5174eba..ac3fe27 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -577,7 +577,7 @@ static inline int xfer_to_user(void *dst, void *src, size_t len)
  * Returns an error if num is invalid: you should check pointers
  * yourself!
  */
-int vringh_init_user(struct vringh *vrh, u32 features,
+int vringh_init_user(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc __user *desc,
 		     struct vring_avail __user *avail,
@@ -836,7 +836,7 @@ static inline int xfer_kern(void *src, void *dst, size_t len)
  *
  * Returns an error if num is invalid.
  */
-int vringh_init_kern(struct vringh *vrh, u32 features,
+int vringh_init_kern(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc *desc,
 		     struct vring_avail *avail,
-- 
MST
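
Why the width matters can be shown with a short user-space sketch. The
model_* names are made up for illustration; the bit numbers are the ones
virtio uses for VIRTIO_RING_F_INDIRECT_DESC (28) and VIRTIO_F_VERSION_1
(32). Any feature bit at position 32 or above is silently dropped when the
mask passes through a 32-bit parameter.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_F_INDIRECT_DESC 28 /* VIRTIO_RING_F_INDIRECT_DESC */
#define MODEL_F_VERSION_1     32 /* VIRTIO_F_VERSION_1 */

/* Test one feature bit in a 64-bit feature mask. */
bool model_has_feature(uint64_t features, unsigned int bit)
{
	return bit < 64 && ((features >> bit) & 1);
}

/* Models handing the mask to a function that takes u32 features. */
uint64_t model_pass_through_u32(uint64_t features)
{
	uint32_t narrowed = (uint32_t)features; /* upper 32 bits are lost */

	return narrowed;
}
```

A mask with bits 28 and 32 set keeps bit 28 but loses bit 32 after
model_pass_through_u32(), so a callee with a 32-bit parameter could not
see that virtio 1.0 was negotiated.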


* [PATCH 2/3] vringh: initial virtio 1.0 support
From: Michael S. Tsirkin @ 2014-12-15 21:14 UTC
  To: linux-kernel; +Cc: netdev, kvm, virtualization
In-Reply-To: <1418678019-31629-1-git-send-email-mst@redhat.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/vringh.h |  33 ++++++++++++++
 drivers/vhost/vringh.c | 121 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 107 insertions(+), 47 deletions(-)

diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index f696dd0..a3fa537 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -24,12 +24,16 @@
 #ifndef _LINUX_VRINGH_H
 #define _LINUX_VRINGH_H
 #include <uapi/linux/virtio_ring.h>
+#include <linux/virtio_byteorder.h>
 #include <linux/uio.h>
 #include <linux/slab.h>
 #include <asm/barrier.h>
 
 /* virtio_ring with information needed for host access. */
 struct vringh {
+	/* Everything is little endian */
+	bool little_endian;
+
 	/* Guest publishes used event idx (note: we always do). */
 	bool event_indices;
 
@@ -222,4 +226,33 @@ static inline void vringh_notify(struct vringh *vrh)
 		vrh->notify(vrh);
 }
 
+static inline u16 vringh16_to_cpu(const struct vringh *vrh, __virtio16 val)
+{
+	return __virtio16_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio16 cpu_to_vringh16(const struct vringh *vrh, u16 val)
+{
+	return __cpu_to_virtio16(vrh->little_endian, val);
+}
+
+static inline u32 vringh32_to_cpu(const struct vringh *vrh, __virtio32 val)
+{
+	return __virtio32_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio32 cpu_to_vringh32(const struct vringh *vrh, u32 val)
+{
+	return __cpu_to_virtio32(vrh->little_endian, val);
+}
+
+static inline u64 vringh64_to_cpu(const struct vringh *vrh, __virtio64 val)
+{
+	return __virtio64_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio64 cpu_to_vringh64(const struct vringh *vrh, u64 val)
+{
+	return __cpu_to_virtio64(vrh->little_endian, val);
+}
 #endif /* _LINUX_VRINGH_H */
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index ac3fe27..3bb02c6 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -11,6 +11,7 @@
 #include <linux/uaccess.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <uapi/linux/virtio_config.h>
 
 static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 {
@@ -28,13 +29,14 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 
 /* Returns vring->num if empty, -ve on error. */
 static inline int __vringh_get_head(const struct vringh *vrh,
-				    int (*getu16)(u16 *val, const u16 *p),
+				    int (*getu16)(const struct vringh *vrh,
+						  u16 *val, const __virtio16 *p),
 				    u16 *last_avail_idx)
 {
 	u16 avail_idx, i, head;
 	int err;
 
-	err = getu16(&avail_idx, &vrh->vring.avail->idx);
+	err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
 	if (err) {
 		vringh_bad("Failed to access avail idx at %p",
 			   &vrh->vring.avail->idx);
@@ -49,7 +51,7 @@ static inline int __vringh_get_head(const struct vringh *vrh,
 
 	i = *last_avail_idx & (vrh->vring.num - 1);
 
-	err = getu16(&head, &vrh->vring.avail->ring[i]);
+	err = getu16(vrh, &head, &vrh->vring.avail->ring[i]);
 	if (err) {
 		vringh_bad("Failed to read head: idx %d address %p",
 			   *last_avail_idx, &vrh->vring.avail->ring[i]);
@@ -144,28 +146,32 @@ static inline bool no_range_check(struct vringh *vrh, u64 addr, size_t *len,
 }
 
 /* No reason for this code to be inline. */
-static int move_to_indirect(int *up_next, u16 *i, void *addr,
+static int move_to_indirect(const struct vringh *vrh,
+			    int *up_next, u16 *i, void *addr,
 			    const struct vring_desc *desc,
 			    struct vring_desc **descs, int *desc_max)
 {
+	u32 len;
+
 	/* Indirect tables can't have indirect. */
 	if (*up_next != -1) {
 		vringh_bad("Multilevel indirect %u->%u", *up_next, *i);
 		return -EINVAL;
 	}
 
-	if (unlikely(desc->len % sizeof(struct vring_desc))) {
+	len = vringh32_to_cpu(vrh, desc->len);
+	if (unlikely(len % sizeof(struct vring_desc))) {
 		vringh_bad("Strange indirect len %u", desc->len);
 		return -EINVAL;
 	}
 
 	/* We will check this when we follow it! */
-	if (desc->flags & VRING_DESC_F_NEXT)
-		*up_next = desc->next;
+	if (desc->flags & cpu_to_vringh16(vrh, VRING_DESC_F_NEXT))
+		*up_next = vringh16_to_cpu(vrh, desc->next);
 	else
 		*up_next = -2;
 	*descs = addr;
-	*desc_max = desc->len / sizeof(struct vring_desc);
+	*desc_max = len / sizeof(struct vring_desc);
 
 	/* Now, start at the first indirect. */
 	*i = 0;
@@ -287,22 +293,25 @@ __vringh_iov(struct vringh *vrh, u16 i,
 		if (unlikely(err))
 			goto fail;
 
-		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+		if (unlikely(desc.flags &
+			     cpu_to_vringh16(vrh, VRING_DESC_F_INDIRECT))) {
+			u64 a = vringh64_to_cpu(vrh, desc.addr);
+
 			/* Make sure it's OK, and get offset. */
-			len = desc.len;
-			if (!rcheck(vrh, desc.addr, &len, &range, getrange)) {
+			len = vringh32_to_cpu(vrh, desc.len);
+			if (!rcheck(vrh, a, &len, &range, getrange)) {
 				err = -EINVAL;
 				goto fail;
 			}
 
-			if (unlikely(len != desc.len)) {
+			if (unlikely(len != vringh32_to_cpu(vrh, desc.len))) {
 				slow = true;
 				/* We need to save this range to use offset */
 				slowrange = range;
 			}
 
-			addr = (void *)(long)(desc.addr + range.offset);
-			err = move_to_indirect(&up_next, &i, addr, &desc,
+			addr = (void *)(long)(a + range.offset);
+			err = move_to_indirect(vrh, &up_next, &i, addr, &desc,
 					       &descs, &desc_max);
 			if (err)
 				goto fail;
@@ -315,7 +324,7 @@ __vringh_iov(struct vringh *vrh, u16 i,
 			goto fail;
 		}
 
-		if (desc.flags & VRING_DESC_F_WRITE)
+		if (desc.flags & cpu_to_vringh16(vrh, VRING_DESC_F_WRITE))
 			iov = wiov;
 		else {
 			iov = riov;
@@ -336,12 +345,14 @@ __vringh_iov(struct vringh *vrh, u16 i,
 
 	again:
 		/* Make sure it's OK, and get offset. */
-		len = desc.len;
-		if (!rcheck(vrh, desc.addr, &len, &range, getrange)) {
+		len = vringh32_to_cpu(vrh, desc.len);
+		if (!rcheck(vrh, vringh64_to_cpu(vrh, desc.addr), &len, &range,
+			    getrange)) {
 			err = -EINVAL;
 			goto fail;
 		}
-		addr = (void *)(unsigned long)(desc.addr + range.offset);
+		addr = (void *)(unsigned long)(vringh64_to_cpu(vrh, desc.addr) +
+					       range.offset);
 
 		if (unlikely(iov->used == (iov->max_num & ~VRINGH_IOV_ALLOCATED))) {
 			err = resize_iovec(iov, gfp);
@@ -353,14 +364,16 @@ __vringh_iov(struct vringh *vrh, u16 i,
 		iov->iov[iov->used].iov_len = len;
 		iov->used++;
 
-		if (unlikely(len != desc.len)) {
-			desc.len -= len;
-			desc.addr += len;
+		if (unlikely(len != vringh32_to_cpu(vrh, desc.len))) {
+			desc.len = cpu_to_vringh32(vrh,
+				   vringh32_to_cpu(vrh, desc.len) - len);
+			desc.addr = cpu_to_vringh64(vrh,
+				    vringh64_to_cpu(vrh, desc.addr) + len);
 			goto again;
 		}
 
-		if (desc.flags & VRING_DESC_F_NEXT) {
-			i = desc.next;
+		if (desc.flags & cpu_to_vringh16(vrh, VRING_DESC_F_NEXT)) {
+			i = vringh16_to_cpu(vrh, desc.next);
 		} else {
 			/* Just in case we need to finish traversing above. */
 			if (unlikely(up_next > 0)) {
@@ -387,7 +400,8 @@ fail:
 static inline int __vringh_complete(struct vringh *vrh,
 				    const struct vring_used_elem *used,
 				    unsigned int num_used,
-				    int (*putu16)(u16 *p, u16 val),
+				    int (*putu16)(const struct vringh *vrh,
+						  __virtio16 *p, u16 val),
 				    int (*putused)(struct vring_used_elem *dst,
 						   const struct vring_used_elem
 						   *src, unsigned num))
@@ -420,7 +434,7 @@ static inline int __vringh_complete(struct vringh *vrh,
 	/* Make sure buffer is written before we update index. */
 	virtio_wmb(vrh->weak_barriers);
 
-	err = putu16(&vrh->vring.used->idx, used_idx + num_used);
+	err = putu16(vrh, &vrh->vring.used->idx, used_idx + num_used);
 	if (err) {
 		vringh_bad("Failed to update used index at %p",
 			   &vrh->vring.used->idx);
@@ -433,7 +447,9 @@ static inline int __vringh_complete(struct vringh *vrh,
 
 
 static inline int __vringh_need_notify(struct vringh *vrh,
-				       int (*getu16)(u16 *val, const u16 *p))
+				       int (*getu16)(const struct vringh *vrh,
+						     u16 *val,
+						     const __virtio16 *p))
 {
 	bool notify;
 	u16 used_event;
@@ -447,7 +463,7 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 	/* Old-style, without event indices. */
 	if (!vrh->event_indices) {
 		u16 flags;
-		err = getu16(&flags, &vrh->vring.avail->flags);
+		err = getu16(vrh, &flags, &vrh->vring.avail->flags);
 		if (err) {
 			vringh_bad("Failed to get flags at %p",
 				   &vrh->vring.avail->flags);
@@ -457,7 +473,7 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 	}
 
 	/* Modern: we know when other side wants to know. */
-	err = getu16(&used_event, &vring_used_event(&vrh->vring));
+	err = getu16(vrh, &used_event, &vring_used_event(&vrh->vring));
 	if (err) {
 		vringh_bad("Failed to get used event idx at %p",
 			   &vring_used_event(&vrh->vring));
@@ -478,20 +494,22 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 }
 
 static inline bool __vringh_notify_enable(struct vringh *vrh,
-					  int (*getu16)(u16 *val, const u16 *p),
-					  int (*putu16)(u16 *p, u16 val))
+					  int (*getu16)(const struct vringh *vrh,
+							u16 *val, const __virtio16 *p),
+					  int (*putu16)(const struct vringh *vrh,
+							__virtio16 *p, u16 val))
 {
 	u16 avail;
 
 	if (!vrh->event_indices) {
 		/* Old-school; update flags. */
-		if (putu16(&vrh->vring.used->flags, 0) != 0) {
+		if (putu16(vrh, &vrh->vring.used->flags, 0) != 0) {
 			vringh_bad("Clearing used flags %p",
 				   &vrh->vring.used->flags);
 			return true;
 		}
 	} else {
-		if (putu16(&vring_avail_event(&vrh->vring),
+		if (putu16(vrh, &vring_avail_event(&vrh->vring),
 			   vrh->last_avail_idx) != 0) {
 			vringh_bad("Updating avail event index %p",
 				   &vring_avail_event(&vrh->vring));
@@ -503,7 +521,7 @@ static inline bool __vringh_notify_enable(struct vringh *vrh,
 	 * sure it's written, then check again. */
 	virtio_mb(vrh->weak_barriers);
 
-	if (getu16(&avail, &vrh->vring.avail->idx) != 0) {
+	if (getu16(vrh, &avail, &vrh->vring.avail->idx) != 0) {
 		vringh_bad("Failed to check avail idx at %p",
 			   &vrh->vring.avail->idx);
 		return true;
@@ -516,11 +534,13 @@ static inline bool __vringh_notify_enable(struct vringh *vrh,
 }
 
 static inline void __vringh_notify_disable(struct vringh *vrh,
-					   int (*putu16)(u16 *p, u16 val))
+					   int (*putu16)(const struct vringh *vrh,
+							 __virtio16 *p, u16 val))
 {
 	if (!vrh->event_indices) {
 		/* Old-school; update flags. */
-		if (putu16(&vrh->vring.used->flags, VRING_USED_F_NO_NOTIFY)) {
+		if (putu16(vrh, &vrh->vring.used->flags,
+			   VRING_USED_F_NO_NOTIFY)) {
 			vringh_bad("Setting used flags %p",
 				   &vrh->vring.used->flags);
 		}
@@ -528,14 +548,18 @@ static inline void __vringh_notify_disable(struct vringh *vrh,
 }
 
 /* Userspace access helpers: in this case, addresses are really userspace. */
-static inline int getu16_user(u16 *val, const u16 *p)
+static inline int getu16_user(const struct vringh *vrh, u16 *val, const __virtio16 *p)
 {
-	return get_user(*val, (__force u16 __user *)p);
+	__virtio16 v = 0;
+	int rc = get_user(v, (__force __virtio16 __user *)p);
+	*val = vringh16_to_cpu(vrh, v);
+	return rc;
 }
 
-static inline int putu16_user(u16 *p, u16 val)
+static inline int putu16_user(const struct vringh *vrh, __virtio16 *p, u16 val)
 {
-	return put_user(val, (__force u16 __user *)p);
+	__virtio16 v = cpu_to_vringh16(vrh, val);
+	return put_user(v, (__force __virtio16 __user *)p);
 }
 
 static inline int copydesc_user(void *dst, const void *src, size_t len)
@@ -589,6 +613,7 @@ int vringh_init_user(struct vringh *vrh, u64 features,
 		return -EINVAL;
 	}
 
+	vrh->little_endian = (features & (1ULL << VIRTIO_F_VERSION_1));
 	vrh->event_indices = (features & (1 << VIRTIO_RING_F_EVENT_IDX));
 	vrh->weak_barriers = weak_barriers;
 	vrh->completed = 0;
@@ -729,8 +754,8 @@ int vringh_complete_user(struct vringh *vrh, u16 head, u32 len)
 {
 	struct vring_used_elem used;
 
-	used.id = head;
-	used.len = len;
+	used.id = cpu_to_vringh32(vrh, head);
+	used.len = cpu_to_vringh32(vrh, len);
 	return __vringh_complete(vrh, &used, 1, putu16_user, putused_user);
 }
 EXPORT_SYMBOL(vringh_complete_user);
@@ -792,15 +817,16 @@ int vringh_need_notify_user(struct vringh *vrh)
 EXPORT_SYMBOL(vringh_need_notify_user);
 
 /* Kernelspace access helpers. */
-static inline int getu16_kern(u16 *val, const u16 *p)
+static inline int getu16_kern(const struct vringh *vrh,
+			      u16 *val, const __virtio16 *p)
 {
-	*val = ACCESS_ONCE(*p);
+	*val = vringh16_to_cpu(vrh, ACCESS_ONCE(*p));
 	return 0;
 }
 
-static inline int putu16_kern(u16 *p, u16 val)
+static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 val)
 {
-	ACCESS_ONCE(*p) = val;
+	ACCESS_ONCE(*p) = cpu_to_vringh16(vrh, val);
 	return 0;
 }
 
@@ -848,6 +874,7 @@ int vringh_init_kern(struct vringh *vrh, u64 features,
 		return -EINVAL;
 	}
 
+	vrh->little_endian = (features & (1ULL << VIRTIO_F_VERSION_1));
 	vrh->event_indices = (features & (1 << VIRTIO_RING_F_EVENT_IDX));
 	vrh->weak_barriers = weak_barriers;
 	vrh->completed = 0;
@@ -962,8 +989,8 @@ int vringh_complete_kern(struct vringh *vrh, u16 head, u32 len)
 {
 	struct vring_used_elem used;
 
-	used.id = head;
-	used.len = len;
+	used.id = cpu_to_vringh32(vrh, head);
+	used.len = cpu_to_vringh32(vrh, len);
 
 	return __vringh_complete(vrh, &used, 1, putu16_kern, putused_kern);
 }
-- 
MST
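
For readers unfamiliar with the helpers this patch relies on, here is a
minimal userspace sketch of the conversion idea: a ring negotiated with
VIRTIO_F_VERSION_1 is little-endian, while a legacy ring keeps the native
byte order. All demo_* names are hypothetical stand-ins, not the kernel's
vringh16_to_cpu()/cpu_to_vringh16() implementations, and only the 16-bit
case is shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field of struct vringh used here. */
struct demo_vringh {
	bool little_endian;	/* set when VIRTIO_F_VERSION_1 is negotiated */
};

/* True if the host CPU stores the least significant byte first. */
static bool demo_host_is_le(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 1;
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t demo_swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Ring value -> CPU value: swap only for an LE ring on a BE host. */
static uint16_t demo_vringh16_to_cpu(const struct demo_vringh *vrh, uint16_t v)
{
	if (vrh->little_endian && !demo_host_is_le())
		return demo_swab16(v);
	return v;	/* legacy ring: native byte order, no conversion */
}

/* CPU value -> ring value: a byte swap is its own inverse. */
static uint16_t demo_cpu_to_vringh16(const struct demo_vringh *vrh, uint16_t v)
{
	return demo_vringh16_to_cpu(vrh, v);
}
```

The patch tests flags as desc.flags & cpu_to_vringh16(vrh, VRING_DESC_F_NEXT),
converting the constant rather than the field; with helpers shaped like the
ones above, both forms give the same answer.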


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Tom Herbert @ 2014-12-15 22:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <548F0F62.7080704@fb.com>

On Mon, Dec 15, 2014 at 8:42 AM, Josef Bacik <jbacik@fb.com> wrote:
> On 12/15/2014 11:03 AM, Eric Dumazet wrote:
>>
>> On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>>
>>> I think patches 1 and 3 are good additions, since they establish
>>> few permanent points of instrumentation in tcp stack.
>>> Patches 4-5 look more like use cases of tracepoints established
>>> before. They may feel like simple additions and, no doubt,
>>> they are useful, but since they expose things via tracing
>>> infra they become part of api and cannot be changed later,
>>> when more stats would be needed.
>>> I think systemtap like scripting on top of patches 1 and 3
>>> should solve your use case ?
>>> Also, have you looked at recent eBPF work?
>>> Though it's not completely ready yet, soon it should
>>> be able to do the same stats collection as you have
>>> in 4/5 without adding permanent pieces to the kernel.
>>
>>
>> So it looks like web10g like interfaces are very often requested by
>> various teams.
>>
>> And we have many different views on how to hack this. I am astonished by
>> number of hacks I saw about this stuff going on.
>>
>> What about a clean way, extending current TCP_INFO, which is both
>> available as a getsockopt() for socket owners and ss/iproute2
>> information for 'external entities'
>>
>> If we consider web10g info needed, then adding a ftrace/eBPF like
>> interface is simply yet another piece of code we need to maintain,
>> and the argument of 'this should cost nothing if not activated' is
>> nonsense since major players need to constantly monitor TCP metrics and
>> behavior.
>>
>> It seems both FaceBook and Google are working on a subset of web10g.
>>
>> I suggest we meet together and establish a common ground, preferably
>> after Christmas holidays.
>>
>
> We've set up something for exactly this case at the end of January but have
> yet to get a response from Google.  If any of the Google people cc'ed (or
> really anybody, its not a strictly FB/Google thing) is interested please
> email me directly and I'll send you the details, we will be meeting face to
> face in the bay area at the end of January.  Thanks,
>

Maybe this would be good for discussion at netdev01?

> Josef
>
>


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Steven Rostedt @ 2014-12-15 22:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Josef Bacik, Eric Dumazet, Alexei Starovoitov, Laurent Chavey,
	Yuchung Cheng, Martin KaFai Lau, netdev@vger.kernel.org,
	David S. Miller, Hannes Frederic Sowa, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

On Mon, 15 Dec 2014 14:01:43 -0800
Tom Herbert <therbert@google.com> wrote:

> >
> > We've set up something for exactly this case at the end of January but have
> > yet to get a response from Google.  If any of the Google people cc'ed (or
> > really anybody, its not a strictly FB/Google thing) is interested please
> > email me directly and I'll send you the details, we will be meeting face to
> > face in the bay area at the end of January.  Thanks,
> >
> 
> Maybe this would be good for discussion at netdev01?

Is this something I should attend too? For this discussion that is.
Weather permitting, Ottawa is only a 4 1/2 hour drive for me.

-- Steve


* [iproute2] tc: Show classes more hierarchically
From: vadim4j @ 2014-12-15 22:48 UTC (permalink / raw)
  To: netdev; +Cc: vadim4j

Hi All,

I am experimenting with showing classes in a more hierarchical format.
I have some code, and example output from my tc build looks like this:

# tc/tc -t class show dev tap0

 \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
 \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
               \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
                      \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 


whereas in standard output mode it looks like this:

# tc/tc class show dev tap0

class htb 1:11 parent 1:10 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:111 parent 1:11 prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:10 parent 1:1 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:1 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:20 parent 1:1 leaf 20: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:2 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:30 parent 1:1 leaf 30: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:40 parent 1:2 leaf 40: prio 0 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:50 parent 1:2 leaf 50: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:60 parent 1:2 leaf 60: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 

So I'd like to ask whether this would be useful for tc users, and
whether a different format might be better.
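
A minimal sketch of how such a tree view could be derived from the flat
"class ... parent ..." list is shown below. It assumes the classes are
already parsed into (id, parent) pairs; struct demo_class and
demo_format_tree() are hypothetical names, not part of iproute2, and the
per-class parameters (rate, ceil, ...) are left out.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flat record: a class id and its parent ("" for a root). */
struct demo_class {
	const char *id;
	const char *parent;
};

/*
 * Write every class whose parent is `parent` into `out`, indented by
 * depth, each followed by its own children. Returns the number of bytes
 * written (excluding the trailing NUL). The caller should start with
 * out[0] == '\0' so that an empty result is still a valid string.
 */
static size_t demo_format_tree(const struct demo_class *cls, size_t n,
			       const char *parent, int depth,
			       char *out, size_t outlen)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		int w;

		if (strcmp(cls[i].parent, parent) != 0)
			continue;

		/* One leading space for roots, seven more per nesting level. */
		w = snprintf(out + used, outlen - used, "%*s\\---%s\n",
			     1 + depth * 7, "", cls[i].id);
		if (w < 0 || (size_t)w >= outlen - used)
			break;	/* output buffer full: stop early */
		used += (size_t)w;

		/* Depth-first: emit this class's children before its siblings. */
		used += demo_format_tree(cls, n, cls[i].id, depth + 1,
					 out + used, outlen - used);
	}
	return used;
}
```

Calling demo_format_tree(cls, n, "", 0, buf, sizeof(buf)) starts at the
root classes. The scan is quadratic in the number of classes, which is
probably acceptable for typical qdisc hierarchies; a real implementation
might index classes by parent instead.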

Thanks,


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: rapier @ 2014-12-15 22:17 UTC (permalink / raw)
  To: Tom Herbert, Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

The Web10g development team at PSC (we've been working with
a number of other organizations on this) will be submitting
the kernel instrument set tomorrow morning. We'd be happy to
join any discussion then.

Chris Rapier

On 12/15/14, 5:01 PM, Tom Herbert wrote:
> On Mon, Dec 15, 2014 at 8:42 AM, Josef Bacik <jbacik@fb.com> wrote:
>> On 12/15/2014 11:03 AM, Eric Dumazet wrote:
>>>
>>> On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>>>
>>>> I think patches 1 and 3 are good additions, since they establish
>>>> few permanent points of instrumentation in tcp stack.
>>>> Patches 4-5 look more like use cases of tracepoints established
>>>> before. They may feel like simple additions and, no doubt,
>>>> they are useful, but since they expose things via tracing
>>>> infra they become part of api and cannot be changed later,
>>>> when more stats would be needed.
>>>> I think systemtap like scripting on top of patches 1 and 3
>>>> should solve your use case ?
>>>> Also, have you looked at recent eBPF work?
>>>> Though it's not completely ready yet, soon it should
>>>> be able to do the same stats collection as you have
>>>> in 4/5 without adding permanent pieces to the kernel.
>>>
>>>
>>> So it looks like web10g like interfaces are very often requested by
>>> various teams.
>>>
>>> And we have many different views on how to hack this. I am astonished by
>>> number of hacks I saw about this stuff going on.
>>>
>>> What about a clean way, extending current TCP_INFO, which is both
>>> available as a getsockopt() for socket owners and ss/iproute2
>>> information for 'external entities'
>>>
>>> If we consider web10g info needed, then adding a ftrace/eBPF like
>>> interface is simply yet another piece of code we need to maintain,
>>> and the argument of 'this should cost nothing if not activated' is
>>> nonsense since major players need to constantly monitor TCP metrics and
>>> behavior.
>>>
>>> It seems both FaceBook and Google are working on a subset of web10g.
>>>
>>> I suggest we meet together and establish a common ground, preferably
>>> after Christmas holidays.
>>>
>>
>> We've set up something for exactly this case at the end of January but have
>> yet to get a response from Google.  If any of the Google people cc'ed (or
>> really anybody, its not a strictly FB/Google thing) is interested please
>> email me directly and I'll send you the details, we will be meeting face to
>> face in the bay area at the end of January.  Thanks,
>>
>
> Maybe this would be good for discussion at netdev01?
>
>> Josef
>>
>>


* ethtool 3.18 released
From: Ben Hutchings @ 2014-12-14 19:35 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

ethtool version 3.18 has been released.

Home page: https://www.kernel.org/pub/software/network/ethtool/
Download link:
https://www.kernel.org/pub/software/network/ethtool/ethtool-3.18.tar.xz

Release notes:

	* Fix: Lookup of SFP Tx bias in SFF-8472 module diagnostics (-m option)
	* Fix: Build with musl by using more common typedefs

Ben.

-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.



* [ANNOUNCE] libnftnl 1.0.3 release
From: Pablo Neira Ayuso @ 2014-12-15 23:21 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, netfilter, netfilter-announce, lwn

[-- Attachment #1: Type: text/plain, Size: 559 bytes --]

Hi!

The Netfilter project proudly presents:

        libnftnl 1.0.3

libnftnl is a userspace library providing a low-level netlink
programming interface (API) to the in-kernel nf_tables subsystem. The
library libnftnl has been previously known as libnftables. This
library is currently used by the nft command line tool.

This release comes with new features available up to 3.18, see
ChangeLog for more details.

You can download this library from:

http://www.netfilter.org/projects/libnftnl/downloads.html
ftp://ftp.netfilter.org/pub/libnftnl/

Have fun!

[-- Attachment #2: changes-libnftnl-1.0.3.txt --]
[-- Type: text/plain, Size: 3665 bytes --]

Alvaro Neira (2):
      ruleset: add set id to parsed sets
      src: internal set id allocation from nft_ruleset_parse*()

Ana Rey (15):
      chain: Free memory in the same function that is reserved
      chain: Use nft_rule_expr_set_* in the xml parsing code
      table: Free memory in the same function that is reserved
      table: Use nft_table_attr_set_* in the xml functions
      table: Add set, unset and parse implementation for the use attribute
      table: Do not print unset values in xml file
      table: Do not print unset values in json file
      chain: Add all support of use attribute
      chain: Do not print unset attributes in xml
      chain: Rename variables in nft_jansson_parse_chain functions
      chain: Do not print unset attributes in json
      expr: meta: Add pkttype support
      expr: meta: Add cpu support for meta expresion
      expr: meta: Add devgroup support
      expr: meta: Add cgroup support

Arturo Borrero (33):
      set: add support for set mechanism selection
      examples: nft-set-add: use batch infraestructure
      examples: nft-chain-del: add chain_del_parse()
      examples: nft-chain-del: support new batching interface
      set_elem: use proper free function
      examples: merge nft-chain-{xml|json}-add.c
      examples: nft-chain-parse-add: add batching support
      examples: merge nft-table-{xml|json}-add.c
      examples: nft-table-parse-add: add batching support
      examples: nft-table-add: add table_add_parse()
      examples: nft-table-add: add batching support
      examples: nft-table-del: add table_del_parse()
      examples: nft-table-del: add batching support
      src: fix printing of XML/JSON event wrapper header/footer
      expr: nat: add support for the new flags attribute
      expr: add new nft_masq expression
      nf_tables.h: add NFTA_MASQ_UNSPEC
      utils: nft_fprintf: prevent an empty buffer from being printed
      set: fix set nlmsg desc parsing
      examples: merge nft-rule-{xml|json}-add.c
      examples: nft-rule-parse-add: add batching support
      examples: nft-set-json-add: generalize parsing format support
      examples: nft-set-parse-add: add batching support
      examples: nft-table-add: fix wrong buffer pointer
      expr: masq: optional printing of flags attr in snprintf_default
      tests: add tests for the masq expression
      tests: also test nat flags attribute
      src: cleanup in mxml and jansson regarding set_id parsing
      utils: fix arp family number
      ruleset: deconstify _get interface
      src: add support for nft_redir expression
      tests: add tests for nft_redir expression
      examples: nft-rule-parse-add: fix wrong buffer usage when building rule header

Giuseppe Longo (1):
      buffer: include stdarg header

Pablo Neira Ayuso (16):
      expr: log: add support for level and flags
      src: stricter netlink attribute length validation
      set_elem: add nft_set_elems_nlmsg_build_payload_iter()
      common: add batching interfaces
      examples: nft-chain-add: add chain_add_parse()
      examples: nft-chain-add: support new batching interface
      utils: define xfree() as macro
      src: get rid of cached copies of x_tables.h and xt_LOG.h
      src: add ruleset generation class
      src: fix compilation without xml/json support
      remove empty src/attr.c
      expr: nat: use 'nat_type' instead of 'type' in the parser
      src: consolidate XML/JSON exportation
      expr: data_reg: use 'reg' instead of 'data_reg'
      bump version to 1.0.3
      include: add missing gen.h to Makefile.am

Álvaro Neira Ayuso (1):
      expr: log: define variable flags in xml parser



* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Jamal Hadi Salim @ 2014-12-15 23:27 UTC (permalink / raw)
  To: Arad, Ronen, John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505D9FA8C@ORSMSX106.amr.corp.intel.com>

On 12/15/14 13:36, Arad, Ronen wrote:
>
>
>> -----Original Message-----

> The behavior of a driver could depend on the presence of a bridge and features such as FDB LEARNING and LEARNING_SYNC.

Indeed, those are bridge attributes.

> A switch port driver which is not enslaved to a bridge might need to implement VLAN-aware FDB
> within the driver and report its content to user-space using ndo_fdb_dump.
>
> A switch port driver which is enslaved to a bridge could do with only pass through for static FDB configuration
> to the HW when LEARNING_SYNC is configured. FDB reporting to user-space and soft aging are left to the bridge module FDB.
> Such driver, without LEARNING_SYNC could still avoid maintaining in-driver FDB as long as it could dump the HW FDB on demand.
> LEARNING_SYNC also requires periodic updates of freshness information from the driver to the bridge module.
>


If you have an fdb, shouldn't that be exposed only if you have a bridge
abstraction exposed? i.e. that's where the Linux tools would work.
What I was referring to was a scenario where I have no interest in the
fdb despite the hardware having such a capability. VLANs are a different issue.

>>> Will the decision about using a bridge device or avoiding it be left
>>> to the end-user?
>>
>> Its a user policy decision. Again the offload bit gets us this in a reasonably
>> configurable way IMO.
>>
>>> (This requires switch port drivers to be able to work and provide
>>> similar functionality in both setups).
>>
>> Right, but if the drivers "care" who is calling their ndo ops something is
>> seriously broken. For the driver it should not need to know anything about
>> the callers so it doesn't matter to the driver if its a netlink call from user
>> space or an internal call from bridge.ko
>
> LEARNING_SYNC only makes sense when a switch port driver is enslaved to a bridge.
> Rocker switch driver indeed monitors upper change notifications and keeps track of master bridge presence.
> So bridge presence is not transparent.
>

Agreed - the challenge so far is that people have been fascinated by
the "switch" point of view. I think we are learning, and the usefulness
of the class device will eventually become obvious.

cheers,
jamal


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Jamal Hadi Salim @ 2014-12-15 23:28 UTC (permalink / raw)
  To: Tom Herbert, Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

On 12/15/14 17:01, Tom Herbert wrote:


>
> Maybe this would be good for discussion at netdev01?
>

Yes, it would be a good fit.
I just pinged Eric when I saw his email saying the same thing ;->

cheers,
jamal


* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Eric Dumazet @ 2014-12-15 23:40 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Josef Bacik, Alexei Starovoitov, Laurent Chavey,
	Yuchung Cheng, Martin KaFai Lau, netdev@vger.kernel.org,
	David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
	Lawrence Brakmo, Kernel Team
In-Reply-To: <548F6EB7.8040802@mojatatu.com>

On Mon, 2014-12-15 at 18:28 -0500, Jamal Hadi Salim wrote:
> On 12/15/14 17:01, Tom Herbert wrote:
> 
> 
> >
> > Maybe this would be good for discussion at netdev01?
> >
> 
> Yes it would be a good fit,
> I just pinged Eric when i saw his email saying the same thing ;->
> 

For the record, I made this suggestion to Josef in a private mail, sent
at 10am PST ;)


* Re: [bisected] tg3 broken in 3.18.0?
From: Bjorn Helgaas @ 2014-12-16  0:31 UTC (permalink / raw)
  To: Nils Holland; +Cc: David Miller, netdev, linux-pci@vger.kernel.org, Rajat Jain
In-Reply-To: <20141213210251.GA12812@teela.fritz.box>

On Sat, Dec 13, 2014 at 2:02 PM, Nils Holland <nholland@tisys.org> wrote:
> On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
>> From: Nils Holland <nholland@tisys.org>
>> Date: Sat, 13 Dec 2014 02:14:08 +0100
>>
>> >
>> > My bisect exercise suggests that the following commit is the culprit:
>> >
>> > 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
>> > ID to identify Configuration Request Retry)
>>
>> You definitely need to bring this up with the author of that change
>> and the relevant list for the PCI subsystem and/or linux-kernel.
>
> I've now already sent an inquiry to Rajat Jain, the author of the
> patch in question, and this message here is now also CC'd to
> linux-pci@.
>
> With this message, I'd like to add one last result of investigation
> I've done today, in the hope that it will aid the folks with more
> knowledge to go after the issue.
>
> Basically, I've added a little debug output to tg3.c in the function
> tg3_poll_fw(), as that function contained the code that would print
> out the "No firmware running" line that was visible in dmesg on those
> kernels where tg3 would not work for me. So, I basically had this:
>
> static int tg3_poll_fw(struct tg3 *tp)
> {
>         int i;
>         u32 val;
>
>         netdev_info(tp->dev, "XX: Boom!\n");
>         [...]
> }
>
> Now, I was looking through dmesg searching for occurrences of this
> debug output, using a standard 3.18.0 kernel (where my tg3 doesn't
> work) as well as using a 3.18.0 kernel with
> 89665a6a71408796565bfd29cfa6a7877b17a667 reverted (where my tg3
> works). Here are the results:
>
> [standard 3.18.0 (=problematic)]:
> [    2.197653] libphy: tg3 mdio bus: probed
> [    2.257488] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    2.259589] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    2.261740] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    2.263912] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [   10.028002] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [   10.028247] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [   12.157034] tg3 0000:02:00.0 enp2s0: No firmware running
>
>
> [3.18.0 without above mentioned patch, 3.17.3 is the same, both result
> in a working tg3]:
> [    1.397167] libphy: tg3 mdio bus: probed
> [    1.456473] tg3 0000:02:00.0
>         (unnamed net_device) (uninitialized): XX: Boom!
> [    1.464987] tg3 0000:02:00.0 eth0:
>         Tigon3 [partno(BCM57780) rev 57780001] (PCI Express) MAC address
>         00:19:99:ce:13:a6
> [    1.467118] tg3 0000:02:00.0 eth0:
>         attached PHY driver [Broadcom BCM57780] (mii_bus:phy_addr=200:01)
> [    1.469311] tg3 0000:02:00.0 eth0:
>         RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [    1.471500] tg3 0000:02:00.0 eth0:
>         dma_rwctrl[76180000] dma_mask[64-bit]
> [...]
> [    9.631629] tg3 0000:02:00.0: irq 25 for MSI/MSI-X
> [    9.631962] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.634339] tg3 0000:02:00.0 enp2s0: XX: Boom!
> [    9.642741] IPv6:
>         ADDRCONF(NETDEV_UP): enp2s0: link is not ready
> [   10.479636] tg3 0000:02:00.0
>         enp2s0: Link is down
> [   11.484498] tg3 0000:02:00.0
>         enp2s0: Link is up at 100 Mbps, full duplex
>
> As can be seen, there are two tg3-related sections in my dmesg in both
> the working and non-working scenarios: At about 1 - 2 secs, the card
> seems to begin initializing, and at about 9 - 10 seconds it is (or
> should be) ready to establish a network connection.
>
> My debug section, or tg3.c's tg3_poll_fw(), seems to be called thrice
> in the working situation: The first hit occurs at 1.456473 where the tg3
> device is still reported as "(unnamed net_device) (uninitialized)".
> Then, the section gets hit twice again at around 9.63 - at this point
> the driver already reports the card as initialized / by its real name.
>
> In the non-working situation, the debug sections seems to be hit only
> once, at 10.028247. At this point, the tg3 is already reported as
> initialized - just like when it's hit the second and third time in the
> working situation.
>
> Bottom line is that commit 89665a6a71408796565bfd29cfa6a7877b17a667
> really makes a difference regarding the way the tg3 card is
> initialized, which seems to cause the problem.

Hi Nils,

Thanks a lot for the bug report.  Can you open a bugzilla at
http://bugzilla.kernel.org, put it in the drivers/PCI component, mark
it as a regression, and attach the complete dmesg log for both the
working and non-working cases, as well as "lspci -vv" output for the
working case?

I don't yet see how 89665a6a7140 makes a difference here.  We must
eventually read PCI_VENDOR_ID_BROADCOM (0x14e4) because the tg3 driver
claimed the device.

Can you still reproduce the problem if you print out the value of "l"
every time we read PCI_VENDOR_ID in pci_bus_read_dev_vendor_id()?
That will change the timing, so it's possible that will make it harder
to reproduce.

Bjorn


* [ANNOUNCE] nftables 0.4 release
From: Pablo Neira Ayuso @ 2014-12-16  0:40 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, netfilter, netfilter-announce, lwn, kaber

[-- Attachment #1: Type: text/plain, Size: 4742 bytes --]

Hi!

The Netfilter project proudly presents:

        nftables 0.4

This release contains a lot of bug fixes and support for new features
available up to the recent 3.18 kernel release (and some features coming
up in the yet unreleased 3.19-rc).

New features
============

* Add support for global ruleset operations (available since 3.18).
  Get rid of all tables, chains, and rules in one go:

        # nft flush ruleset

  List the ruleset for all existing families:

        # nft list ruleset

  You can save the ruleset and restore it via:

        # echo "flush ruleset" > ruleset.file
        # nft list ruleset >> ruleset.file
        # nft -f ruleset.file

  This mimics iptables-restore, including the ruleset for all
  supported families: ip, ip6, inet, bridge and arp.
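
  As a rough illustration of the save/restore steps above, the following
  Python sketch assembles the same file. The helper names
  build_restore_file() and save_ruleset() are made up for this example,
  and it assumes an nft binary with 'list ruleset' support is on the PATH.

```python
import subprocess


def build_restore_file(listing: str) -> str:
    """Prefix a ruleset listing with 'flush ruleset'.

    Loading the result with 'nft -f' first drops the existing ruleset and
    then recreates everything from the listing.
    """
    return "flush ruleset\n" + listing


def save_ruleset(path: str) -> None:
    """Hypothetical helper: write a restorable dump of the live ruleset."""
    # Same command as '# nft list ruleset' above; typically needs root.
    listing = subprocess.run(
        ["nft", "list", "ruleset"],
        check=True, capture_output=True, text=True,
    ).stdout
    with open(path, "w") as f:
        f.write(build_restore_file(listing))
```

  The flush line goes first so that restoring the file replaces the
  running ruleset instead of adding to it.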

* Full logging support for all the families, including nfnetlink_log
  support (available since 3.17).

* Automatic selection of the optimal set implementation (available
  since 3.16).

  You can tell the kernel to optimize your set representation
  according to the space-time tradeoff, e.g. optimize memory:

        # nft add set filter set1 { type ipv4_addr ; policy memory ; }

  Or optimize performance:

        # nft add set filter set1 { type ipv4_addr ; policy performance ; }

  You can also use this in maps:

        # nft add map filter map1 { type ipv4_addr : verdict ; policy performance ; }

  And indicate the expected size to assist the set selection routine:

        # nft add set filter set1 { type ipv4_addr ; size 1024 ; }

* Complete reject support (available for ip, ip6 and inet since 3.14;
  bridge support and the icmpx abstraction since 3.18).

        # nft add rule filter input reject with icmp type host-unreachable

  and for IPv6:

        # nft add rule ip6 filter input reject with icmpv6 type no-route

  you can use the ICMPx abstraction from the inet table:

        # nft add rule inet filter input reject with icmpx type no-route

  and reject TCP traffic with reset packets:

        # nft add rule filter input reject with tcp reset

* Masquerading support (available since 3.18).

        # nft add rule nat postrouting masquerade

* Redirect support (available since upcoming Linux kernel 3.19-rc).

        # nft add rule nat prerouting tcp dport 22 redirect to 2222

* Support for NAT flags: random, fully-random, persistent.

* Consistency checks for interferences between updates and ruleset dumps
  (initially available since 3.16, enhanced with ruleset generations
   since 3.18).

* Extend meta to support pkttype, cpu and devgroup matching.

* Automatic regression tests through our customized python shell
  script.

* Allow disabling libreadline and debug at the configure stage.

* Full conversion to autotools.

Syntax changes
==============

* 'queue' flags are now expressed as a list of comma-separated symbols:

        # nft add filter input counter queue num 0-3 fanout,bypass

  for consistency with flags, that are always expressed like this.

* nft doesn't resolve names by default anymore. IP addresses are
  always expressed in the numeric representation. A new '-N' option
  allows you to request name resolution.

Bug fixes
=========

* Crash with anonymous sets with lots of elements.

* Several annoying byteorder issues that resulted in incorrect bytecode
  generation and wrong listings.

* Endianness problems reported from little endian archs.

* Named verdict maps, eg.

        # nft add map filter my_vmap { type ipv4_addr : verdict\; }
        # nft add element filter my_vmap { 1.1.1.1 : drop, 2.2.2.2 : drop}
        # nft add rule filter input ip saddr vmap @my_vmap

* Crash in 'nft describe' with wrong expressions.

* Parsing of ether types.

* Crash on usage of basetypes, eg.

        # nft add rule filter input ct state 8 accept

  instead of 'ct state new'.

* Crash on wrong values when performing basetype parsing, eg.

     <cmdline>:1:29-31: Error: Could not parse conntrack state
     add rule test test ct state xxx accept
                                 ^^^

* Broken listing of meta and ct range expressions, eg.

        nft add rule filter input meta length 33-55 counter

* Don't display a BUG message on too large decimal/hexadecimal values.

Resources
=========

The nftables code can be obtained from:

* http://netfilter.org/projects/nftables/downloads.html
* ftp://ftp.netfilter.org/pub/nftables
* git://git.netfilter.org/nftables

To build the code, libnftnl and libmnl are required:

* http://netfilter.org/projects/libnftnl/index.html

Thanks
======

Thanks to all our contributors, testers and bug reporters, who have
all helped to get rid of a good bunch of bugs and push new features.

On behalf of the Netfilter Core Team,
Happy bytecode execution :)

[-- Attachment #2: changes-nftables-0.4.txt --]
[-- Type: text/plain, Size: 7597 bytes --]

Alvaro Neira (15):
      linealize: generate unary expression with the appropiate operation
      payload: generate dependency in the appropriate byteorder
      src: Enhance payload_gen_dependency()
      datatype: Enhance symbolic_constant_parse()
      nft: complete reject support
      evaluate: fix a crash if we specify ether type or meta nfproto in reject
      delinearize: list the icmpx reason with the string associated
      evaluate: reject: fix crash if we specify ether type or meta nfproto
      evaluate: reject: fix crash if we have transport protocol conflict from inet
      test: update and add the reject tests for ip, ip6, bridge and inet.
      evaluate: reject: accept a reject reason with incorrect network context
      evaluate: reject: check in bridge and inet the network context in reject
      evaluate: reject: check the context in reject without reason for bridge and inet tables
      evaluate: reject: enhance the error support throwing message with more details
      evaluate: reject: fix crash on NULL location with bridge and tcp reset

Alvaro Neira Ayuso (1):
      src: add specific byteorder to the struct proto_hdr_template

Ana Rey (15):
      src: Add support for pkttype in meta expresion
      src: Add support for cpu in meta expresion
      src: meta: Fix the size of cpu attribute
      src: Add devgroup support in meta expresion
      tests: Add automated regression testing
      tests: Add ip folder with test files
      tests: Add ip6 folder with test files.
      tests: Add inet folder with test files.
      tests: Add arp folder with test files.
      tests: Add bridge folder with test files.
      tests: Add any folder with test files.
      tests: regression: Delete all reference to wlan0 in test files
      tests: regression: Delete an unnecessary whitespace in an output messages
      meta: Add support for datatype devgroup
      src: Add cgroup support in meta expresion

Arturo Borrero (18):
      netlink: monitor: add a helper function to handle sets referenced by a rule
      netlink: monitor: fix how rules with intervals are printed
      doc: update documentation with 'monitor' and 'export'
      src: add `flush ruleset'
      netlink: include file and line in netlink ABI errors
      src: add set optimization options
      rule: rename do_command_list_cleanup() to table_cleanup()
      rule: factorize chain and table listing code
      src: add list ruleset command
      src: add nat persistent and random options
      src: add masquerade support
      tests: add tests for masquerade
      mnl: delete useless parameter nf_sock in batch functions
      src: add redirect support
      nft: don't resolve hostnames by default
      tests/regression: masquerade: fix invalid syntax
      tests/regression: redirect: fix invalid syntax
      parser: allow both nat_flags and port specification in redirect

David Kozub (1):
      build: add missing \ in src/Makefile.am (AM_CPPFLAGS)

Eric Leblond (2):
      scanner: fix reading of really long line
      datatype: fix name of icmp* code

Giorgio Dal Molin (2):
      build: add autotools support for the 'doc' subdir
      build: add autotools support for the 'files' subdir

Kevin Fenzi (1):
      doc: nft: Fix trivial error in man page where flush should be rename

Pablo Neira Ayuso (53):
      proto: initialize result expression in ethertype_parse()
      mnl: immediately return on errors in mnl_nft_ruleset_dump()
      mnl: check for NLM_F_DUMP_INTR when dumping object lists
      mnl: add nft_batch_continue() helper
      mnl: add nft_nlmsg_batch_current() helper
      src: rework batching logic to fix possible use of uninitialized pages
      main: propagate error to shell
      mnl: introduce NFT_NLMSG_MAXSIZE
      mnl: fix crashes when using sets with many elements
      src: add level option to the log statement
      src: don't return error in netlink_linearize_rule()
      include: refresh include/linux/nf_tables.h cached copy
      log: netlink_linearize: don't set level if user didn't specify
      src: fix 'describe' command when passing wrong expressions
      mnl: consistency checks across several netlink dumps
      mnl: use nft_batch_begin and nft_batch_end from libnftnl
      src: interpret the event type from the evaluation step
      netlink: use switch whenever possible in the monitor code
      utils: indicate file and line on memory allocation errors
      include: refresh cached copy of nf_tables.h
      build: use PKG_CHECK_MODULES to check for libmnl and libnftnl
      build: use AC_PROG_YACC and AM_PROG_LEX
      rename parser.y to parser_bison.y
      include: add cli.h
      build: autotools conversion
      netlink: don't bug on unknown events
      src: restore nft --debug
      parser: restore named vmap
      tests: regression: any/queue.t: use new syntax
      tests: regression: don't use -nnn for non-list commands
      tests: regression: fix bogus error due to bash
      tests: regression: test masquerade from nat/postrouting too
      datatype: fix crash when using basetype instead of symbolic constants
      datatype: relax datatype check in integer_type_parse()
      netlink_delinearize: clone on netlink_get_register(), release previous on _set()
      meta: set base field on clones
      tests: regression: fix "Listing is broken" instead of output mismatch
      tests: regression: any/ct: remove wrong output
      scanner: don't bug on too large values
      payload: fix endianess issue in payload_expr_pctx_update()
      src: generate set members using integer_type in the appropriate byteorder
      netlink_delinearize: fix listing of set members in host byteorder using integer_type
      netlink: fix listing of range set elements in host byteorder
      rule: fix segmentation faults on kernels without nftables support
      tests: regression: adapt nat tests to use random-fully
      tests: regression: redirect.t: fix bogus errors
      parser: use 'redirect to PORT' instead of 'redirect :PORT'
      tests: regression: fix wrong number of test files
      tests: regression: simplify run_test_file() in case `-e' is used
      tests: regression: log.t: this works for bridge and arp since 3.17
      build: restore --disable-debug
      datatype: missing byteorder in string_type
      Bump version to v0.4

Patrick McHardy (16):
      netlink: check and handle errors from netlink_delinearize_set()
      evaluate: fix concat expressions as map arguments
      payload: take endianess into account when updating the payload context
      datatype: take endianess into account in symbolic_constant_print()
      proto: fix byteorder of ETH_P_* values
      verdict type: handle verdict flags and encoded additional information
      parser: simplify monitor command parsing
      parser: compact log level grammar
      expr: make range_low()/range_high() usable outside of segtree
      queue: clean up queue statement
      parser: rearrange monitor/export rules
      dtype: remove unnecessary icmp* parse/print functions
      stmt: rename nat "random-fully" option to "fully-random"
      meta: properly align types in meta_template table
      dtype: fix memory leak in concat_type_destroy()
      datatype: print datatype name in datatype_print() BUG message

Steven Barth (2):
      build: allow disabling libreadline-support
      build: remove unnecessary libintl.h check

Yanchuan Nian (2):
      Fix memory leak in nft get operation
      Fix typo in chain hook parsing

Yuxuan Shui (1):
      payload: use proto_unknown for raw protocol header



* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: John Fastabend @ 2014-12-16  0:45 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <548EF05E.6050401@mojatatu.com>

On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
> On 12/12/14 15:05, John Fastabend wrote:
>> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
>
>
>> I'll wake up ;)
>
>
> Vlad made me go over those patches in a few iterations to make
> sure that the use cases covered in the test case work. It is
> holiday season, so he may be offline.
>

Yep.

>> First quick grep of code finds some strange uses of ndo_fdb_dump like
>> this in macvlan,
>>
>>    ./drivers/net/macvlan.c
>>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>>
>> I'll be sending a patch once net-next opens up again to resolve it. Its
>> harmless though so not really a fix for net.
>>
>> There seem to be a few places that have the potential to return
>> different values then the uc/mc lists.
>>
>>      ./drivers/net/vxlan.c
>>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>>      ./drivers/net/ethernet/rocker/rocker.c
>>
>>      ./net/bridge/br_device.c
>>
>
> Yes, thats my observation as well.
> The question is: Are multi/unicast address unconditionally dumped?

Hmm, good question. When I implemented this on the host NICs with SR-IOV,
VMDQ, etc., the multi/unicast addresses were propagated into the FDB by
the driver. My logic was that if some netdev ethx has a set of MAC
addresses above it, then any virtual function or virtual device also
behind the hardware shouldn't be sending those addresses out the egress
switch-facing port. Otherwise the switch will see packets it knows are
behind that port and drop them, or flood them if it hasn't learned the
address yet. Either way they will never get to the right netdev.

Admittedly I wasn't thinking about switches with many ports at the time.

> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
> the other driver did).

My original thinking here was: if a driver didn't implement fdb_add,
fdb_del and fdb_dump, then you could think of it as having a forwarding
database if you wanted, but it was really just a two-port MAC relay, in
which case just dump all the MAC addresses it knows about. If it was
something fancier, it could do its own dump, like vxlan or macvlan.

> If you go over the original thread exchange with Vlad, you'll notice
> i was kind of unsure why dumping of unicast/multicast had anything to
> do with fdb dumping.
> It is still my view that we shouldnt be treating these addresses as if
> they were fdb entries. But: The problem is once you allow an API to
> user space you cant take it back even if people are depending on bugs.
>

For a host NIC, ucast/multicast and fdb are the same, I think? The
code we had was just shorthand to allow the common case, a host NIC,
to work. Notice vxlan and bridge drivers didn't dump their addr lists
from fdb_dump until your patch.

Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
I shouldn't overload the addr lists.

>
>> So I guess we can walk through the list and analyse them a bit.
>>
>> vxlan:
>>
>> Try stacking devices on top of the vxlan device this will call a uc_add
>> routine if you then change the mac addr on the vlan. This would get
>> reported by the dflt fdb dump handlers but not the drivers fdb dump
>> handlers. So removing the dflt dump handler from this patch at least
>> changes things. We should either explain why this is OK or accept that
>> the driver needs to be fixed. Or I guess that the patch is just wrong.
>> My guess is one of the latter options.
>>
>> Also Jamal, your original patch seems like it might of changed this
>> and Hubert's patch is reverting back to its original case. Was this
>> specific part of your patch intentional?
>>
>
> Yes.
> This is based on the view that unicast/multicast must be dumped
> *unconditionally*. If the view is that uni/mcast addresses are
> dumped conditionally based on what the driver thinks, then Hubert's
> one liner is good. But i really would like Vlad to comment. 80%
> of the effort on my part if you look at the thread was the refactoring
> of the code to meet the use case.

I'm interested to see what Vlad says as well. But the situation is that
previously some drivers dumped their addr lists and others didn't.
Specifically, the more switch-like devices (bridge, vxlan) didn't. Now
every device will dump the addr lists. I'm not entirely convinced that
is correct.
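
To make the two behaviours under discussion concrete, here is a toy
Python model. It is not kernel code: the NetDev class and the function
names are invented for illustration, and the real rtnetlink dump path
carries more state (skb, callback index) than shown here.

```python
class NetDev:
    def __init__(self, name, addr_list, ndo_fdb_dump=None):
        self.name = name
        self.addr_list = list(addr_list)   # the device's uc/mc address lists
        self.ndo_fdb_dump = ndo_fdb_dump   # driver-specific dump, or None


def dflt_fdb_dump(dev):
    """Default handler: report the device's own uc/mc addresses."""
    return list(dev.addr_list)


def fdb_dump_unconditional(dev):
    """Driver dump (if any) followed by the default dump, always."""
    entries = dev.ndo_fdb_dump(dev) if dev.ndo_fdb_dump else []
    return entries + dflt_fdb_dump(dev)


def fdb_dump_either_or(dev):
    """Driver dump if defined, otherwise the default dump."""
    if dev.ndo_fdb_dump:
        return dev.ndo_fdb_dump(dev)
    return dflt_fdb_dump(dev)


# A host NIC without its own handler, and a switch-like device with one.
host_nic = NetDev("eth0", ["01:00:5e:00:00:01"])
vxlan = NetDev("vxlan0", ["02:00:00:00:00:aa"],
               ndo_fdb_dump=lambda dev: ["00:11:22:33:44:55"])
```

For the host NIC both variants give the same result. They differ only
for devices that define their own handler, which is exactly where the
question of whether the addr lists belong in the dump comes up.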

>
> I thought the abstraction which requires that your own MAC addresses
> are treated as fdb entries was broken - but it is too late to change
> that.
>

It works OK for host NICs (NICs that can't forward between ports) and
seems at best confusing for real switch ASICs. On a related note, do
you expect the switch ASIC to trap any packets with MAC addresses in
the multi/unicast address lists and send them to the correct netdev? Or
will the switch forward them using normal FDB tables?

Also, I don't think it's too late to fix it. Maybe we just had some
buggy drivers.

> cheers,
> jamal

-- 
John Fastabend         Intel Corporation


* RE: [E1000-devel] [PATCH] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
From: Hiroshi Shimamoto @ 2014-12-16  0:49 UTC (permalink / raw)
  To: Alexander Duyck, e1000-devel@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, Choi, Sy Jong, Hayato Momma,
	linux-kernel@vger.kernel.org
In-Reply-To: <54809D57.9060804@gmail.com>

> > Subject: Re: [E1000-devel] [PATCH] ixgbe, ixgbevf: Add new mbox API to enable MC promiscuous mode
> >
> > On 11/27/2014 02:39 AM, Hiroshi Shimamoto wrote:
> > > From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> > >
> > > The limitation of the number of multicast address for VF is not enough
> > > for the large scale server with SR-IOV feature.
> > > IPv6 requires the multicast MAC address for each IP address to handle
> > > the Neighbor Solicitation message.
> > > We couldn't assign over 30 IPv6 addresses to a single VF interface.
> > >
> > > The easy way to solve this is enabling multicast promiscuous mode.
> > > It is good to have a functionality to enable multicast promiscuous mode
> > > for each VF from VF driver.
> > >
> > > This patch introduces the new mbox API, IXGBE_VF_SET_MC_PROMISC, to
> > > enable/disable multicast promiscuous mode in VF. If multicast promiscuous
> > > mode is enabled the VF can receive all multicast packets.
> > >
> > > With this patch, the ixgbevf driver automatically enable multicast
> > > promiscuous mode when the number of multicast addresses is over than 30
> > > if possible.
> > >
> > > This also bump the API version up to 1.2 to check whether the API,
> > > IXGBE_VF_SET_MC_PROMISC is available.
> > >
> > > Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
> > > CC: Choi, Sy Jong <sy.jong.choi@intel.com>
> > > Reviewed-by: Hayato Momma <h-momma@ce.jp.nec.com>
> >
> > This is a REALLY bad idea unless you plan to limit this to privileged VFs.
> >
> > I would recommend looking at adding an ndo operation to control this
> > feature so that it could be disabled by default in the PF and only
> > enabled on the host side if specifically requested.  Otherwise the
> 
> Do you mean that PF driver should have the flag to enable or disable per VF
> and disallow the request from VF?

Could you answer that question?

> 
> > problem is I can easily see this leading security issues as the VFs
> > might begin getting access to messages that they aren't supposed to.
> 
> OK, by the way, I think that the current ixgbe and ixgbevf implementation
> has already such issue. The guest can add hash entry to receive MAC and it
> can get every multicast MAC frame with the current mbox API.
> Does your concern come from the easiness of doing that?

There is a single MTA per PF, not one per VF.
A VF requests the PF to register the hash of an MC MAC; the PF then sets a
bit in the MTA and sets the IXGBE_VMOLR_ROMPE flag of the VF, which enables
packet switching to the VF if an MC MAC hits a hash entry in the MTA.
If VM1 has VF1 which uses MC MAC1 and VM2 has VF2 which uses MC MAC2, both
VM1 and VM2 will receive MC MAC1. VM2 doesn't know why it receives MAC1.
In other words, in the current implementation, a VF receives all multicast
packets that are registered by other VFs.
For that reason, I hadn't imagined that enabling MC promiscuous mode
would increase the reception of MC messages that VFs aren't supposed to see.
I think that this patch doesn't change that behavior.
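
The sharing described above can be sketched with a small Python model.
This is only an illustration, not driver code: the PF class and its
method names are invented, and the 12-bit hash in mta_vector() is an
assumed, simplified stand-in for whatever filter type the hardware is
actually configured with.

```python
def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def mta_vector(mac):
    """Assumed 12-bit hash taken from the last two address bytes."""
    return ((mac[4] >> 4) | (mac[5] << 4)) & 0xFFF


class PF:
    def __init__(self):
        self.mta = set()     # bits set in the single, shared MTA
        self.rompe = set()   # VFs that have the ROMPE flag set

    def vf_register_mc(self, vf, mac):
        """Mailbox request from a VF: set the MTA bit and the VF's ROMPE."""
        self.mta.add(mta_vector(mac))
        self.rompe.add(vf)

    def vf_receives(self, vf, mac):
        """A VF gets the frame if ROMPE is set and the hash hits the MTA."""
        return vf in self.rompe and mta_vector(mac) in self.mta


MAC1 = parse_mac("33:33:00:00:00:01")
MAC2 = parse_mac("33:33:00:00:00:02")

pf = PF()
pf.vf_register_mc(1, MAC1)   # VF1 asks for MAC1
pf.vf_register_mc(2, MAC2)   # VF2 asks for MAC2
```

In this model pf.vf_receives(2, MAC1) is true even though VF2 never
asked for MAC1, because the table has no per-VF ownership of bits.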

thanks,
Hiroshi


* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16  0:58 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org
  Cc: Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <548F6E62.1040500@mojatatu.com>

> -----Original Message-----
> From: Jamal Hadi Salim [mailto:jhs@mojatatu.com]
> Sent: Tuesday, December 16, 2014 1:28 AM
> To: Arad, Ronen; John Fastabend; netdev@vger.kernel.org
> Cc: Roopa Prabhu; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org;
> tgraf@suug.ch; stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/14 13:36, Arad, Ronen wrote:
> >
> >
> >> -----Original Message-----
> 
> > The behavior of a driver could depend on the presence of a bridge and
> features such as FDB LEARNING and LEARNING_SYNC.
> 
> Indeed, those are bridge attributes.
> 
> > A switch port driver which is not enslaved to a bridge might need to
> >implement VLAN-aware FDB within the driver and report its content to user-
> space using ndo_fdb_dump.
>  >
> > A switch port driver which is enslaved to a bridge could do with only
> > pass through for static FDB configuration
>  > to the HW when LEARNING_SYNC is configured. FDB reporting to user-
> space and soft aging are left to the bridge module FDB.
> > Such driver, without LEARNING_SYNC could still avoid maintaing in-driver
> FDB as long as it could dump the HW FDB on demand.
> > LEARNING_SYNC also requires periodic updates of freshness information
> from the driver to the bridge module.
> >
> 
> 
> If you have an fdb - shouldnt that be exposed only if you have a bridge
> abstraction exposed? i.e thats where the Linux tools would work.

I'm trying to find out what the opinions of other people on the netdev list are.
John has clearly stated that he'd like to see full L2 switching functionality (at least) supported without making a bridge device mandatory.
The existing bridge ndos (ndo_bridge_{set,del,get}link) already support that with proper setting of SELF/MASTER flags by iproute2.
I see the value in supporting both approaches (bridge device mandatory and bridge device optional). If the choice is left to user-driven policy decision, we need to document both use models and map traditional L2 features to each model. 
The L2 offloading (or NETFUNC as it is currently called), which is being discussed on a different patch-set, is only needed when a bridge device is used.
Without a bridge device, all configuration has to be targeted at the switch port driver directly using the SELF flag. FDB remains relevant and it is used to configure static MAC table entries and dump the HW MAC table.
When the HW device is an L2 switch or a multi-layer switch (L2-L3 or even higher), there is a gap between what the HW is doing and what is explicitly modeled in Linux. Without a bridge device, the HW is represented by a set of switch port devices and the bridging (both control and data planes) takes place only in the HW and switch port driver.
Each switch port driver has to implement its own FDB as there is no common shared code among drivers for different HW devices.
Using a bridge device could partially alleviate that, but it comes with a cost. There is a need to properly implement offloading of both configuration and data-path. The transmit and receive path in the bridge module should be somehow bypassed to avoid unnecessary overhead or duplicate packets coming from both software bridging and HW bridging.

> What i was refering to was a scenario where i have no interest in the fdb
> despite such a hardware capabilities. VLANs is a different issue;
>
VLAN is a fundamental feature of L2 and L3 switching and Linux is unclear about it. A bridge device could model bridging of untagged packets, which requires a bridge device for each VLAN and a vlan device on each port that is a member of the bridge's VLAN.
This is different from the behavior and configuration of classic closed-source switches.
An alternative model is VLAN filtering, where a bridge is VLAN-aware and switches tagged traffic. A bridge device represents multiple L2 domains with a VLAN filtering policy that defines the switching rules within each domain. Forwarding (e.g. L3 routing) is expected across such L2 domains using L3 entities.
The modeling of L3 entities per L2 domain (e.g. per-VLAN) in the VLAN filtering model is still unclear to me.
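
As a minimal sketch of the VLAN filtering model (the L2 part only), the following Python toy keeps one FDB keyed by (VLAN, MAC) and a per-port VLAN membership set. The class and method names are invented for illustration and do not correspond to the bridge module's actual data structures.

```python
class VlanAwareBridge:
    def __init__(self):
        self.port_vlans = {}   # port name -> set of allowed VLAN IDs
        self.fdb = {}          # (vid, mac) -> port name

    def add_port(self, port, vids):
        self.port_vlans[port] = set(vids)

    def forward(self, in_port, vid, src, dst):
        """Return the sorted list of egress ports for one tagged frame."""
        if vid not in self.port_vlans.get(in_port, set()):
            return []                      # ingress filtering: drop
        self.fdb[(vid, src)] = in_port     # learning is per VLAN
        out = self.fdb.get((vid, dst))
        if out is not None:
            # Known unicast: never send back out of the ingress port.
            if out != in_port and vid in self.port_vlans[out]:
                return [out]
            return []
        # Unknown destination: flood, but only inside this VLAN.
        return sorted(p for p, vids in self.port_vlans.items()
                      if vid in vids and p != in_port)


br = VlanAwareBridge()
br.add_port("p1", {10, 20})
br.add_port("p2", {10})
br.add_port("p3", {20})
```

Each VLAN behaves as its own L2 domain here: a MAC learned in VLAN 10 is unknown in VLAN 20, and nothing crosses between the two without a separate L3 entity, which this sketch deliberately leaves out.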

> >>> Will the decision about using a bridge device or avoiding it be left
> >>> to the end-user?
> >>
> >> Its a user policy decision. Again the offload bit gets us this in a
> >> reasonably configurable way IMO.
> >>
> >>> (This requires switch port drivers to be able to work and provide
> >>> similar functionality in both setups).
> >>
> >> Right, but if the drivers "care" who is calling their ndo ops
> >> something is seriously broken. For the driver it should not need to
> >> know anything about the callers so it doesn't matter to the driver if
> >> its a netlink call from user space or an internal call fro bridge.ko
> >
> > LEARNING_SYNC only makes sense when a switch port driver is enslaved to
> a bridge.
>  > Rocker switch driver indeed monitors upper change notifications and keep
> track of master bridge presence.
> > So bridge presence is not transparent.
> >
> 
> Agreed - the challenge so far is that people have been fascinated by "switch"
> point of view. I think we are learning and the class device will eventually
> become obvious as useful.
> 
> cheers,
> jamal
