* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04 7:54 UTC (permalink / raw)
To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100403.142522.01040744.davem@davemloft.net>
On Saturday, 03 April 2010 at 14:25 -0700, David Miller wrote:
> From: James Chapman <jchapman@katalix.com>
> Date: Fri, 02 Apr 2010 17:18:23 +0100
>
> > This patch series adds L2TPv3 support. It splits the existing pppol2tp
> > driver to separate its L2TP and PPP parts, then adds new L2TPv3
> > functionality. The patches implement a new socket family for L2TPv3 IP
> > encapsulation, expose virtual netdevices for each L2TPv3 ethernet
> > pseudowire and add a netlink interface.
>
> Ok I'm going to toss this into net-next-2.6, we have time to
> fixup any remaining issues people might discover.
>
[PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
Followup to commit 789a4a2c
(l2tp: Add support for static unmanaged L2TPv3 tunnels)
One missing init in l2tp_tunnel_sock_create() could access random kernel
memory, and a bit field should be unsigned.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/l2tp/l2tp_core.c | 2 +-
net/l2tp/l2tp_core.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 13ed85b..98dfcce 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1227,7 +1227,7 @@ static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2t
 	int err = -EINVAL;
 	struct sockaddr_in udp_addr;
 	struct sockaddr_l2tpip ip_addr;
-	struct socket *sock;
+	struct socket *sock = NULL;
 	switch (cfg->encap) {
 	case L2TP_ENCAPTYPE_UDP:
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 91b1b9c..f0f318e 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -152,7 +152,7 @@ struct l2tp_tunnel_cfg {
 	struct in_addr peer_ip;
 	u16 local_udp_port;
 	u16 peer_udp_port;
-	int use_udp_checksums:1;
+	unsigned int use_udp_checksums:1;
 };
 struct l2tp_tunnel {
^ permalink raw reply related
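A note on the bit-field half of this fix, as a minimal userspace sketch rather than kernel code. The struct and function names below are hypothetical. A 1-bit field declared with a signed type can represent only 0 and -1 on the usual two's-complement compilers, so a comparison such as `flag == 1` never succeeds, while the unsigned variant stores 0 and 1 as expected.

```c
/* Hypothetical stand-ins for the field changed in struct l2tp_tunnel_cfg. */
struct cfg_signed_sketch {
	signed int use_udp_checksums:1;		/* can hold 0 or -1 */
};

struct cfg_unsigned_sketch {
	unsigned int use_udp_checksums:1;	/* can hold 0 or 1 */
};

/* Store v in the signed 1-bit field and read it back. */
static int signed_flag_roundtrip(int v)
{
	struct cfg_signed_sketch cfg = { 0 };

	cfg.use_udp_checksums = v;	/* result is implementation-defined for v == 1 */
	return cfg.use_udp_checksums;
}

/* Store v in the unsigned 1-bit field and read it back. */
static int unsigned_flag_roundtrip(int v)
{
	struct cfg_unsigned_sketch cfg = { 0 };

	cfg.use_udp_checksums = v;	/* only the low bit is kept */
	return cfg.use_udp_checksums;
}
```

Calling `signed_flag_roundtrip(1)` is expected to give a value other than 1 (typically -1 with GCC and Clang), which is the kind of surprise the unsigned declaration avoids.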
* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: David Miller @ 2010-04-04 8:02 UTC (permalink / raw)
To: eric.dumazet; +Cc: jchapman, netdev
In-Reply-To: <1270367668.1971.3.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 04 Apr 2010 09:54:28 +0200
> [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
>
> Followup to commit 789a4a2c
> (l2tp: Add support for static unmanaged L2TPv3 tunnels)
>
> One missing init in l2tp_tunnel_sock_create() could access random kernel
> memory, and a bit field should be unsigned.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04 8:14 UTC (permalink / raw)
To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100404.010259.161188935.davem@davemloft.net>
On Sunday, 04 April 2010 at 01:02 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 04 Apr 2010 09:54:28 +0200
>
> > [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
> >
> > Followup to commit 789a4a2c
> > (l2tp: Add support for static unmanaged L2TPv3 tunnels)
> >
> > One missing init in l2tp_tunnel_sock_create() could access random kernel
> > memory, and a bit field should be unsigned.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, thanks Eric.
I am going to work on net/l2tp/l2tp_core.c, since the RCU conversion is
wrong (but the original code was wrong too...)
Example:
There is no real protection in the following code, since no refcount is
taken on the session before rcu_read_unlock_bh() is called:
static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
{
	struct l2tp_net *pn = l2tp_pernet(net);
	struct hlist_head *session_list =
		l2tp_session_id_hash_2(pn, session_id);
	struct l2tp_session *session;
	struct hlist_node *walk;
	rcu_read_lock_bh();
	hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
		if (session->session_id == session_id) {
			rcu_read_unlock_bh();
			return session;
		}
	}
	rcu_read_unlock_bh();
	return NULL;
}
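One common way to close this kind of window is to take a reference on the object while still inside the read-side section, and to refuse objects whose count has already dropped to zero. The following is only a userspace sketch of that idea using C11 atomics; all names are hypothetical, the list is a plain singly linked list, and the comments mark where the RCU calls would sit in kernel code. It is not the fix that was eventually applied to l2tp.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical session object; refcnt == 0 means it is being freed. */
struct session_sketch {
	uint32_t session_id;
	atomic_int refcnt;
	struct session_sketch *next;
};

/* Take a reference unless the count already reached zero. */
static bool session_tryget(struct session_sketch *s)
{
	int old = atomic_load(&s->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&s->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Look up a session and return it with a reference held, or NULL. */
static struct session_sketch *session_find_get(struct session_sketch *head,
					       uint32_t session_id)
{
	struct session_sketch *s, *found = NULL;

	/* kernel code: rcu_read_lock_bh() would go here */
	for (s = head; s != NULL; s = s->next) {
		if (s->session_id == session_id) {
			if (session_tryget(s))
				found = s;
			break;
		}
	}
	/* kernel code: rcu_read_unlock_bh() would go here */
	return found;
}
```

The caller then owns one reference and has to drop it when done; the object can no longer disappear between the lookup and its use.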
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 8:19 UTC (permalink / raw)
To: FUJITA Tomonori, netdev@vger.kernel.org; +Cc: Eilon Greenstein
In-Reply-To: <20100402115054A.fujita.tomonori@lab.ntt.co.jp>
Why is it preferable? As far as I can see, the current patch does not introduce any functional change.
Is there a plan to remove the pci_map_X()/pci_alloc_consistent() family of functions in the future and replace it completely with dma_X() functions? Is it appropriate to use dma_X() everywhere pci_X() is used? For instance, we use DAC mode and, as far as I understand, we should use the pci_X() interface in this case. Is this rule no longer relevant?
So, if we don't need to use the pci_X() interface anymore, let's replace pci_X() properly with dma_X() functions all over bnx2x. If not, this patch mixes macros from one API (dma_X) with functions from another (pci_X()), which can hardly be called "preferable"...
Thanks,
vlad
> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Friday, April 02, 2010 5:57 AM
> To: netdev@vger.kernel.org
> Cc: Eilon Greenstein
> Subject: [PATCH] bnx2x: use the dma state API instead of the
> pci equivalents
>
> The DMA API is preferred.
>
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
> drivers/net/bnx2x.h | 4 ++--
> drivers/net/bnx2x_main.c | 28 ++++++++++++++--------------
> 2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do {
> \
>
> struct sw_rx_bd {
> struct sk_buff *skb;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>
> struct sw_rx_page {
> struct page *page;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 6c042a7..2a77611 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -1086,7 +1086,7 @@ static inline void
> bnx2x_free_rx_sge(struct bnx2x *bp,
> if (!page)
> return;
>
> - pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> + pci_unmap_page(bp->pdev, dma_unmap_addr(sw_buf, mapping),
> SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> __free_pages(page, PAGES_PER_SGE_SHIFT);
>
> @@ -1123,7 +1123,7 @@ static inline int
> bnx2x_alloc_rx_sge(struct bnx2x *bp,
> }
>
> sw_buf->page = page;
> - pci_unmap_addr_set(sw_buf, mapping, mapping);
> + dma_unmap_addr_set(sw_buf, mapping, mapping);
>
> sge->addr_hi = cpu_to_le32(U64_HI(mapping));
> sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1151,7 +1151,7 @@ static inline int
> bnx2x_alloc_rx_skb(struct bnx2x *bp,
> }
>
> rx_buf->skb = skb;
> - pci_unmap_addr_set(rx_buf, mapping, mapping);
> + dma_unmap_addr_set(rx_buf, mapping, mapping);
>
> rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1174,12 +1174,12 @@ static void bnx2x_reuse_rx_skb(struct
> bnx2x_fastpath *fp,
> struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>
> pci_dma_sync_single_for_device(bp->pdev,
> -
> pci_unmap_addr(cons_rx_buf, mapping),
> +
> dma_unmap_addr(cons_rx_buf, mapping),
> RX_COPY_THRESH,
> PCI_DMA_FROMDEVICE);
>
> prod_rx_buf->skb = cons_rx_buf->skb;
> - pci_unmap_addr_set(prod_rx_buf, mapping,
> - pci_unmap_addr(cons_rx_buf, mapping));
> + dma_unmap_addr_set(prod_rx_buf, mapping,
> + dma_unmap_addr(cons_rx_buf, mapping));
> *prod_bd = *cons_bd;
> }
>
> @@ -1285,7 +1285,7 @@ static void bnx2x_tpa_start(struct
> bnx2x_fastpath *fp, u16 queue,
> prod_rx_buf->skb = fp->tpa_pool[queue].skb;
> mapping = pci_map_single(bp->pdev,
> fp->tpa_pool[queue].skb->data,
> bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> - pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> + dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>
> /* move partial skb from cons to pool (don't unmap yet) */
> fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,7 +1361,7 @@ static int bnx2x_fill_frag_skb(struct
> bnx2x *bp, struct bnx2x_fastpath *fp,
> }
>
> /* Unmap the page as we r going to pass it to
> the stack */
> - pci_unmap_page(bp->pdev,
> pci_unmap_addr(&old_rx_pg, mapping),
> + pci_unmap_page(bp->pdev,
> dma_unmap_addr(&old_rx_pg, mapping),
> SGE_PAGE_SIZE*PAGES_PER_SGE,
> PCI_DMA_FROMDEVICE);
>
> /* Add one frag and update the appropriate
> fields in the skb */
> @@ -1389,7 +1389,7 @@ static void bnx2x_tpa_stop(struct bnx2x
> *bp, struct bnx2x_fastpath *fp,
> /* Unmap skb in the pool anyway, as we are going to change
> pool entry status to BNX2X_TPA_STOP even if new skb
> allocation
> fails. */
> - pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> + pci_unmap_single(bp->pdev, dma_unmap_addr(rx_buf, mapping),
> bp->rx_buf_size, PCI_DMA_FROMDEVICE);
>
> if (likely(new_skb)) {
> @@ -1621,7 +1621,7 @@ static int bnx2x_rx_int(struct
> bnx2x_fastpath *fp, int budget)
> }
>
> pci_dma_sync_single_for_device(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> + dma_unmap_addr(rx_buf, mapping),
> pad +
> RX_COPY_THRESH,
>
> PCI_DMA_FROMDEVICE);
> prefetch(skb);
> @@ -1666,7 +1666,7 @@ static int bnx2x_rx_int(struct
> bnx2x_fastpath *fp, int budget)
> } else
> if (likely(bnx2x_alloc_rx_skb(bp, fp,
> bd_prod) == 0)) {
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> + dma_unmap_addr(rx_buf, mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
> skb_reserve(skb, pad);
> @@ -4941,7 +4941,7 @@ static inline void
> bnx2x_free_tpa_pool(struct bnx2x *bp,
>
> if (fp->tpa_state[i] == BNX2X_TPA_START)
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf,
> mapping),
> + dma_unmap_addr(rx_buf,
> mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
>
> dev_kfree_skb(skb);
> @@ -4978,7 +4978,7 @@ static void bnx2x_init_rx_rings(struct
> bnx2x *bp)
> fp->disable_tpa = 1;
> break;
> }
> - pci_unmap_addr_set((struct sw_rx_bd *)
> + dma_unmap_addr_set((struct sw_rx_bd *)
>
> &bp->fp->tpa_pool[i],
> mapping, 0);
> fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -6907,7 +6907,7 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
> continue;
>
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf,
> mapping),
> + dma_unmap_addr(rx_buf,
> mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
>
> rx_buf->skb = NULL;
> --
> 1.7.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 8:35 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB8319B.3040209@iki.fi>
On Sun, Apr 04, 2010 at 09:28:43AM +0300, Timo Teräs wrote:
>
> No. The flow cache flush removal does not prevent bundle deletion.
> The flow cache flush is in current code *after* deleting the bundles
> from the policy. Freeing bundles and flushing cache are completely
> two separate things in current code. Only in 2/4 the bundle deletion
> becomes dependent on flow cache flush.
Ah yes I confused myself. The problem I thought would occur
doesn't because 1/4 doesn't put the bundles in the flow cache
yet.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: David Miller @ 2010-04-04 8:39 UTC (permalink / raw)
To: vladz; +Cc: fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDB@SJEXCHCCR02.corp.ad.broadcom.com>
From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 4 Apr 2010 01:19:07 -0700
> So, if we don't need to use pci_X() interface anymore, lets replace
> pci_X() properly all over the bnx2x with dma_X() functions.
I think Fujita's plan of gradual and partial transformations is
legitimate, and his changes shouldn't be rejected because he simply
isn't modifying all of the interfaces used by this driver but rather
just a specific subset he is trying to transform across the tree.
Please rescind your objections.
Thanks.
^ permalink raw reply
* Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
From: Michael S. Tsirkin @ 2010-04-04 8:55 UTC (permalink / raw)
To: David Stevens; +Cc: kvm, kvm-owner, netdev, rusty, virtualization
In-Reply-To: <OF9AF90021.62EF0EE4-ON882576F8.00618B3E-882576F8.0064F223@us.ibm.com>
On Thu, Apr 01, 2010 at 11:22:37AM -0700, David Stevens wrote:
> kvm-owner@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:
>
> > On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:
>
> > >
> > > > > + head.iov_base = (void
> *)vhost_get_vq_desc(&net->dev,
> > > vq,
> > > > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in,
> NULL,
> > >
> > > > > NULL);
> > > >
> > > > I find this casting confusing.
> > > > Is it really expensive to add an array of heads so that
> > > > we do not need to cast?
> > >
> > > It needs the heads and the lengths, which looks a lot
> > > like an iovec. I was trying to resist adding a new
> > > struct XXX { unsigned head; unsigned len; } just for this,
> > > but I could make these parallel arrays, one with head index and
> > > the other with length.
>
> Michael, on this one, if I add vq->heads as an argument to
> vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
> Would you rather this 1) remain an iovec (and a single arg added) but
> cast still there, 2) 2 arrays (head and length) and 2 args added, or
> 3) a new struct type of {unsigned,int} to carry for the heads+len
> instead of iovec?
> My preference would be 1). I agree the casts are ugly, but
> it is essentially an iovec the way we use it; it's just that the
> base isn't a pointer but a descriptor index instead.
I prefer 2 or 3. If you prefer 1 strongly, I think we should
add a detailed comment near the iovec, and
a couple of inline wrappers to store/get data in the iovec.
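As a rough illustration of option 1 with wrappers, here is a small userspace sketch. The helper names are hypothetical and not part of vhost; they only show how a descriptor head index could be stored in and read back from `iov_base` so that the casts live in one place.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical wrappers: the iovec is reused to carry a descriptor
 * head index (in iov_base) and a buffer length (in iov_len).
 */
static inline void heads_iov_set(struct iovec *iov, unsigned int head,
				 size_t len)
{
	iov->iov_base = (void *)(uintptr_t)head;	/* index, not a pointer */
	iov->iov_len = len;
}

static inline unsigned int heads_iov_head(const struct iovec *iov)
{
	return (unsigned int)(uintptr_t)iov->iov_base;
}

static inline size_t heads_iov_len(const struct iovec *iov)
{
	return iov->iov_len;
}
```

Options 2 and 3 avoid the cast entirely at the cost of extra arguments or a new struct type, which is the trade-off being discussed here.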
> > >
> > > EAGAIN is not possible after the change, because we don't
> > > even enter the loop unless we have an skb on the read queue; the
> > > other cases bomb out, so I figured the comment for future work is
> > > now done. :-)
> >
> > Guest could be buggy so we'll get EFAULT.
> > If skb is taken off the rx queue (as below), we might get EAGAIN.
>
> We break on any error. If we get EAGAIN because someone read
> on the socket, this code would break the loop, but EAGAIN is a more
> serious problem if it changed since we peeked (because it means
> someone else is reading the socket).
> But I don't understand -- are you suggesting that the error
> handling be different than that, or that the comment is still
> relevant?
> My intention here is to do the "TODO" from the comment
> so that it can be removed, by handling all error cases. I think
> because of the peek, EAGAIN isn't something to be ignored anymore,
> but the effect is the same whether we break out of the loop or
> not, since we retry the packet next time around. Essentially, we
> ignore every error since we will redo it with the same packet the
> next time around. Maybe we should print something here, but since
> we'll be retrying the packet that's still on the socket, a permanent
> error would spew continuously. Maybe we should shut down entirely
> if we get any negative return value here (including EAGAIN, since
> that tells us someone messed with the socket when we don't want them
> to).
> If you want the comment still there, ok, but I do think EAGAIN
> isn't a special case per the comment anymore, and is handled as all
> other errors are: by exiting the loop and retrying next time.
>
> +-DLS
Yes, I just think some comment should stay, as you say, because
otherwise we simply retry continuously. Maybe we should trigger vq_err.
It needs to be given some thought which I have not given it yet.
Thinking aloud: EAGAIN means someone is reading the socket
together with us. I would prefer that this condition be made a fatal
error; we should make sure we are polling the socket
so we see packets if more appear.
--
MST
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 9:15 UTC (permalink / raw)
To: David Miller
Cc: fujita.tomonori@lab.ntt.co.jp, netdev@vger.kernel.org,
Eilon Greenstein
In-Reply-To: <20100404.013920.120463402.davem@davemloft.net>
According to the changes in PCI-DMA-mapping.txt, it sounds like the trend is to stop using the pci_dma_* API and start using the dma_* API instead. Does this mean that using the pci_dma_* API is deprecated?
Thanks,
vlad
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, April 04, 2010 11:39 AM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; netdev@vger.kernel.org;
> Eilon Greenstein
> Subject: Re: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 4 Apr 2010 01:19:07 -0700
>
> > So, if we don't need to use pci_X() interface anymore, lets replace
> > pci_X() properly all over the bnx2x with dma_X() functions.
>
> I think Fujita's plan of gradual and partial transformations is
> legitimate, and his changes shouldn't be rejected because he simply
> isn't modifying all of the interfaces used by this driver but rather
> just a specific subset he is trying to transform across the tree.
>
> Please rescind your objections.
>
> Thanks.
>
>
^ permalink raw reply
* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04 9:16 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, davem, fujita.tomonori
In-Reply-To: <1270202304.1989.14.camel@edumazet-laptop>
On Fri, 02 Apr 2010 11:58:24 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e19cdae..c6b5206 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1801,7 +1801,7 @@ EXPORT_SYMBOL(netdev_rx_csum_fault);
> * 2. No high memory really exists on this machine.
> */
>
> -static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> +static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> {
> #ifdef CONFIG_HIGHMEM
> int i;
> @@ -1814,6 +1814,8 @@ static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> if (PCI_DMA_BUS_IS_PHYS) {
> struct device *pdev = dev->dev.parent;
>
> + if (!pdev)
> + return 0;
Sorry about that and thanks for the fix.
I think, if pdev is null, returning 1 here is safer since the device
doesn't set up dma info properly.
Do you know which device hits this bug? You said that you use bnx2 and
tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and it seems
that netdev->dev.parent is set up correctly.
^ permalink raw reply
* Re: [BUG] latest net-next-2.6 doesnt fly
From: Eric Dumazet @ 2010-04-04 9:29 UTC (permalink / raw)
To: FUJITA Tomonori; +Cc: netdev, davem
In-Reply-To: <20100404181542N.fujita.tomonori@lab.ntt.co.jp>
On Sunday, 04 April 2010 at 18:16 +0900, FUJITA Tomonori wrote:
> > + return 0;
>
> Sorry about that and thanks for the fix.
>
> I think, if pdev is null, returning 1 here is safer since the device
> doesn't set up dma info properly.
>
> Do you know what device hits this bug? You said that you use bnx2 and
> tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> that netdev->dev.parent is set up correctly.
> --
It might be because of my setup. I suspect I had two reasons to hit the
bug:
A bonding of eth2 (bnx2) and eth3 (tg3)
Then vlans on top of this bond0
When dev_queue_xmit() was first called, it was for a virtual device :)
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
qlen 1000
link/ether 00:1e:0b:ec:d3:dc brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq
master bond0 state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc hfsc
master bond0 state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
link/ether 00:1e:0b:92:78:51 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
7: vlan.103@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 100
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
8: vlan.825@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-04 10:03 UTC (permalink / raw)
To: vladz; +Cc: davem, fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDC@SJEXCHCCR02.corp.ad.broadcom.com>
On Sun, 4 Apr 2010 02:15:52 -0700
"Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> According to the changes in a PCI-DMA-mapping.txt it sounds like the
> trend is to stop using the pci_dma_* API and start using the dma_*
> API instead. Does this mean that using the pci_dma_* API is
> deprecated?
Sorry that I didn't put enough information in the patch
description.
In the long term, I want to remove the pci_dma_* API.
We used to have various bus-specific DMA APIs (pci, sbus, etc.). That was a
headache for driver writers who handle devices on multiple buses. So we
invented the generic DMA API long ago. Now we have only two bus-specific
APIs, pci and ssb, so I want to remove them and make sure that
driver writers are always able to use the generic DMA API with any
bus.
http://lwn.net/Articles/374137/
I don't plan to convert the whole tree from the PCI DMA API to the DMA API
all at once. I am converting the drivers gradually. I have already
removed the PCI DMA API from the docs under Documentation/. It would
be nice if some driver maintainers (or others) converted their drivers.
This patchset converts only the pci state API (such as
DECLARE_PCI_UNMAP_ADDR) because akpm complained about that API and I
promised him I would clean it up:
http://lkml.org/lkml/2010/2/12/415
I'll send a patch if you would like me to convert bnx2x to use the DMA
API completely.
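For readers unfamiliar with the state API, here is a userspace sketch of how such macros can work. The macro names match the ones used in the patch, but the definitions are simplified stand-ins written for illustration, and `SKETCH_NEED_DMA_MAP_STATE`, `dma_addr_t` as a `uint64_t`, and the `_sketch` struct and helpers are assumptions, not kernel code. The point is that the unmap address is stored only when the platform needs it, and otherwise the field and its accessors compile away.

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;		/* stand-in for the kernel type */

#define SKETCH_NEED_DMA_MAP_STATE 1	/* hypothetical config switch */

#if SKETCH_NEED_DMA_MAP_STATE
/* The unmap address must be remembered: declare and access a real field. */
#define DEFINE_DMA_UNMAP_ADDR(ADDR_NAME)	dma_addr_t ADDR_NAME
#define dma_unmap_addr(PTR, ADDR_NAME)		((PTR)->ADDR_NAME)
#define dma_unmap_addr_set(PTR, ADDR_NAME, VAL)	(((PTR)->ADDR_NAME) = (VAL))
#else
/* No state needed: no field, reads give 0, writes do nothing. */
#define DEFINE_DMA_UNMAP_ADDR(ADDR_NAME)
#define dma_unmap_addr(PTR, ADDR_NAME)		(0)
#define dma_unmap_addr_set(PTR, ADDR_NAME, VAL)	do { } while (0)
#endif

/* Modeled loosely on struct sw_rx_bd from the patch. */
struct sw_rx_bd_sketch {
	void *skb;
	DEFINE_DMA_UNMAP_ADDR(mapping);
};

/* Remember the buffer and the DMA address it was mapped at. */
static void rx_bd_remember(struct sw_rx_bd_sketch *bd, void *skb,
			   dma_addr_t mapping)
{
	bd->skb = skb;
	dma_unmap_addr_set(bd, mapping, mapping);
}

/* Fetch the address to hand to the unmap call later. */
static dma_addr_t rx_bd_mapping(const struct sw_rx_bd_sketch *bd)
{
	return dma_unmap_addr(bd, mapping);
}
```

Note that the new-style declaration takes a trailing semicolon at the use site, as the bnx2x.h hunk in the patch shows.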
^ permalink raw reply
* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04 10:19 UTC (permalink / raw)
To: eric.dumazet; +Cc: fujita.tomonori, netdev, davem
In-Reply-To: <1270373395.1971.13.camel@edumazet-laptop>
On Sun, 04 Apr 2010 11:29:55 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I think, if pdev is null, returning 1 here is safer since the device
> > doesn't set up dma info properly.
> >
> > Do you know what device hits this bug? You said that you use bnx2 and
> > tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> > that netdev->dev.parent is set up correctly.
> > --
>
> Might be because of my setup, I suspect I had two reasons to hit the
> bug :
>
> A bonding of eth2 (bnx2) and eth3 (tg3)
>
> Then vlans on top of this bond0
>
> When first dev_queue_xmit() was called, it was for a virtual device :)
Thanks! So it's due to bond or vlan (or both).
I guess that returning zero here with a null pdev is fine. If we
return 1, some people would probably complain about a performance
regression. Copying the DMA restriction info from the lower devices,
as the block layer does, can solve this problem but I guess that it's
over-engineering. My original patch doesn't loosen the DMA restriction
checking, so returning zero shouldn't break anything. Once we fix
the usage of NETIF_F_HIGHDMA in each driver and also check the usage
of netdev->dev.parent, everything should be fine.
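The decision discussed above can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the code in net/core/dev.c: the `_sketch` types, the flat array of fragment addresses, and the page-size constant are all hypothetical. It shows a device without a parent (bond, vlan) skipping the DMA mask check, while a device with a parent is checked against that parent's mask.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u		/* assumed page size */

/* Hypothetical stand-ins for struct device and struct net_device. */
struct device_sketch {
	const uint64_t *dma_mask;	/* NULL if DMA info was never set up */
};

struct net_device_sketch {
	const struct device_sketch *parent;	/* NULL for virtual devices */
};

/* Return 1 if any fragment lies beyond what the parent device can address. */
static int illegal_highdma_sketch(const struct net_device_sketch *dev,
				  const uint64_t *frag_addrs, size_t nr_frags)
{
	const struct device_sketch *pdev = dev->parent;
	size_t i;

	if (!pdev)
		return 0;	/* virtual device: nothing to check here */

	for (i = 0; i < nr_frags; i++) {
		if (!pdev->dma_mask ||
		    frag_addrs[i] + SKETCH_PAGE_SIZE - 1 > *pdev->dma_mask)
			return 1;
	}
	return 0;
}
```

Returning 1 for the parentless case instead would force the slow path on every packet sent through a virtual device, which is the performance concern mentioned above.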
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 10:24 UTC (permalink / raw)
To: FUJITA Tomonori
Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20100404190116N.fujita.tomonori@lab.ntt.co.jp>
OK, got it now. Thanks, Fujita. I think we should patch bnx2x to use the generic model (not just the mapping macros).
One last question: since which kernel version can the generic DMA layer be used instead of the PCI DMA layer?
Thanks,
vlad
> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Sunday, April 04, 2010 1:03 PM
> To: Vladislav Zolotarov
> Cc: davem@davemloft.net; fujita.tomonori@lab.ntt.co.jp;
> netdev@vger.kernel.org; Eilon Greenstein
> Subject: RE: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> On Sun, 4 Apr 2010 02:15:52 -0700
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
>
> > According to the changes in a PCI-DMA-mapping.txt it sounds like the
> > trend is to stop using the pci_dma_* API and start using the dma_*
> > API instead. Does this mean that using the pci_dma_* API is
> > deprecated?
>
> Sorry that I didn't put the enough information in the patch
> description.
>
> In the long term, I want to remove the pci_dma_* API.
>
> We had the various bus-specific DMA API (pci, sbus, etc). It was the
> headache for driver writers that handle multiple bus devices. So we
> invented the generic DMA API long ago. Now we have only two bus
> specific APIs: pci and ssb so I want to remove them and make sure that
> driver writers are always able to use the generic DMA API with any
> bus.
>
> http://lwn.net/Articles/374137/
>
>
> I don't plan to convert the whole tree to use the DMA API over the PCI
> DMA API all together. I convert the drivers gradually. I already
> removed the PCI DMA API from the docs under Documentation/. It would
> be nice if some driver maintainers (or others) convert their drivers.
>
> The patchset convert only the pci state API (such as
> DECLARE_PCI_UNMAP_ADDR) because akpm complained of the API and I
> promised him to clean up it:
>
> http://lkml.org/lkml/2010/2/12/415
>
>
> I'll send a patch if you like me to convert bnx2x to use the the DMA
> API completely.
>
^ permalink raw reply
* Re: UDP path MTU discovery
From: Andi Kleen @ 2010-04-04 10:25 UTC (permalink / raw)
To: Glen Turner; +Cc: Andi Kleen, Rick Jones, netdev
In-Reply-To: <1270186918.2119.27.camel@ilion>
On Fri, Apr 02, 2010 at 04:11:58PM +1030, Glen Turner wrote:
> On Thu, 2010-04-01 at 02:55 +0200, Andi Kleen wrote:
> > > What we need is an API for an instant notification that a ICMP Packet
> > > Too Big message has arrived concerning the socket.
> >
> > That already exists of course: IP_RECVERR
>
> Hi Andi,
>
> So what should I code? The suggested EMSGSIZE or your suggestion
> of grabbing all returning ICMP and parsing it? Noting that the
You don't need to parse any ICMPs; the kernel does that for you.
See the documentation of IP_RECVERR in ip(7). The MTU is in ee_info.
First you need to enable path MTU discovery for the socket
using IP_MTU_DISCOVER.
So you can either keep track of the MTU yourself based on the extended
errors coming out of IP_RECVERR, or ask the kernel using IP_MTU when
the socket is connected, or simply lower it when you see an EMSGSIZE. It's also
possible to do this with a dummy socket that gets connected/unconnected too.
> second choice is pretty ugly. That both seem specific to Linux is
> frustrating, but that is life -- adding support for an operating
> system seems to inevitably add #ifdefs for this sort of code.
Well, when the other OSes see the need, they will hopefully add similar
interfaces, with some luck even compatible with the ones in Linux.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 10:42 UTC (permalink / raw)
To: Timo Teras; +Cc: netdev
In-Reply-To: <1270126340-30181-2-git-send-email-timo.teras@iki.fi>
On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
>
> -extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
> - u8 dir, flow_resolve_t resolver);
> +struct flow_cache_entry_ops {
> + struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
> + int (*check)(struct flow_cache_entry_ops **);
> + void (*delete)(struct flow_cache_entry_ops **);
> +};
> +
> +typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
> + struct net *net, struct flowi *key, u16 family,
> + u8 dir, struct flow_cache_entry_ops **old_ops, void *ctx);
OK this bit really bugs me.
When I first looked at it, my reaction was why on earth are we
returning an ops pointer? Only after some digging around do I see
the fact that this ops pointer is in fact embedded in xfrm_policy.
How about embedding flow_cache_entry in xfrm_policy instead? Returning
flow_cache_entry * would make a lot more sense than a nested pointer
to flow_cache_entry_ops.
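To make the embedded ops pointer pattern concrete, here is a small userspace sketch. Every name in it is hypothetical (including the `container_of_sketch` macro, which mimics the kernel's `container_of`); it only illustrates why a pointer to the embedded ops field is enough to get back to the containing object, and is not the code from the patch.

```c
#include <stddef.h>

/* Recover the containing struct from a pointer to one of its members. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct flow_ops_sketch {
	/* Receives a pointer to the ops field embedded in the object. */
	int (*check)(struct flow_ops_sketch **ops);
};

/* Stand-in for xfrm_policy with the ops pointer embedded in it. */
struct policy_sketch {
	int dead;
	struct flow_ops_sketch *fc_ops;
};

/* Object-specific check: walk back from the ops field to the policy. */
static int policy_check(struct flow_ops_sketch **ops)
{
	struct policy_sketch *pol =
		container_of_sketch(ops, struct policy_sketch, fc_ops);

	return !pol->dead;
}

static struct flow_ops_sketch policy_ops = { .check = policy_check };

/* What a generic cache would do with the nested pointer it stored. */
static int cache_entry_valid(struct flow_ops_sketch **ops)
{
	return (*ops)->check(ops);
}
```

The cache only ever sees `struct flow_ops_sketch **`, which is compact but opaque to a reader, and that opacity is the readability concern raised above.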
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 10:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404104230.GA10368@gondor.apana.org.au>
Herbert Xu wrote:
> On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
>> -extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
>> - u8 dir, flow_resolve_t resolver);
>> +struct flow_cache_entry_ops {
>> + struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
>> + int (*check)(struct flow_cache_entry_ops **);
>> + void (*delete)(struct flow_cache_entry_ops **);
>> +};
>> +
>> +typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
>> + struct net *net, struct flowi *key, u16 family,
>> + u8 dir, struct flow_cache_entry_ops **old_ops, void *ctx);
>
> OK this bit really bugs me.
>
> When I first looked at it, my reaction was why on earth are we
> returning an ops pointer? Only after some digging around do I see
> the fact that this ops pointer is in fact embedded in xfrm_policy.
>
> How about embedding flow_cache_entry in xfrm_policy instead? Returning
> flow_cache_entry * would make a lot more sense than a nested pointer
> to flow_cache_entry_ops.
Because flow_cache_entry is per-cpu, and multiple entries (due to
different flows matching the same policies, or the same flow having multiple
per-cpu entries) can point to the same policy. If we cached "dummy" objects
even for policies, then this would be a better approach.
This would actually make sense, since it'd be useful to cache all
policies involved in the check path (main + sub policy refs). In which
case we might want to make the ops 'per flow cache instance' instead
of 'per cache entry'.
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 11:00 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB86EE8.5090203@iki.fi>
On Sun, Apr 04, 2010 at 01:50:16PM +0300, Timo Teräs wrote:
>
> Because flow_cache_entry is per-cpu, and multiple entries (due to
> different flows matching the same policies, or the same flow having
> multiple per-cpu entries) can point to the same policy. If we cached
> "dummy" objects even for policies, then this would be a better approach.
Oh yes of course.
But what we could do is embed most of flow_cache_entry into
xfrm_policy (and xdst in your latter patches) along with the
ops pointer.
Like this:
struct flow_cache_object {
u16 family;
u8 dir;
u32 genid;
struct flowi key;
struct flow_cache_ops **ops;
};
struct flow_cache_entry {
struct flow_cache_entry *next;
struct flow_cache_object *obj;
};
struct xfrm_policy {
struct flow_cache_object flo;
...
};
What do you think?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 11:06 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404110014.GA10864@gondor.apana.org.au>
Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 01:50:16PM +0300, Timo Teräs wrote:
>> Because flow_cache_entry is per-cpu, and multiple entries (due to
>> different flows matching the same policies, or the same flow having
>> multiple per-cpu entries) can point to the same policy. If we cached
>> "dummy" objects even for policies, then this would be a better approach.
>
> Oh yes of course.
>
> But what we could do is embed most of flow_cache_entry into
> xfrm_policy (and xdst in your latter patches) along with the
> ops pointer.
>
> Like this:
>
> struct flow_cache_object {
> u16 family;
> u8 dir;
> u32 genid;
> struct flowi key;
> struct flow_cache_ops **ops;
> };
>
> struct flow_cache_entry {
> struct flow_cache_entry *next;
> struct flow_cache_object *obj;
> };
>
> struct xfrm_policy {
> struct flow_cache_object flo;
> ...
> };
>
> What do you think?
It would still not work for policies. For every policy X we
can get N+1 different matches with separate struct flowi contents.
It's not possible to put a single struct flowi or any other of
the flow details into xfrm_policy. It's an N-to-1 mapping, not
a 1-to-1 mapping.
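
A minimal userspace sketch of the N-to-1 relationship described above. The
types here (struct policy, struct flow_key, struct cache_entry) are
hypothetical, heavily simplified stand-ins and not the kernel definitions;
the only point is that the per-flow key has to live in the cache entry,
because several entries can resolve to one shared policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel structures. */
struct policy {
	int refcnt;                 /* number of cache entries using this policy */
};

struct flow_key {
	unsigned int saddr, daddr;  /* simplified stand-in for struct flowi */
};

struct cache_entry {
	struct flow_key key;        /* per-flow data stays in the entry */
	struct policy *pol;         /* many entries may share one policy */
};

/* Bind a cache entry to a flow key and a (possibly shared) policy. */
void entry_bind(struct cache_entry *e, struct flow_key key, struct policy *pol)
{
	e->key = key;
	e->pol = pol;
	pol->refcnt++;
}

/* Count the entries in a table that resolve to the given policy. */
size_t entries_for_policy(const struct cache_entry *tbl, size_t n,
			  const struct policy *pol)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (tbl[i].pol == pol)
			hits++;
	return hits;
}
```

Binding three entries with different keys to one policy leaves the policy
with three references and three distinct keys, which is why a single key
field inside the policy could not describe all of them.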
* Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
From: Michael S. Tsirkin @ 2010-04-04 11:14 UTC (permalink / raw)
To: Sridhar Samudrala; +Cc: Tom Lendacky, netdev, kvm@vger.kernel.org
In-Reply-To: <1270229480.13897.8.camel@w-sridhar.beaverton.ibm.com>
On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
> Make vhost scalable by creating a separate vhost thread per vhost
> device. This provides better scaling across multiple guests and with
> multiple interfaces in a guest.
Thanks for looking into this. An alternative approach is
to simply replace create_singlethread_workqueue with
create_workqueue, which would get us a thread per host CPU.

In theory this should be the optimal approach with respect to
CPU locality; in practice, however, a single thread seems to get
better numbers. I have a TODO to investigate this.
Could you try looking into this?
>
> I am seeing better aggregated throughput/latency when running netperf
> across multiple guests or multiple interfaces in a guest in parallel
> with this patch.
Any numbers? What happens to CPU utilization?
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index a6a88df..29aa80f 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> return r;
> }
>
> - vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> + vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> + &n->dev);
> + vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> + &n->dev);
> n->tx_poll_state = VHOST_NET_POLL_DISABLED;
>
> f->private_data = n;
> @@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
>
> int vhost_net_init(void)
> {
> - int r = vhost_init();
> - if (r)
> - goto err_init;
> - r = misc_register(&vhost_net_misc);
> - if (r)
> - goto err_reg;
> - return 0;
> -err_reg:
> - vhost_cleanup();
> -err_init:
> - return r;
> -
> + return misc_register(&vhost_net_misc);
> }
> +
> module_init(vhost_net_init);
>
> void vhost_net_exit(void)
> {
> misc_deregister(&vhost_net_misc);
> - vhost_cleanup();
> }
> module_exit(vhost_net_exit);
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7bd7a1e..243f4d3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -36,8 +36,6 @@ enum {
> VHOST_MEMORY_F_LOG = 0x1,
> };
>
> -static struct workqueue_struct *vhost_workqueue;
> -
> static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
> poll_table *pt)
> {
> @@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> if (!((unsigned long)key & poll->mask))
> return 0;
>
> - queue_work(vhost_workqueue, &poll->work);
> + queue_work(poll->dev->wq, &poll->work);
> return 0;
> }
>
> /* Init poll structure */
> void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> - unsigned long mask)
> + unsigned long mask, struct vhost_dev *dev)
> {
> INIT_WORK(&poll->work, func);
> init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> init_poll_funcptr(&poll->table, vhost_poll_func);
> poll->mask = mask;
> + poll->dev = dev;
> }
>
> /* Start polling a file. We add ourselves to file's wait queue. The caller must
> @@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
>
> void vhost_poll_queue(struct vhost_poll *poll)
> {
> - queue_work(vhost_workqueue, &poll->work);
> + queue_work(poll->dev->wq, &poll->work);
> }
>
> static void vhost_vq_reset(struct vhost_dev *dev,
> @@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
> struct vhost_virtqueue *vqs, int nvqs)
> {
> int i;
> +
> + dev->wq = create_singlethread_workqueue("vhost");
> + if (!dev->wq)
> + return -ENOMEM;
> +
> dev->vqs = vqs;
> dev->nvqs = nvqs;
> mutex_init(&dev->mutex);
> @@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
> if (dev->vqs[i].handle_kick)
> vhost_poll_init(&dev->vqs[i].poll,
> dev->vqs[i].handle_kick,
> - POLLIN);
> + POLLIN, dev);
> }
> return 0;
> }
> @@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> if (dev->mm)
> mmput(dev->mm);
> dev->mm = NULL;
> +
> + destroy_workqueue(dev->wq);
> }
>
> static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> @@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
> vq_err(vq, "Failed to enable notification at %p: %d\n",
> &vq->used->flags, r);
> }
> -
> -int vhost_init(void)
> -{
> - vhost_workqueue = create_singlethread_workqueue("vhost");
> - if (!vhost_workqueue)
> - return -ENOMEM;
> - return 0;
> -}
> -
> -void vhost_cleanup(void)
> -{
> - destroy_workqueue(vhost_workqueue);
> -}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 44591ba..60fefd0 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -29,10 +29,11 @@ struct vhost_poll {
> /* struct which will handle all actual work. */
> struct work_struct work;
> unsigned long mask;
> + struct vhost_dev *dev;
> };
>
> void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> - unsigned long mask);
> + unsigned long mask, struct vhost_dev *dev);
> void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> void vhost_poll_stop(struct vhost_poll *poll);
> void vhost_poll_flush(struct vhost_poll *poll);
> @@ -110,6 +111,7 @@ struct vhost_dev {
> int nvqs;
> struct file *log_file;
> struct eventfd_ctx *log_ctx;
> + struct workqueue_struct *wq;
> };
>
> long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> @@ -136,9 +138,6 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
> int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> unsigned int log_num, u64 len);
>
> -int vhost_init(void);
> -void vhost_cleanup(void);
> -
> #define vq_err(vq, fmt, ...) do { \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
> if ((vq)->error_ctx) \
>
>
>
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 11:26 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB872CF.2030202@iki.fi>
On Sun, Apr 04, 2010 at 02:06:55PM +0300, Timo Teräs wrote:
>
> It would still not work for policies. For every policy X we
> can get N+1 different matches with separate struct flowi contents.
> It's not possible to put a single struct flowi or any other of
> the flow details into xfrm_policy. It's an N-to-1 mapping, not
> a 1-to-1 mapping.
Fine, move key into flow_cache_entry but the rest should still
work, no?
struct flow_cache_object {
u16 family;
u8 dir;
u32 genid;
struct flow_cache_ops *ops;
};
struct flow_cache_entry {
struct flow_cache_entry *next;
struct flowi key;
struct flow_cache_object *obj;
};
struct xfrm_policy {
struct flow_cache_object flo;
...
};
Cheers,
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 11:31 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <20100404112636.GA11061@gondor.apana.org.au>
On Sun, Apr 04, 2010 at 07:26:36PM +0800, Herbert Xu wrote:
>
> Fine, move key into flow_cache_entry but the rest should still
> work, no?
>
> struct flow_cache_object {
> u16 family;
> u8 dir;
> u32 genid;
> struct flow_cache_ops *ops;
> };
>
> struct flow_cache_entry {
> struct flow_cache_entry *next;
> struct flowi key;
> struct flow_cache_object *obj;
> };
>
> struct xfrm_policy {
> struct flow_cache_object flo;
> ...
> };
OK this doesn't work either as we still have NULL objects for
now. But I still think even if the ops pointer is the only
member in flow_cache_object, it looks better than returning the
nested ops pointer directly from flow_cache_lookup.
Cheers,
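
A minimal userspace sketch of the shape suggested above: a
flow_cache_object whose only member is the ops pointer, embedded in the
policy, so that a lookup can return the object and the owner is recovered
with container_of(). Everything here is an assumption for illustration:
struct xfrm_policy_sketch, its "dead" field, policy_check() and
policy_to_flo() are hypothetical names, and the single check() method
only loosely mirrors the ops discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct flow_cache_object;

struct flow_cache_ops {
	int (*check)(struct flow_cache_object *flo);
};

/* The object handed out by the cache: just the ops pointer. */
struct flow_cache_object {
	const struct flow_cache_ops *ops;
};

/* Hypothetical, heavily reduced policy; only the embedding matters. */
struct xfrm_policy_sketch {
	int dead;
	struct flow_cache_object flo;
};

/* check() recovers the owning policy from the embedded object. */
int policy_check(struct flow_cache_object *flo)
{
	struct xfrm_policy_sketch *pol =
		container_of(flo, struct xfrm_policy_sketch, flo);

	return !pol->dead;
}

const struct flow_cache_ops policy_ops = { .check = policy_check };

/* What a resolver would return: the embedded object, not a nested ops pointer. */
struct flow_cache_object *policy_to_flo(struct xfrm_policy_sketch *pol)
{
	pol->flo.ops = &policy_ops;
	return &pol->flo;
}
```

The caller only ever sees a struct flow_cache_object pointer and calls
flo->ops->check(flo); the double-pointer type disappears from the
interface while the memory layout stays the same.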
* Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: Michael S. Tsirkin @ 2010-04-04 11:40 UTC (permalink / raw)
To: Xin, Xiaohui
Cc: netdev@vger.kernel.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, mingo@elte.hu, jdike@addtoit.com
In-Reply-To: <97F6D3BD476C464182C1B7BABF0B0AF5C17B55B5@shzsmsx502.ccr.corp.intel.com>
On Fri, Apr 02, 2010 at 10:16:16AM +0800, Xin, Xiaohui wrote:
>
> >> For the write logging, do you have a function at hand that we can
> >> use to recompute the log? If so, I think I can use it to recompute
> >> the log info when logging is suddenly enabled.
> >> For the outstanding requests, do you mean all the user buffers that
> >> were submitted before the logging ioctl changed? That may be a lot,
> >> and some of them are still in NIC ring descriptors. Waiting for them
> >> to finish may take some time. I think it is also reasonable for the
> >> logging change to take effect just after the logging ioctl changes.
>
> >The key point is that after the logging ioctl returns, any
> >subsequent change to memory must be logged. It does not
> >matter when the request was submitted; otherwise we will
> >get memory corruption on migration.
>
> The change to memory happens in vhost_add_used_and_signal(), right?
> So after the ioctl returns, it is enough to recompute the log info for
> the events in the async queue, since the ioctl and the write-log
> operations are both protected by vq->mutex.
>
> Thanks
> Xiaohui
Yes, I think this will work.
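
A minimal userspace model of the rule agreed on above: whether a write is
logged is decided when the request completes (when guest memory actually
changes), not when it was submitted. The names struct async_req,
struct vq_model, set_logging() and complete_req() are hypothetical and
are not vhost code; in the real driver both the ioctl and the completion
path would run under vq->mutex, which this single-threaded sketch leaves
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an outstanding async request. */
struct async_req {
	size_t len;        /* bytes written to guest memory on completion */
	bool logged;       /* set if the write was recorded in the dirty log */
};

/* Hypothetical model of the per-virtqueue logging state. */
struct vq_model {
	bool log_enabled;  /* toggled by the modelled logging ioctl */
	size_t logged_bytes;
};

/* Modelled ioctl: switches write logging on or off. */
void set_logging(struct vq_model *vq, bool on)
{
	vq->log_enabled = on;
}

/*
 * Completion handler: the logging decision is taken here, at the moment
 * the memory change becomes visible, regardless of submission time.
 */
void complete_req(struct vq_model *vq, struct async_req *req)
{
	req->logged = vq->log_enabled;
	if (req->logged)
		vq->logged_bytes += req->len;
}
```

A request submitted while logging was off but completed after
set_logging(vq, true) is therefore still logged, which is the property
migration needs.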
> > Thanks
> > Xiaohui
> >
> > drivers/vhost/net.c | 189 +++++++++++++++++++++++++++++++++++++++++++++++--
> > drivers/vhost/vhost.h | 10 +++
> > 2 files changed, 192 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 22d5fef..2aafd90 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -17,11 +17,13 @@
> > #include <linux/workqueue.h>
> > #include <linux/rcupdate.h>
> > #include <linux/file.h>
> > +#include <linux/aio.h>
> >
> > #include <linux/net.h>
> > #include <linux/if_packet.h>
> > #include <linux/if_arp.h>
> > #include <linux/if_tun.h>
> > +#include <linux/mpassthru.h>
> >
> > #include <net/sock.h>
> >
> > @@ -47,6 +49,7 @@ struct vhost_net {
> > struct vhost_dev dev;
> > struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
> > struct vhost_poll poll[VHOST_NET_VQ_MAX];
> > + struct kmem_cache *cache;
> > /* Tells us whether we are polling a socket for TX.
> > * We only do this when socket buffer fills up.
> > * Protected by tx vq lock. */
> > @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
> > net->tx_poll_state = VHOST_NET_POLL_STARTED;
> > }
> >
> > +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
> > +{
> > + struct kiocb *iocb = NULL;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&vq->notify_lock, flags);
> > + if (!list_empty(&vq->notifier)) {
> > + iocb = list_first_entry(&vq->notifier,
> > + struct kiocb, ki_list);
> > + list_del(&iocb->ki_list);
> > + }
> > + spin_unlock_irqrestore(&vq->notify_lock, flags);
> > + return iocb;
> > +}
> > +
> > +static void handle_async_rx_events_notify(struct vhost_net *net,
> > + struct vhost_virtqueue *vq)
> > +{
> > + struct kiocb *iocb = NULL;
> > + struct vhost_log *vq_log = NULL;
> > + int rx_total_len = 0;
> > + int log, size;
> > +
> > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > + return;
> > +
> > + if (vq->receiver)
> > + vq->receiver(vq);
> > +
> > + vq_log = unlikely(vhost_has_feature(
> > + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
> > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > + vhost_add_used_and_signal(&net->dev, vq,
> > + iocb->ki_pos, iocb->ki_nbytes);
> > + log = (int)iocb->ki_user_data;
> > + size = iocb->ki_nbytes;
> > + rx_total_len += iocb->ki_nbytes;
> > +
> > + if (iocb->ki_dtor)
> > + iocb->ki_dtor(iocb);
> > + kmem_cache_free(net->cache, iocb);
> > +
> > + if (unlikely(vq_log))
> > + vhost_log_write(vq, vq_log, log, size);
> > + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
> > + vhost_poll_queue(&vq->poll);
> > + break;
> > + }
> > + }
> > +}
> > +
> > +static void handle_async_tx_events_notify(struct vhost_net *net,
> > + struct vhost_virtqueue *vq)
> > +{
> > + struct kiocb *iocb = NULL;
> > + int tx_total_len = 0;
> > +
> > + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
> > + return;
> > +
> > + while ((iocb = notify_dequeue(vq)) != NULL) {
> > + vhost_add_used_and_signal(&net->dev, vq,
> > + iocb->ki_pos, 0);
> > + tx_total_len += iocb->ki_nbytes;
> > +
> > + if (iocb->ki_dtor)
> > + iocb->ki_dtor(iocb);
> > +
> > + kmem_cache_free(net->cache, iocb);
> > + if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
> > + vhost_poll_queue(&vq->poll);
> > + break;
> > + }
> > + }
> > +}
> > +
> > /* Expects to be always run from workqueue - which acts as
> > * read-size critical section for our kind of RCU. */
> > static void handle_tx(struct vhost_net *net)
> > {
> > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> > + struct kiocb *iocb = NULL;
> > unsigned head, out, in, s;
> > struct msghdr msg = {
> > .msg_name = NULL,
> > @@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
> > tx_poll_stop(net);
> > hdr_size = vq->hdr_size;
> >
> > + handle_async_tx_events_notify(net, vq);
> > +
> > for (;;) {
> > head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > ARRAY_SIZE(vq->iov),
> > @@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
> > /* Skip header. TODO: support TSO. */
> > s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
> > msg.msg_iovlen = out;
> > +
> > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > + iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > + if (!iocb)
> > + break;
> > + iocb->ki_pos = head;
> > + iocb->private = (void *)vq;
> > + }
> > +
> > len = iov_length(vq->iov, out);
> > /* Sanity check */
> > if (!len) {
> > @@ -160,12 +251,16 @@ static void handle_tx(struct vhost_net *net)
> > break;
> > }
> > /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > - err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > + err = sock->ops->sendmsg(iocb, sock, &msg, len);
> > if (unlikely(err < 0)) {
> > vhost_discard_vq_desc(vq);
> > tx_poll_start(net, sock);
> > break;
> > }
> > +
> > + if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > + continue;
> > +
> > if (err != len)
> > pr_err("Truncated TX packet: "
> > " len %d != %zd\n", err, len);
> > @@ -177,6 +272,8 @@ static void handle_tx(struct vhost_net *net)
> > }
> > }
> >
> > + handle_async_tx_events_notify(net, vq);
> > +
> > mutex_unlock(&vq->mutex);
> > unuse_mm(net->dev.mm);
> > }
> > @@ -186,6 +283,7 @@ static void handle_tx(struct vhost_net *net)
> > static void handle_rx(struct vhost_net *net)
> > {
> > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > + struct kiocb *iocb = NULL;
> > unsigned head, out, in, log, s;
> > struct vhost_log *vq_log;
> > struct msghdr msg = {
> > @@ -206,7 +304,8 @@ static void handle_rx(struct vhost_net *net)
> > int err;
> > size_t hdr_size;
> > struct socket *sock = rcu_dereference(vq->private_data);
> > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > + if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
> > + vq->link_state == VHOST_VQ_LINK_SYNC))
> > return;
> >
> > use_mm(net->dev.mm);
> > @@ -214,9 +313,18 @@ static void handle_rx(struct vhost_net *net)
> > vhost_disable_notify(vq);
> > hdr_size = vq->hdr_size;
> >
> > - vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > + /* In async cases, for write logging, the simple way is to get
> > + * the log info always, and really logging is decided later.
> > + * Thus, when logging enabled, we can get log, and when logging
> > + * disabled, we can get log disabled accordingly.
> > + */
> > +
> > + vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
> > + (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
> > vq->log : NULL;
> >
> > + handle_async_rx_events_notify(net, vq);
> > +
> > for (;;) {
> > head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > ARRAY_SIZE(vq->iov),
> > @@ -245,6 +353,14 @@ static void handle_rx(struct vhost_net *net)
> > s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
> > msg.msg_iovlen = in;
> > len = iov_length(vq->iov, in);
> > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > + iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
> > + if (!iocb)
> > + break;
> > + iocb->private = vq;
> > + iocb->ki_pos = head;
> > + iocb->ki_user_data = log;
> > + }
> > /* Sanity check */
> > if (!len) {
> > vq_err(vq, "Unexpected header len for RX: "
> > @@ -252,13 +368,18 @@ static void handle_rx(struct vhost_net *net)
> > iov_length(vq->hdr, s), hdr_size);
> > break;
> > }
> > - err = sock->ops->recvmsg(NULL, sock, &msg,
> > +
> > + err = sock->ops->recvmsg(iocb, sock, &msg,
> > len, MSG_DONTWAIT | MSG_TRUNC);
> > /* TODO: Check specific error and bomb out unless EAGAIN? */
> > if (err < 0) {
> > vhost_discard_vq_desc(vq);
> > break;
> > }
> > +
> > + if (vq->link_state == VHOST_VQ_LINK_ASYNC)
> > + continue;
> > +
> > /* TODO: Should check and handle checksum. */
> > if (err > len) {
> > pr_err("Discarded truncated rx packet: "
> > @@ -284,10 +405,13 @@ static void handle_rx(struct vhost_net *net)
> > }
> > }
> >
> > + handle_async_rx_events_notify(net, vq);
> > +
> > mutex_unlock(&vq->mutex);
> > unuse_mm(net->dev.mm);
> > }
> >
> > +
> > static void handle_tx_kick(struct work_struct *work)
> > {
> > struct vhost_virtqueue *vq;
> > @@ -338,6 +462,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> > vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
> > vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
> > n->tx_poll_state = VHOST_NET_POLL_DISABLED;
> > + n->cache = NULL;
> > return 0;
> > }
> >
> > @@ -398,6 +523,17 @@ static void vhost_net_flush(struct vhost_net *n)
> > vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
> > }
> >
> > +static void vhost_notifier_cleanup(struct vhost_net *n)
> > +{
> > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
> > + struct kiocb *iocb = NULL;
> > + if (n->cache) {
> > + while ((iocb = notify_dequeue(vq)) != NULL)
> > + kmem_cache_free(n->cache, iocb);
> > + kmem_cache_destroy(n->cache);
> > + }
> > +}
> > +
> > static int vhost_net_release(struct inode *inode, struct file *f)
> > {
> > struct vhost_net *n = f->private_data;
> > @@ -414,6 +550,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
> > /* We do an extra flush before freeing memory,
> > * since jobs can re-queue themselves. */
> > vhost_net_flush(n);
> > + vhost_notifier_cleanup(n);
> > kfree(n);
> > return 0;
> > }
> > @@ -462,7 +599,19 @@ static struct socket *get_tun_socket(int fd)
> > return sock;
> > }
> >
> > -static struct socket *get_socket(int fd)
> > +static struct socket *get_mp_socket(int fd)
> > +{
> > + struct file *file = fget(fd);
> > + struct socket *sock;
> > + if (!file)
> > + return ERR_PTR(-EBADF);
> > + sock = mp_get_socket(file);
> > + if (IS_ERR(sock))
> > + fput(file);
> > + return sock;
> > +}
> > +
> > +static struct socket *get_socket(struct vhost_virtqueue *vq, int fd)
> > {
> > struct socket *sock;
> > if (fd == -1)
> > @@ -473,9 +622,31 @@ static struct socket *get_socket(int fd)
> > sock = get_tun_socket(fd);
> > if (!IS_ERR(sock))
> > return sock;
> > + sock = get_mp_socket(fd);
> > + if (!IS_ERR(sock)) {
> > + vq->link_state = VHOST_VQ_LINK_ASYNC;
> > + return sock;
> > + }
> > return ERR_PTR(-ENOTSOCK);
> > }
> >
> > +static void vhost_init_link_state(struct vhost_net *n, int index)
> > +{
> > + struct vhost_virtqueue *vq = n->vqs + index;
> > +
> > + WARN_ON(!mutex_is_locked(&vq->mutex));
> > + if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
> > + vq->receiver = NULL;
> > + INIT_LIST_HEAD(&vq->notifier);
> > + spin_lock_init(&vq->notify_lock);
> > + if (!n->cache) {
> > + n->cache = kmem_cache_create("vhost_kiocb",
> > + sizeof(struct kiocb), 0,
> > + SLAB_HWCACHE_ALIGN, NULL);
> > + }
> > + }
> > +}
> > +
> > static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > {
> > struct socket *sock, *oldsock;
> > @@ -493,12 +664,15 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > }
> > vq = n->vqs + index;
> > mutex_lock(&vq->mutex);
> > - sock = get_socket(fd);
> > + vq->link_state = VHOST_VQ_LINK_SYNC;
> > + sock = get_socket(vq, fd);
> > if (IS_ERR(sock)) {
> > r = PTR_ERR(sock);
> > goto err;
> > }
> >
> > + vhost_init_link_state(n, index);
> > +
> > /* start polling new socket */
> > oldsock = vq->private_data;
> > if (sock == oldsock)
> > @@ -507,8 +681,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> > vhost_net_disable_vq(n, vq);
> > rcu_assign_pointer(vq->private_data, sock);
> > vhost_net_enable_vq(n, vq);
> > - mutex_unlock(&vq->mutex);
> > done:
> > + mutex_unlock(&vq->mutex);
> > mutex_unlock(&n->dev.mutex);
> > if (oldsock) {
> > vhost_net_flush_vq(n, index);
> > @@ -516,6 +690,7 @@ done:
> > }
> > return r;
> > err:
> > + mutex_unlock(&vq->mutex);
> > mutex_unlock(&n->dev.mutex);
> > return r;
> > }
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index d1f0453..cffe39a 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -43,6 +43,11 @@ struct vhost_log {
> > u64 len;
> > };
> >
> > +enum vhost_vq_link_state {
> > + VHOST_VQ_LINK_SYNC = 0,
> > + VHOST_VQ_LINK_ASYNC = 1,
> > +};
> > +
> > /* The virtqueue structure describes a queue attached to a device. */
> > struct vhost_virtqueue {
> > struct vhost_dev *dev;
> > @@ -96,6 +101,11 @@ struct vhost_virtqueue {
> > /* Log write descriptors */
> > void __user *log_base;
> > struct vhost_log log[VHOST_NET_MAX_SG];
> > + /* Differentiate async socket for 0-copy from normal */
> > + enum vhost_vq_link_state link_state;
> > + struct list_head notifier;
> > + spinlock_t notify_lock;
> > + void (*receiver)(struct vhost_virtqueue *);
> > };
> >
> > struct vhost_dev {
> > --
> > 1.5.4.4
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-04 11:51 UTC (permalink / raw)
To: vladz; +Cc: fujita.tomonori, davem, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDD@SJEXCHCCR02.corp.ad.broadcom.com>
On Sun, 4 Apr 2010 03:24:46 -0700
"Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> Ok. Got it now. Thanks, Fujita. I think we should patch the bnx2x to
> use the generic model (not just the mapping macros).
I've attached the patch.
There is one functional change: pci_alloc_consistent ->
dma_alloc_coherent.

pci_alloc_consistent is a wrapper around dma_alloc_coherent that
passes the GFP_ATOMIC flag (see include/asm-generic/pci-dma-compat.h).
It uses GFP_ATOMIC for compatibility with some broken drivers that
call the function in interrupt context. But GFP_ATOMIC should be
avoided if possible. It looks like bnx2x doesn't call
pci_alloc_consistent in interrupt context, so I replaced those calls
with dma_alloc_coherent using GFP_KERNEL.

Please check if that change works for bnx2x.
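
A minimal userspace model of the wrapper relationship described above. It
is not kernel code: the model_ functions, enum model_gfp and
last_gfp_used are hypothetical names, and malloc() merely stands in for
the real coherent allocator. It only shows that the PCI helper fixes the
allocation flag to "atomic", while a direct caller of the generic
function can pick the flag that matches its context.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's GFP_KERNEL / GFP_ATOMIC. */
enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_ATOMIC };

/* Records the flag seen by the most recent allocation, for inspection. */
enum model_gfp last_gfp_used;

/* Models dma_alloc_coherent(): the caller chooses the allocation context. */
void *model_dma_alloc_coherent(size_t size, enum model_gfp gfp)
{
	last_gfp_used = gfp;
	return malloc(size);
}

/* Models pci_alloc_consistent(): same allocation, flag fixed to atomic. */
void *model_pci_alloc_consistent(size_t size)
{
	return model_dma_alloc_coherent(size, MODEL_GFP_ATOMIC);
}
```

Converting a call site therefore means choosing the flag explicitly; for
code that may sleep, that is the GFP_KERNEL choice made in the patch.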
> One last question: since which kernel version can the generic DMA
> layer be used instead of the PCI DMA layer?
After 2.6.34-rc2.

Well, on the majority of architectures, you have long been able to use
the generic DMA API instead of the PCI DMA API. The PCI DMA API is just
a wrapper around the generic DMA API. But on some architectures, the two
APIs worked slightly differently. Since 2.6.34-rc2, the two APIs work in
exactly the same way on all architectures.
=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] bnx2x: use the DMA API instead of the pci equivalents
The DMA API is preferred.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
drivers/net/bnx2x.h | 4 +-
drivers/net/bnx2x_main.c | 110 +++++++++++++++++++++++----------------------
2 files changed, 58 insertions(+), 56 deletions(-)
diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
index 3c48a7a..ae9c89e 100644
--- a/drivers/net/bnx2x.h
+++ b/drivers/net/bnx2x.h
@@ -163,7 +163,7 @@ do { \
struct sw_rx_bd {
struct sk_buff *skb;
- DECLARE_PCI_UNMAP_ADDR(mapping)
+ DEFINE_DMA_UNMAP_ADDR(mapping);
};
struct sw_tx_bd {
@@ -176,7 +176,7 @@ struct sw_tx_bd {
struct sw_rx_page {
struct page *page;
- DECLARE_PCI_UNMAP_ADDR(mapping)
+ DEFINE_DMA_UNMAP_ADDR(mapping);
};
union db_prod {
diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index fa9275c..63a17d6 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -842,7 +842,7 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
/* unmap first bd */
DP(BNX2X_MSG_OFF, "free bd_idx %d\n", bd_idx);
tx_start_bd = &fp->tx_desc_ring[bd_idx].start_bd;
- pci_unmap_single(bp->pdev, BD_UNMAP_ADDR(tx_start_bd),
+ dma_unmap_single(&bp->pdev->dev, BD_UNMAP_ADDR(tx_start_bd),
BD_UNMAP_LEN(tx_start_bd), PCI_DMA_TODEVICE);
nbd = le16_to_cpu(tx_start_bd->nbd) - 1;
@@ -872,8 +872,8 @@ static u16 bnx2x_free_tx_pkt(struct bnx2x *bp, struct bnx2x_fastpath *fp,
DP(BNX2X_MSG_OFF, "free frag bd_idx %d\n", bd_idx);
tx_data_bd = &fp->tx_desc_ring[bd_idx].reg_bd;
- pci_unmap_page(bp->pdev, BD_UNMAP_ADDR(tx_data_bd),
- BD_UNMAP_LEN(tx_data_bd), PCI_DMA_TODEVICE);
+ dma_unmap_page(&bp->pdev->dev, BD_UNMAP_ADDR(tx_data_bd),
+ BD_UNMAP_LEN(tx_data_bd), DMA_TO_DEVICE);
if (--nbd)
bd_idx = TX_BD(NEXT_TX_IDX(bd_idx));
}
@@ -1086,7 +1086,7 @@ static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
if (!page)
return;
- pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
+ dma_unmap_page(&bp->pdev->dev, dma_unmap_addr(sw_buf, mapping),
SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
__free_pages(page, PAGES_PER_SGE_SHIFT);
@@ -1115,15 +1115,15 @@ static inline int bnx2x_alloc_rx_sge(struct bnx2x *bp,
if (unlikely(page == NULL))
return -ENOMEM;
- mapping = pci_map_page(bp->pdev, page, 0, SGE_PAGE_SIZE*PAGES_PER_SGE,
- PCI_DMA_FROMDEVICE);
+ mapping = dma_map_page(&bp->pdev->dev, page, 0,
+ SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
__free_pages(page, PAGES_PER_SGE_SHIFT);
return -ENOMEM;
}
sw_buf->page = page;
- pci_unmap_addr_set(sw_buf, mapping, mapping);
+ dma_unmap_addr_set(sw_buf, mapping, mapping);
sge->addr_hi = cpu_to_le32(U64_HI(mapping));
sge->addr_lo = cpu_to_le32(U64_LO(mapping));
@@ -1143,15 +1143,15 @@ static inline int bnx2x_alloc_rx_skb(struct bnx2x *bp,
if (unlikely(skb == NULL))
return -ENOMEM;
- mapping = pci_map_single(bp->pdev, skb->data, bp->rx_buf_size,
- PCI_DMA_FROMDEVICE);
+ mapping = dma_map_single(&bp->pdev->dev, skb->data, bp->rx_buf_size,
+ DMA_FROM_DEVICE);
if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
dev_kfree_skb(skb);
return -ENOMEM;
}
rx_buf->skb = skb;
- pci_unmap_addr_set(rx_buf, mapping, mapping);
+ dma_unmap_addr_set(rx_buf, mapping, mapping);
rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
@@ -1173,13 +1173,13 @@ static void bnx2x_reuse_rx_skb(struct bnx2x_fastpath *fp,
struct eth_rx_bd *cons_bd = &fp->rx_desc_ring[cons];
struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
- pci_dma_sync_single_for_device(bp->pdev,
- pci_unmap_addr(cons_rx_buf, mapping),
- RX_COPY_THRESH, PCI_DMA_FROMDEVICE);
+ dma_sync_single_for_device(&bp->pdev->dev,
+ dma_unmap_addr(cons_rx_buf, mapping),
+ RX_COPY_THRESH, DMA_FROM_DEVICE);
prod_rx_buf->skb = cons_rx_buf->skb;
- pci_unmap_addr_set(prod_rx_buf, mapping,
- pci_unmap_addr(cons_rx_buf, mapping));
+ dma_unmap_addr_set(prod_rx_buf, mapping,
+ dma_unmap_addr(cons_rx_buf, mapping));
*prod_bd = *cons_bd;
}
@@ -1283,9 +1283,9 @@ static void bnx2x_tpa_start(struct bnx2x_fastpath *fp, u16 queue,
/* move empty skb from pool to prod and map it */
prod_rx_buf->skb = fp->tpa_pool[queue].skb;
- mapping = pci_map_single(bp->pdev, fp->tpa_pool[queue].skb->data,
- bp->rx_buf_size, PCI_DMA_FROMDEVICE);
- pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
+ mapping = dma_map_single(&bp->pdev->dev, fp->tpa_pool[queue].skb->data,
+ bp->rx_buf_size, DMA_FROM_DEVICE);
+ dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
/* move partial skb from cons to pool (don't unmap yet) */
fp->tpa_pool[queue] = *cons_rx_buf;
@@ -1361,8 +1361,9 @@ static int bnx2x_fill_frag_skb(struct bnx2x *bp, struct bnx2x_fastpath *fp,
}
/* Unmap the page as we r going to pass it to the stack */
- pci_unmap_page(bp->pdev, pci_unmap_addr(&old_rx_pg, mapping),
- SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
+ dma_unmap_page(&bp->pdev->dev,
+ dma_unmap_addr(&old_rx_pg, mapping),
+ SGE_PAGE_SIZE*PAGES_PER_SGE, DMA_FROM_DEVICE);
/* Add one frag and update the appropriate fields in the skb */
skb_fill_page_desc(skb, j, old_rx_pg.page, 0, frag_len);
@@ -1389,8 +1390,8 @@ static void bnx2x_tpa_stop(struct bnx2x *bp, struct bnx2x_fastpath *fp,
/* Unmap skb in the pool anyway, as we are going to change
pool entry status to BNX2X_TPA_STOP even if new skb allocation
fails. */
- pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
- bp->rx_buf_size, PCI_DMA_FROMDEVICE);
+ dma_unmap_single(&bp->pdev->dev, dma_unmap_addr(rx_buf, mapping),
+ bp->rx_buf_size, DMA_FROM_DEVICE);
if (likely(new_skb)) {
/* fix ip xsum and give it to the stack */
@@ -1620,10 +1621,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
}
}
- pci_dma_sync_single_for_device(bp->pdev,
- pci_unmap_addr(rx_buf, mapping),
- pad + RX_COPY_THRESH,
- PCI_DMA_FROMDEVICE);
+ dma_sync_single_for_device(&bp->pdev->dev,
+ dma_unmap_addr(rx_buf, mapping),
+ pad + RX_COPY_THRESH,
+ DMA_FROM_DEVICE);
prefetch(skb);
prefetch(((char *)(skb)) + 128);
@@ -1665,10 +1666,10 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
} else
if (likely(bnx2x_alloc_rx_skb(bp, fp, bd_prod) == 0)) {
- pci_unmap_single(bp->pdev,
- pci_unmap_addr(rx_buf, mapping),
+ dma_unmap_single(&bp->pdev->dev,
+ dma_unmap_addr(rx_buf, mapping),
bp->rx_buf_size,
- PCI_DMA_FROMDEVICE);
+ DMA_FROM_DEVICE);
skb_reserve(skb, pad);
skb_put(skb, len);
@@ -4940,9 +4941,9 @@ static inline void bnx2x_free_tpa_pool(struct bnx2x *bp,
}
if (fp->tpa_state[i] == BNX2X_TPA_START)
- pci_unmap_single(bp->pdev,
- pci_unmap_addr(rx_buf, mapping),
- bp->rx_buf_size, PCI_DMA_FROMDEVICE);
+ dma_unmap_single(&bp->pdev->dev,
+ dma_unmap_addr(rx_buf, mapping),
+ bp->rx_buf_size, DMA_FROM_DEVICE);
dev_kfree_skb(skb);
rx_buf->skb = NULL;
@@ -4978,7 +4979,7 @@ static void bnx2x_init_rx_rings(struct bnx2x *bp)
fp->disable_tpa = 1;
break;
}
- pci_unmap_addr_set((struct sw_rx_bd *)
+ dma_unmap_addr_set((struct sw_rx_bd *)
&bp->fp->tpa_pool[i],
mapping, 0);
fp->tpa_state[i] = BNX2X_TPA_STOP;
@@ -5658,8 +5659,8 @@ static void bnx2x_nic_init(struct bnx2x *bp, u32 load_code)
static int bnx2x_gunzip_init(struct bnx2x *bp)
{
- bp->gunzip_buf = pci_alloc_consistent(bp->pdev, FW_BUF_SIZE,
- &bp->gunzip_mapping);
+ bp->gunzip_buf = dma_alloc_coherent(&bp->pdev->dev, FW_BUF_SIZE,
+ &bp->gunzip_mapping, GFP_KERNEL);
if (bp->gunzip_buf == NULL)
goto gunzip_nomem1;
@@ -5679,8 +5680,8 @@ gunzip_nomem3:
bp->strm = NULL;
gunzip_nomem2:
- pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
- bp->gunzip_mapping);
+ dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
+ bp->gunzip_mapping);
bp->gunzip_buf = NULL;
gunzip_nomem1:
@@ -5696,8 +5697,8 @@ static void bnx2x_gunzip_end(struct bnx2x *bp)
bp->strm = NULL;
if (bp->gunzip_buf) {
- pci_free_consistent(bp->pdev, FW_BUF_SIZE, bp->gunzip_buf,
- bp->gunzip_mapping);
+ dma_free_coherent(&bp->pdev->dev, FW_BUF_SIZE, bp->gunzip_buf,
+ bp->gunzip_mapping);
bp->gunzip_buf = NULL;
}
}
@@ -6692,7 +6693,7 @@ static void bnx2x_free_mem(struct bnx2x *bp)
#define BNX2X_PCI_FREE(x, y, size) \
do { \
if (x) { \
- pci_free_consistent(bp->pdev, size, x, y); \
+ dma_free_coherent(&bp->pdev->dev, size, x, y); \
x = NULL; \
y = 0; \
} \
@@ -6773,7 +6774,7 @@ static int bnx2x_alloc_mem(struct bnx2x *bp)
#define BNX2X_PCI_ALLOC(x, y, size) \
do { \
- x = pci_alloc_consistent(bp->pdev, size, y); \
+ x = dma_alloc_coherent(&bp->pdev->dev, size, y, GFP_KERNEL); \
if (x == NULL) \
goto alloc_mem_err; \
memset(x, 0, size); \
@@ -6906,9 +6907,9 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
if (skb == NULL)
continue;
- pci_unmap_single(bp->pdev,
- pci_unmap_addr(rx_buf, mapping),
- bp->rx_buf_size, PCI_DMA_FROMDEVICE);
+ dma_unmap_single(&bp->pdev->dev,
+ dma_unmap_addr(rx_buf, mapping),
+ bp->rx_buf_size, DMA_FROM_DEVICE);
rx_buf->skb = NULL;
dev_kfree_skb(skb);
@@ -10269,8 +10270,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode, u8 link_up)
bd_prod = TX_BD(fp_tx->tx_bd_prod);
tx_start_bd = &fp_tx->tx_desc_ring[bd_prod].start_bd;
- mapping = pci_map_single(bp->pdev, skb->data,
- skb_headlen(skb), PCI_DMA_TODEVICE);
+ mapping = dma_map_single(&bp->pdev->dev, skb->data,
+ skb_headlen(skb), DMA_TO_DEVICE);
tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
tx_start_bd->nbd = cpu_to_le16(2); /* start + pbd */
@@ -11316,8 +11317,8 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
}
}
- mapping = pci_map_single(bp->pdev, skb->data,
- skb_headlen(skb), PCI_DMA_TODEVICE);
+ mapping = dma_map_single(&bp->pdev->dev, skb->data,
+ skb_headlen(skb), DMA_TO_DEVICE);
tx_start_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
tx_start_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
@@ -11374,8 +11375,9 @@ static netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
if (total_pkt_bd == NULL)
total_pkt_bd = &fp->tx_desc_ring[bd_prod].reg_bd;
- mapping = pci_map_page(bp->pdev, frag->page, frag->page_offset,
- frag->size, PCI_DMA_TODEVICE);
+ mapping = dma_map_page(&bp->pdev->dev, frag->page,
+ frag->page_offset,
+ frag->size, DMA_TO_DEVICE);
tx_data_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
tx_data_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
@@ -11832,15 +11834,15 @@ static int __devinit bnx2x_init_dev(struct pci_dev *pdev,
goto err_out_release;
}
- if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
+ if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)) == 0) {
bp->flags |= USING_DAC_FLAG;
- if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)) != 0) {
- pr_err("pci_set_consistent_dma_mask failed, aborting\n");
+ if (dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64)) != 0) {
+ pr_err("dma_set_coherent_mask failed, aborting\n");
rc = -EIO;
goto err_out_release;
}
- } else if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
+ } else if (dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)) != 0) {
pr_err("System does not support DMA, aborting\n");
rc = -EIO;
goto err_out_release;
--
1.7.0
^ permalink raw reply related
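The last hunk of the bnx2x patch above keeps the driver's existing mask negotiation (try a 64-bit DMA mask, fall back to 32-bit) and only swaps the pci_* calls for their dma_* equivalents. A minimal userspace sketch of that control flow follows. Everything prefixed demo_ or DEMO_ is hypothetical, and the rule that a fake platform rejects any mask wider than max_supported_mask is a toy assumption for illustration, not the kernel's actual semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Toy equivalent of DMA_BIT_MASK(n); avoids shifting by 64. */
#define DEMO_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for struct device. */
struct demo_device {
	uint64_t max_supported_mask;	/* widest mask the fake platform accepts */
	uint64_t dma_mask;		/* streaming mask currently set */
	uint64_t coherent_mask;		/* coherent mask currently set */
};

/* Toy rule (assumption): reject any mask wider than the platform limit. */
static int demo_set_mask(struct demo_device *dev, uint64_t mask)
{
	assert(dev);
	if (mask > dev->max_supported_mask)
		return -1;
	dev->dma_mask = mask;
	return 0;
}

static int demo_set_coherent_mask(struct demo_device *dev, uint64_t mask)
{
	assert(dev);
	if (mask > dev->max_supported_mask)
		return -1;
	dev->coherent_mask = mask;
	return 0;
}

/*
 * Same shape as the probe-time logic in the hunk:
 * returns 1 for 64-bit DMA, 0 for the 32-bit fallback, -1 on failure.
 */
static int demo_setup_dma(struct demo_device *dev)
{
	if (demo_set_mask(dev, DEMO_BIT_MASK(64)) == 0) {
		/* 64-bit streaming mask accepted; coherent mask must match. */
		if (demo_set_coherent_mask(dev, DEMO_BIT_MASK(64)) != 0)
			return -1;
		return 1;
	}
	if (demo_set_mask(dev, DEMO_BIT_MASK(32)) != 0)
		return -1;
	return 0;
}
```

In the driver the "1" case corresponds to setting USING_DAC_FLAG, and both failure paths return -EIO after logging.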
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 12:09 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404113142.GA11124@gondor.apana.org.au>
Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 07:26:36PM +0800, Herbert Xu wrote:
>> Fine, move key into flow_cache_entry but the rest should still
>> work, no?
>
> OK this doesn't work either as we still have NULL objects for
> now. But I still think even if the ops pointer is the only
> member in flow_cache_object, it looks better than returning the
> nested ops pointer directly from flow_cache_lookup.
Yes, it'll look better. I'll wrap the pointer in a struct.
Ok, so far it's:
- constify ops
- indentation fixes for flow.c structs with pointer members
- wrap ops* in a struct* to avoid ops**
Will fix and resend refreshed patches tomorrow.
Thanks.
^ permalink raw reply
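The third item in the list above (wrap ops* in a struct* to avoid ops**) can be sketched in userspace C as follows. This is only an illustration of the idea under discussion, not the code from the patch series: flow_cache_object and flow_cache_ops are names taken from the thread, but their members here are assumed, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct flow_cache_object;

/* Assumed shape of the ops table; "get" is a hypothetical method. */
struct flow_cache_ops {
	struct flow_cache_object *(*get)(struct flow_cache_object *obj);
};

/* The wrapper: for now the ops pointer is its only member. */
struct flow_cache_object {
	const struct flow_cache_ops *ops;
};

/* Hypothetical cached type embedding the wrapper as its first member. */
struct demo_policy {
	struct flow_cache_object flo;
	int refcnt;
};

static struct flow_cache_object *demo_get(struct flow_cache_object *obj)
{
	/* Valid because flo is the first member of struct demo_policy. */
	struct demo_policy *p = (struct demo_policy *)obj;

	p->refcnt++;
	return obj;
}

static const struct flow_cache_ops demo_ops = { .get = demo_get };

/*
 * Stand-in for a cache lookup. It hands back a
 * struct flow_cache_object * rather than a
 * const struct flow_cache_ops **.
 */
static struct flow_cache_object *demo_lookup(struct demo_policy *p)
{
	assert(p != NULL);
	p->flo.ops = &demo_ops;
	return p->flo.ops->get(&p->flo);
}
```

Callers then dispatch through obj->ops->get(obj) instead of dereferencing a double pointer, and the wrapper leaves room for further members later.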
* [PATCHv2] virtio-net: move sg off stack
From: Michael S. Tsirkin @ 2010-04-04 13:07 UTC (permalink / raw)
To: David S. Miller, Rusty Russell, Jiri Pirko, Michael S. Tsirkin,
Shirley Ma, ne
Move sg structure off stack and into virtnet_info structure.
This helps remove extra sg_init_table calls as well as reduce
stack usage.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Michael S. Tsirkin <mst@redhat.com>
---
Changes from v1: fix compilation of add_recvbuf_mergeable (&rx_sg -> rx_sg)
This patch works for me. Shirley, could you find the time to test
this as well please?
drivers/net/virtio_net.c | 52 ++++++++++++++++++++++-----------------------
1 files changed, 25 insertions(+), 27 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3f5be35..186dd6a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -39,8 +39,7 @@ module_param(gso, bool, 0444);
#define VIRTNET_SEND_COMMAND_SG_MAX 2
-struct virtnet_info
-{
+struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *rvq, *svq, *cvq;
struct net_device *dev;
@@ -61,6 +60,10 @@ struct virtnet_info
/* Chain pages by the private ptr. */
struct page *pages;
+
+ /* fragments + linear part + virtio header */
+ struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
+ struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
};
struct skb_vnet_hdr {
@@ -323,10 +326,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
{
struct sk_buff *skb;
struct skb_vnet_hdr *hdr;
- struct scatterlist sg[2];
int err;
- sg_init_table(sg, 2);
skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
if (unlikely(!skb))
return -ENOMEM;
@@ -334,11 +335,11 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
skb_put(skb, MAX_PACKET_LEN);
hdr = skb_vnet_hdr(skb);
- sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
+ sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
- skb_to_sgvec(skb, sg + 1, 0, skb->len);
+ skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
- err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, vi->rx_sg, 0, 2, skb);
if (err < 0)
dev_kfree_skb(skb);
@@ -347,13 +348,11 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
{
- struct scatterlist sg[MAX_SKB_FRAGS + 2];
struct page *first, *list = NULL;
char *p;
int i, err, offset;
- sg_init_table(sg, MAX_SKB_FRAGS + 2);
- /* page in sg[MAX_SKB_FRAGS + 1] is list tail */
+ /* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
first = get_a_page(vi, gfp);
if (!first) {
@@ -361,7 +360,7 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
give_pages(vi, list);
return -ENOMEM;
}
- sg_set_buf(&sg[i], page_address(first), PAGE_SIZE);
+ sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
/* chain new page in list head to match sg */
first->private = (unsigned long)list;
@@ -375,17 +374,17 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
}
p = page_address(first);
- /* sg[0], sg[1] share the same page */
- /* a separated sg[0] for virtio_net_hdr only during to QEMU bug*/
- sg_set_buf(&sg[0], p, sizeof(struct virtio_net_hdr));
+ /* vi->rx_sg[0], vi->rx_sg[1] share the same page */
+ /* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
+ sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
- /* sg[1] for data packet, from offset */
+ /* vi->rx_sg[1] for data packet, from offset */
offset = sizeof(struct padded_vnet_hdr);
- sg_set_buf(&sg[1], p + offset, PAGE_SIZE - offset);
+ sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
/* chain first in list head */
first->private = (unsigned long)list;
- err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, MAX_SKB_FRAGS + 2,
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
first);
if (err < 0)
give_pages(vi, first);
@@ -396,16 +395,15 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
{
struct page *page;
- struct scatterlist sg;
int err;
page = get_a_page(vi, gfp);
if (!page)
return -ENOMEM;
- sg_init_one(&sg, page_address(page), PAGE_SIZE);
+ sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
- err = vi->rvq->vq_ops->add_buf(vi->rvq, &sg, 0, 1, page);
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, vi->rx_sg, 0, 1, page);
if (err < 0)
give_pages(vi, page);
@@ -514,12 +512,9 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
{
- struct scatterlist sg[2+MAX_SKB_FRAGS];
struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
- sg_init_table(sg, 2+MAX_SKB_FRAGS);
-
pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -553,12 +548,13 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
/* Encode metadata header at front. */
if (vi->mergeable_rx_bufs)
- sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
+ sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
else
- sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
+ sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
- hdr->num_sg = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
- return vi->svq->vq_ops->add_buf(vi->svq, sg, hdr->num_sg, 0, skb);
+ hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+ return vi->svq->vq_ops->add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+ 0, skb);
}
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -941,6 +937,8 @@ static int virtnet_probe(struct virtio_device *vdev)
vdev->priv = vi;
vi->pages = NULL;
INIT_DELAYED_WORK(&vi->refill, refill_work);
+ sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
+ sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
/* If we can receive ANY GSO packets, we must allocate large ones. */
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
--
1.7.0.2.280.gc6f05
^ permalink raw reply related
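The virtio-net patch above replaces per-call on-stack scatterlist arrays with arrays embedded in the per-device structure, initialized once at probe time. A minimal userspace sketch of that pattern is shown below. All demo_ and DEMO_ names are hypothetical stand-ins (demo_seg for struct scatterlist, demo_dev for struct virtnet_info), and the sketch assumes callers of the transmit path are serialized, since the array is shared per device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FRAGS 4	/* hypothetical stand-in for MAX_SKB_FRAGS */

/* Hypothetical stand-in for struct scatterlist. */
struct demo_seg {
	const void *buf;
	size_t len;
	int is_last;		/* end-of-table marker */
};

/* Plays the role of struct virtnet_info. */
struct demo_dev {
	/* fragments + linear part + header, as in the patch comment */
	struct demo_seg tx_sg[DEMO_MAX_FRAGS + 2];
	unsigned int init_calls;
};

/* Toy analogue of sg_init_table(): clear the table, mark the last entry. */
static void demo_seg_init_table(struct demo_seg *sg, size_t n)
{
	assert(n > 0);
	memset(sg, 0, n * sizeof(*sg));
	sg[n - 1].is_last = 1;
}

/* Called once per device, like the probe-time init added by the patch. */
static void demo_probe(struct demo_dev *d)
{
	d->init_calls = 0;
	demo_seg_init_table(d->tx_sg, sizeof(d->tx_sg) / sizeof(d->tx_sg[0]));
	d->init_calls++;
}

/*
 * Per-packet path: only fills entries. No table init and no large
 * stack array; returns the number of entries used.
 */
static size_t demo_xmit(struct demo_dev *d, const void *hdr, size_t hdr_len,
			const void *data, size_t data_len)
{
	d->tx_sg[0].buf = hdr;
	d->tx_sg[0].len = hdr_len;
	d->tx_sg[1].buf = data;
	d->tx_sg[1].len = data_len;
	return 2;
}
```

The trade-off is a larger per-device structure in exchange for a smaller stack frame and one table initialization per device instead of one per packet.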