* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04 10:19 UTC (permalink / raw)
To: eric.dumazet; +Cc: fujita.tomonori, netdev, davem
In-Reply-To: <1270373395.1971.13.camel@edumazet-laptop>
On Sun, 04 Apr 2010 11:29:55 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I think, if pdev is null, returning 1 here is safer since the device
> > doesn't set up dma info properly.
> >
> > Do you know what device hits this bug? You said that you use bnx2 and
> > tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> > that netdev->dev.parent is set up correctly.
> > --
>
> Might be because of my setup, I suspect I had two reasons to hit the
> bug :
>
> A bonding of eth2 (bnx2) and eth3 (tg3)
>
> Then vlans on top of this bond0
>
> When first dev_queue_xmit() was called, it was for a virtual device :)
Thanks! So it's due to bond or vlan (or both).
I guess that returning zero here with a null pdev is fine. If we
return 1, probably some people would complain about performance
regression. Like the block layer does, coping the dma restriction info
from the lower devices can solve this problem but I guess that it's
over engineering. My original patch doesn't loosen the DMA restriction
checking so returning zero shouldn't break anything. If when we fix
the usage of NETIF_F_HIGHDMA in each driver and also check the usage
of netdev->dev.parent, everything should be fine.
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-04 10:03 UTC (permalink / raw)
To: vladz; +Cc: davem, fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDC@SJEXCHCCR02.corp.ad.broadcom.com>
On Sun, 4 Apr 2010 02:15:52 -0700
"Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> According to the changes in a PCI-DMA-mapping.txt it sounds like the
> trend is to stop using the pci_dma_* API and start using the dma_*
> API instead. Does this mean that using the pci_dma_* API is
> deprecated?
Sorry that I didn't put the enough information in the patch
description.
In the long term, I want to remove the pci_dma_* API.
We had the various bus-specific DMA API (pci, sbus, etc). It was the
headache for driver writers that handle multiple bus devices. So we
invented the generic DMA API long ago. Now we have only two bus
specific APIs: pci and ssb so I want to remove them and make sure that
driver writers are always able to use the generic DMA API with any
bus.
http://lwn.net/Articles/374137/
I don't plan to convert the whole tree to use the DMA API over the PCI
DMA API all together. I convert the drivers gradually. I already
removed the PCI DMA API from the docs under Documentation/. It would
be nice if some driver maintainers (or others) convert their drivers.
The patchset convert only the pci state API (such as
DECLARE_PCI_UNMAP_ADDR) because akpm complained of the API and I
promised him to clean up it:
http://lkml.org/lkml/2010/2/12/415
I'll send a patch if you like me to convert bnx2x to use the the DMA
API completely.
^ permalink raw reply
* Re: [BUG] latest net-next-2.6 doesnt fly
From: Eric Dumazet @ 2010-04-04 9:29 UTC (permalink / raw)
To: FUJITA Tomonori; +Cc: netdev, davem
In-Reply-To: <20100404181542N.fujita.tomonori@lab.ntt.co.jp>
Le dimanche 04 avril 2010 à 18:16 +0900, FUJITA Tomonori a écrit :
> > + return 0;
>
> Sorry about that and thanks for the fix.
>
> I think, if pdev is null, returning 1 here is safer since the device
> doesn't set up dma info properly.
>
> Do you know what device hits this bug? You said that you use bnx2 and
> tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> that netdev->dev.parent is set up correctly.
> --
Might be because of my setup, I suspect I had two reasons to hit the
bug :
A bonding of eth2 (bnx2) and eth3 (tg3)
Then vlans on top of this bond0
When first dev_queue_xmit() was called, it was for a virtual device :)
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
qlen 1000
link/ether 00:1e:0b:ec:d3:dc brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq
master bond0 state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc hfsc
master bond0 state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
link/ether 00:1e:0b:92:78:51 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
7: vlan.103@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 100
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
8: vlan.825@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 1000
link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
^ permalink raw reply
* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04 9:16 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, davem, fujita.tomonori
In-Reply-To: <1270202304.1989.14.camel@edumazet-laptop>
On Fri, 02 Apr 2010 11:58:24 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e19cdae..c6b5206 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1801,7 +1801,7 @@ EXPORT_SYMBOL(netdev_rx_csum_fault);
> * 2. No high memory really exists on this machine.
> */
>
> -static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> +static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> {
> #ifdef CONFIG_HIGHMEM
> int i;
> @@ -1814,6 +1814,8 @@ static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> if (PCI_DMA_BUS_IS_PHYS) {
> struct device *pdev = dev->dev.parent;
>
> + if (!pdev)
> + return 0;
Sorry about that and thanks for the fix.
I think, if pdev is null, returning 1 here is safer since the device
doesn't set up dma info properly.
Do you know what device hits this bug? You said that you use bnx2 and
tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
that netdev->dev.parent is set up correctly.
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 9:15 UTC (permalink / raw)
To: David Miller
Cc: fujita.tomonori@lab.ntt.co.jp, netdev@vger.kernel.org,
Eilon Greenstein
In-Reply-To: <20100404.013920.120463402.davem@davemloft.net>
According to the changes in a PCI-DMA-mapping.txt it sounds like the trend is to stop using the pci_dma_* API and start using the dma_* API instead. Does this mean that using the pci_dma_* API is deprecated?
Thanks,
vlad
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, April 04, 2010 11:39 AM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; netdev@vger.kernel.org;
> Eilon Greenstein
> Subject: Re: [PATCH] bnx2x: use the dma state API instead of
> the pci equivalents
>
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 4 Apr 2010 01:19:07 -0700
>
> > So, if we don't need to use pci_X() interface anymore, lets replace
> > pci_X() properly all over the bnx2x with dma_X() functions.
>
> I think Fujita's plan of gradual and partial transformations is
> legitimate, and his changes shouldn't be rejected because he simply
> isn't modifying all of the interfaces used by this driver but rather
> just a specific subset he is trying to transform across the tree.
>
> Please rescind your objections.
>
> Thanks.
>
>
^ permalink raw reply
* Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
From: Michael S. Tsirkin @ 2010-04-04 8:55 UTC (permalink / raw)
To: David Stevens; +Cc: kvm, kvm-owner, netdev, rusty, virtualization
In-Reply-To: <OF9AF90021.62EF0EE4-ON882576F8.00618B3E-882576F8.0064F223@us.ibm.com>
On Thu, Apr 01, 2010 at 11:22:37AM -0700, David Stevens wrote:
> kvm-owner@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:
>
> > On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:
>
> > >
> > > > > + head.iov_base = (void
> *)vhost_get_vq_desc(&net->dev,
> > > vq,
> > > > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in,
> NULL,
> > >
> > > > > NULL);
> > > >
> > > > I this casting confusing.
> > > > Is it really expensive to add an array of heads so that
> > > > we do not need to cast?
> > >
> > > It needs the heads and the lengths, which looks a lot
> > > like an iovec. I was trying to resist adding a new
> > > struct XXX { unsigned head; unsigned len; } just for this,
> > > but I could make these parallel arrays, one with head index and
> > > the other with length.
>
> Michael, on this one, if I add vq->heads as an argument to
> vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
> Would you rather this 1) remain an iovec (and a single arg added) but
> cast still there, 2) 2 arrays (head and length) and 2 args added, or
> 3) a new struct type of {unsigned,int} to carry for the heads+len
> instead of iovec?
> My preference would be 1). I agree the casts are ugly, but
> it is essentially an iovec the way we use it; it's just that the
> base isn't a pointer but a descriptor index instead.
I prefer 2 or 3. If you prefer 1 strongly, I think we should
add a detailed comment near the iovec, and
a couple of inline wrappers to store/get data in the iovec.
> > >
> > > EAGAIN is not possible after the change, because we don't
> > > even enter the loop unless we have an skb on the read queue; the
> > > other cases bomb out, so I figured the comment for future work is
> > > now done. :-)
> >
> > Guest could be buggy so we'll get EFAULT.
> > If skb is taken off the rx queue (as below), we might get EAGAIN.
>
> We break on any error. If we get EAGAIN because someone read
> on the socket, this code would break the loop, but EAGAIN is a more
> serious problem if it changed since we peeked (because it means
> someone else is reading the socket).
> But I don't understand -- are you suggesting that the error
> handling be different than that, or that the comment is still
> relevant?
> My intention here is to do the "TODO" from the comment
> so that it can be removed, by handling all error cases. I think
> because of the peek, EAGAIN isn't something to be ignored anymore,
> but the effect is the same whether we break out of the loop or
> not, since we retry the packet next time around. Essentially, we
> ignore every error since we will redo it with the same packet the
> next time around. Maybe we should print something here, but since
> we'll be retrying the packet that's still on the socket, a permanent
> error would spew continuously. Maybe we should shut down entirely
> if we get any negative return value here (including EAGAIN, since
> that tells us someone messed with the socket when we don't want them
> to).
> If you want the comment still there, ok, but I do think EAGAIN
> isn't a special case per the comment anymore, and is handled as all
> other errors are: by exiting the loop and retrying next time.
>
> +-DLS
Yes, I just think some comment should stay, as you say, because
otherwise we simply retry continuously. Maybe we should trigger vq_err.
It needs to be given some thought which I have not given it yet.
Thinking aloud, EAGAIN means someone reads the socket
together with us, I prefer that this condition is made a fatal
error, we should make sure we are polling the socket
so we see packets if more appear.
--
MST
^ permalink raw reply
* Re: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: David Miller @ 2010-04-04 8:39 UTC (permalink / raw)
To: vladz; +Cc: fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDB@SJEXCHCCR02.corp.ad.broadcom.com>
From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 4 Apr 2010 01:19:07 -0700
> So, if we don't need to use pci_X() interface anymore, lets replace
> pci_X() properly all over the bnx2x with dma_X() functions.
I think Fujita's plan of gradual and partial transformations is
legitimate, and his changes shouldn't be rejected because he simply
isn't modifying all of the interfaces used by this driver but rather
just a specific subset he is trying to transform across the tree.
Please rescind your objections.
Thanks.
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 8:35 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB8319B.3040209@iki.fi>
On Sun, Apr 04, 2010 at 09:28:43AM +0300, Timo Teräs wrote:
>
> No. The flow cache flush removal does not prevent bundle deletion.
> The flow cache flush is in current code *after* deleting the bundles
> from the policy. Freeing bundles and flushing cache are completely
> two separate things in current code. Only in 2/4 the bundle deletion
> becomes dependent on flow cache flush.
Ah yes I confused myself. The problem I thought would occur
doesn't because 1/4 doesn't put the bundles in the flow cache
yet.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 8:19 UTC (permalink / raw)
To: FUJITA Tomonori, netdev@vger.kernel.org; +Cc: Eilon Greenstein
In-Reply-To: <20100402115054A.fujita.tomonori@lab.ntt.co.jp>
Why is it preferable? As far as I can see the current patch is not going to introduce any functional change.
Is there a plan to remove pci_map_X()/pci_alloc_consistent() functions family in the future and completely replaced with dma_X() functions? Is it appropriate to use dma_X() in all the places where pci_X() is used? For instance, we do use DAC mode and as far as I understand we should use pci_X() interface in this case. Is this rule not relevant anymore?
So, if we don't need to use pci_X() interface anymore, lets replace pci_X() properly all over the bnx2x with dma_X() functions. And if not, this patch mixes the macros from one API (dma_X) and functions from another (pci_X()) which may hardly be called "preferable"...
Thanks,
vlad
> -----Original Message-----
> From: netdev-owner@vger.kernel.org
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Friday, April 02, 2010 5:57 AM
> To: netdev@vger.kernel.org
> Cc: Eilon Greenstein
> Subject: [PATCH] bnx2x: use the dma state API instead of the
> pci equivalents
>
> The DMA API is preferred.
>
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
> drivers/net/bnx2x.h | 4 ++--
> drivers/net/bnx2x_main.c | 28 ++++++++++++++--------------
> 2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do {
> \
>
> struct sw_rx_bd {
> struct sk_buff *skb;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>
> struct sw_rx_page {
> struct page *page;
> - DECLARE_PCI_UNMAP_ADDR(mapping)
> + DEFINE_DMA_UNMAP_ADDR(mapping);
> };
>
> union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 6c042a7..2a77611 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -1086,7 +1086,7 @@ static inline void
> bnx2x_free_rx_sge(struct bnx2x *bp,
> if (!page)
> return;
>
> - pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> + pci_unmap_page(bp->pdev, dma_unmap_addr(sw_buf, mapping),
> SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
> __free_pages(page, PAGES_PER_SGE_SHIFT);
>
> @@ -1123,7 +1123,7 @@ static inline int
> bnx2x_alloc_rx_sge(struct bnx2x *bp,
> }
>
> sw_buf->page = page;
> - pci_unmap_addr_set(sw_buf, mapping, mapping);
> + dma_unmap_addr_set(sw_buf, mapping, mapping);
>
> sge->addr_hi = cpu_to_le32(U64_HI(mapping));
> sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1151,7 +1151,7 @@ static inline int
> bnx2x_alloc_rx_skb(struct bnx2x *bp,
> }
>
> rx_buf->skb = skb;
> - pci_unmap_addr_set(rx_buf, mapping, mapping);
> + dma_unmap_addr_set(rx_buf, mapping, mapping);
>
> rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
> rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1174,12 +1174,12 @@ static void bnx2x_reuse_rx_skb(struct
> bnx2x_fastpath *fp,
> struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>
> pci_dma_sync_single_for_device(bp->pdev,
> -
> pci_unmap_addr(cons_rx_buf, mapping),
> +
> dma_unmap_addr(cons_rx_buf, mapping),
> RX_COPY_THRESH,
> PCI_DMA_FROMDEVICE);
>
> prod_rx_buf->skb = cons_rx_buf->skb;
> - pci_unmap_addr_set(prod_rx_buf, mapping,
> - pci_unmap_addr(cons_rx_buf, mapping));
> + dma_unmap_addr_set(prod_rx_buf, mapping,
> + dma_unmap_addr(cons_rx_buf, mapping));
> *prod_bd = *cons_bd;
> }
>
> @@ -1285,7 +1285,7 @@ static void bnx2x_tpa_start(struct
> bnx2x_fastpath *fp, u16 queue,
> prod_rx_buf->skb = fp->tpa_pool[queue].skb;
> mapping = pci_map_single(bp->pdev,
> fp->tpa_pool[queue].skb->data,
> bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> - pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> + dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>
> /* move partial skb from cons to pool (don't unmap yet) */
> fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,7 +1361,7 @@ static int bnx2x_fill_frag_skb(struct
> bnx2x *bp, struct bnx2x_fastpath *fp,
> }
>
> /* Unmap the page as we r going to pass it to
> the stack */
> - pci_unmap_page(bp->pdev,
> pci_unmap_addr(&old_rx_pg, mapping),
> + pci_unmap_page(bp->pdev,
> dma_unmap_addr(&old_rx_pg, mapping),
> SGE_PAGE_SIZE*PAGES_PER_SGE,
> PCI_DMA_FROMDEVICE);
>
> /* Add one frag and update the appropriate
> fields in the skb */
> @@ -1389,7 +1389,7 @@ static void bnx2x_tpa_stop(struct bnx2x
> *bp, struct bnx2x_fastpath *fp,
> /* Unmap skb in the pool anyway, as we are going to change
> pool entry status to BNX2X_TPA_STOP even if new skb
> allocation
> fails. */
> - pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> + pci_unmap_single(bp->pdev, dma_unmap_addr(rx_buf, mapping),
> bp->rx_buf_size, PCI_DMA_FROMDEVICE);
>
> if (likely(new_skb)) {
> @@ -1621,7 +1621,7 @@ static int bnx2x_rx_int(struct
> bnx2x_fastpath *fp, int budget)
> }
>
> pci_dma_sync_single_for_device(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> + dma_unmap_addr(rx_buf, mapping),
> pad +
> RX_COPY_THRESH,
>
> PCI_DMA_FROMDEVICE);
> prefetch(skb);
> @@ -1666,7 +1666,7 @@ static int bnx2x_rx_int(struct
> bnx2x_fastpath *fp, int budget)
> } else
> if (likely(bnx2x_alloc_rx_skb(bp, fp,
> bd_prod) == 0)) {
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf, mapping),
> + dma_unmap_addr(rx_buf, mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
> skb_reserve(skb, pad);
> @@ -4941,7 +4941,7 @@ static inline void
> bnx2x_free_tpa_pool(struct bnx2x *bp,
>
> if (fp->tpa_state[i] == BNX2X_TPA_START)
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf,
> mapping),
> + dma_unmap_addr(rx_buf,
> mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
>
> dev_kfree_skb(skb);
> @@ -4978,7 +4978,7 @@ static void bnx2x_init_rx_rings(struct
> bnx2x *bp)
> fp->disable_tpa = 1;
> break;
> }
> - pci_unmap_addr_set((struct sw_rx_bd *)
> + dma_unmap_addr_set((struct sw_rx_bd *)
>
> &bp->fp->tpa_pool[i],
> mapping, 0);
> fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -6907,7 +6907,7 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
> continue;
>
> pci_unmap_single(bp->pdev,
> - pci_unmap_addr(rx_buf,
> mapping),
> + dma_unmap_addr(rx_buf,
> mapping),
> bp->rx_buf_size,
> PCI_DMA_FROMDEVICE);
>
> rx_buf->skb = NULL;
> --
> 1.7.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply
* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04 8:14 UTC (permalink / raw)
To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100404.010259.161188935.davem@davemloft.net>
Le dimanche 04 avril 2010 à 01:02 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 04 Apr 2010 09:54:28 +0200
>
> > [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
> >
> > Followup to commit 789a4a2c
> > (l2tp: Add support for static unmanaged L2TPv3 tunnels)
> >
> > One missing init in l2tp_tunnel_sock_create() could access random kernel
> > memory, and a bit field should be unsigned.
> >
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>
> Applied, thanks Eric.
I am going to work on net/l2tp/l2tp_core.c, since RCU conversion is
wrong (but original code was wong too...)
Example :
There is no real protection in following code, since no refcount is
taken on session before releasing rcu_read_lock :
static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
{
struct l2tp_net *pn = l2tp_pernet(net);
struct hlist_head *session_list =
l2tp_session_id_hash_2(pn, session_id);
struct l2tp_session *session;
struct hlist_node *walk;
rcu_read_lock_bh();
hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
if (session->session_id == session_id) {
rcu_read_unlock_bh();
return session;
}
}
rcu_read_unlock_bh();
return NULL;
}
^ permalink raw reply
* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: David Miller @ 2010-04-04 8:02 UTC (permalink / raw)
To: eric.dumazet; +Cc: jchapman, netdev
In-Reply-To: <1270367668.1971.3.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 04 Apr 2010 09:54:28 +0200
> [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
>
> Followup to commit 789a4a2c
> (l2tp: Add support for static unmanaged L2TPv3 tunnels)
>
> One missing init in l2tp_tunnel_sock_create() could access random kernel
> memory, and a bit field should be unsigned.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04 7:54 UTC (permalink / raw)
To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100403.142522.01040744.davem@davemloft.net>
Le samedi 03 avril 2010 à 14:25 -0700, David Miller a écrit :
> From: James Chapman <jchapman@katalix.com>
> Date: Fri, 02 Apr 2010 17:18:23 +0100
>
> > This patch series adds L2TPv3 support. It splits the existing pppol2tp
> > driver to separate its L2TP and PPP parts, then adds new L2TPv3
> > functionality. The patches implement a new socket family for L2TPv3 IP
> > encapsulation, expose virtual netdevices for each L2TPv3 ethernet
> > pseudowire and add a netlink interface.
>
> Ok I'm going to toss this into net-next-2.6, we have time to
> fixup any remaining issues people might discover.
>
[PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
Followup to commit 789a4a2c
(l2tp: Add support for static unmanaged L2TPv3 tunnels)
One missing init in l2tp_tunnel_sock_create() could access random kernel
memory, and a bit field should be unsigned.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/l2tp/l2tp_core.c | 2 +-
net/l2tp/l2tp_core.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 13ed85b..98dfcce 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1227,7 +1227,7 @@ static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2t
int err = -EINVAL;
struct sockaddr_in udp_addr;
struct sockaddr_l2tpip ip_addr;
- struct socket *sock;
+ struct socket *sock = NULL;
switch (cfg->encap) {
case L2TP_ENCAPTYPE_UDP:
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 91b1b9c..f0f318e 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -152,7 +152,7 @@ struct l2tp_tunnel_cfg {
struct in_addr peer_ip;
u16 local_udp_port;
u16 peer_udp_port;
- int use_udp_checksums:1;
+ unsigned int use_udp_checksums:1;
};
struct l2tp_tunnel {
^ permalink raw reply related
* Re: linux-next: build warning after merge of the tree
From: Stephen Rothwell @ 2010-04-04 7:26 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-next, linux-kernel, kyle
In-Reply-To: <20100328.185707.161194820.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 342 bytes --]
Hi Dave,
On Sun, 28 Mar 2010 18:57:07 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> Strange, it didn't warn here.
>
> I just committed the following, let me know if it fixes
> this.
Yeah, that did it, thanks
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply
* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: Oliver Neukum @ 2010-04-04 7:24 UTC (permalink / raw)
To: L. Alberto Giménez
Cc: linux-kernel, netdev, linux-usb, linville, j.dumon,
steve.glendinning, davem, gregkh, dgiagio, dborca
In-Reply-To: <4BB63619.6070607@sysvalve.es>
Am Freitag, 2. April 2010 20:23:21 schrieb L. Alberto Giménez:
> On 03/31/2010 10:33 PM, Oliver Neukum wrote:
> > Am Mittwoch, 31. März 2010 21:42:07 schrieb L. Alberto Giménez:
>
> Hi Oliver,
>
> Just like with Ben's comments I still have a couple of doubts about your
> comments.
>
>
> >> +
> >> +static int ipheth_open(struct net_device *net)
> >> +{
> >> + struct ipheth_device *dev = netdev_priv(net);
> >> + struct usb_device *udev = dev->udev;
> >> + int retval = 0;
> >> +
> >> + usb_set_interface(udev, IPHETH_INTFNUM, IPHETH_ALT_INTFNUM);
> >> + usb_clear_halt(udev, usb_rcvbulkpipe(udev, dev->bulk_in));
> >> + usb_clear_halt(udev, usb_sndbulkpipe(udev, dev->bulk_out));
> >
> > Is this really needed? If so, please add a comment.
>
> I understand that usb_clear_halt is only needed when the device has
> transmitted data, and as it is "open" time, we can assume that no
> transmissions ere made, so we don't need to clear anything (aka: remove
> both lines), am I right?
Clearing a halt is necessary only when a device has stalled due to an
error condition. Unless the device is buggy and produces errors for
no good reason you don't need these lines.
> >> +
> >> + retval = ipheth_carrier_set(dev);
> >> + if (retval)
> >> + goto error;
> >> +
> >> + retval = ipheth_rx_submit(dev, GFP_KERNEL);
> >> + if (retval)
> >> + goto error;
> >> +
> >> + schedule_delayed_work(&dev->carrier_work, IPHETH_CARRIER_CHECK_TIMEOUT);
> >
> > Does it make sense to start rx while you have no carrier?
>
> Well, I have no clue about this one. I think that upstream developers
> should take a look into this (Dario, Daniel, could you?) since I don't
> have the knowledge to decide what to do about it.
>
> But I assume that as with the previous one, we have just opened the
> device and we aren't (yet) doing anything with it, so we shouldn't start rx?
Your code as is is correct, I just wondered whether it could be made more
efficient.
> >> +static void ipheth_disconnect(struct usb_interface *intf)
> >> +{
> >> + struct ipheth_device *dev;
> >> +
> >> + dev = usb_get_intfdata(intf);
> >> + if (dev != NULL) {
> >
> > is this check needed?
>
> Does usb_get_infdata always return not NULL? I haven't found anything
It returns what you gave it with usb_set_intfdata().
> about it (just manual pages for the function, but can't spot if it
> cannot return NULL). We disconnected the device, but I understand that
> the kernel still has the information and the allocated memory, so the
> cleanup code is still needed, isn't it?
It is definitely needed.
> >> +static struct usb_driver ipheth_driver = {
> >> + .name = "ipheth",
> >> + .probe = ipheth_probe,
> >> + .disconnect = ipheth_disconnect,
> >> + .id_table = ipheth_table,
> >> + .supports_autosuspend = 0,
> >
> > redundant
>
> Why?
0 is the default.
Regards
Oliver
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 6:28 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404061941.GA9642@gondor.apana.org.au>
Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 09:07:12AM +0300, Timo Teräs wrote:
>> With 1/4 only, the bundle deletion is not touched. In that case
>> the policy GC deletes explicitly the bundles. The bundles get
>> deleted immediately, and only the struct xfrm_policy might get
>> held up allocated longer.
>>
>> The code flow would be:
>> 1. xfrm_policy_kill() queues to GC
>> 2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
>> frees all bundles in that policy
>> 3. xfrm_policy_gc_kill() releases it's reference
>> 4. ... time passes (flush, randomization, or flow hit occurs)
>> 5. flow cache releases it's final reference, calls
>> xfrm_policy_destroy() which only frees the xfrm_policy memory
>
> With 1/4 only, you've removed the flow cache flush when policies
> are deleted. However, you don't add the flow cache flush to the
> NETDEV_DOWN case until 2/4. So with only 1/4, bundles (and hence
> netdev objects) may be held for up to 10 minutes.
No. The flow cache flush removal does not prevent bundle deletion.
The flow cache flush is in current code *after* deleting the bundles
from the policy. Freeing bundles and flushing cache are completely
two separate things in current code. Only in 2/4 the bundle deletion
becomes dependent on flow cache flush.
Please, read xfrm_policy_kill() and xfrm_policy_gc_kill() in current
code, and after applying 1/4. The diff of 1/4 is not as informative
as the bundle deletion code is not shown there (since it's not touched).
The only reason why current code does flow cache flush on policy
removal, is to make sure that it's not the flow cache atomic_dec
to drop last reference; if that happened xfrm_policy_destroy would
not get ever called.
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 6:19 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB82C90.70809@iki.fi>
On Sun, Apr 04, 2010 at 09:07:12AM +0300, Timo Teräs wrote:
>
> With 1/4 only, the bundle deletion is not touched. In that case
> the policy GC deletes explicitly the bundles. The bundles get
> deleted immediately, and only the struct xfrm_policy might get
> held up allocated longer.
>
> The code flow would be:
> 1. xfrm_policy_kill() queues to GC
> 2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
> frees all bundles in that policy
> 3. xfrm_policy_gc_kill() releases it's reference
> 4. ... time passes (flush, randomization, or flow hit occurs)
> 5. flow cache releases it's final reference, calls
> xfrm_policy_destroy() which only frees the xfrm_policy memory
With 1/4 only, you've removed the flow cache flush when policies
are deleted. However, you don't add the flow cache flush to the
NETDEV_DOWN case until 2/4. So with only 1/4, bundles (and hence
netdev objects) may be held for up to 10 minutes.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 6:07 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404055830.GA9484@gondor.apana.org.au>
Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 08:50:41AM +0300, Timo Teräs wrote:
>> For the common case:
>>
>> 1. Policy deleted; policy->walk.dead set, policy->genid incremented
>> 2. NETDEV_DOWN hook called, calls flow_cache_flush()
>> 3. flow_cache_flush enumerates all policy and bundle refs
>> in it's cache
>> 4. for each bundle xfrm_bundle_check_fce() is called, which
>> calls stale_bundle()
>> 5. all bundles using stale policy, fail that check because
>> xdst->policy_genid != xdst->pols[0]->genid
>> (checked in xfrm_bundle_ok)
>> 6. flow cache calls entry's ->delete which is dst_free for bundles
>> 7. flow_cache_flush() returns
>
> Ah, you're doing it in 2/4. Can we please have each patch be
> a self-contained unit? It should be possible to apply 1/4 and
> have a resulting kernel that works properly without having to
> apply the rest of your patches.
With 1/4 only, the bundle deletion is not touched. In that case
the policy GC deletes explicitly the bundles. The bundles get
deleted immediately, and only the struct xfrm_policy might get
held up allocated longer.
The code flow would be:
1. xfrm_policy_kill() queues to GC
2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
frees all bundles in that policy
3. xfrm_policy_gc_kill() releases it's reference
4. ... time passes (flush, randomization, or flow hit occurs)
5. flow cache releases it's final reference, calls
xfrm_policy_destroy() which only frees the xfrm_policy memory
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 5:58 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB828B1.5090109@iki.fi>
On Sun, Apr 04, 2010 at 08:50:41AM +0300, Timo Teräs wrote:
>
> For the common case:
>
> 1. Policy deleted; policy->walk.dead set, policy->genid incremented
> 2. NETDEV_DOWN hook called, calls flow_cache_flush()
> 3. flow_cache_flush enumerates all policy and bundle refs
> in it's cache
> 4. for each bundle xfrm_bundle_check_fce() is called, which
> calls stale_bundle()
> 5. all bundles using stale policy, fail that check because
> xdst->policy_genid != xdst->pols[0]->genid
> (checked in xfrm_bundle_ok)
> 6. flow cache calls entry's ->delete which is dst_free for bundles
> 7. flow_cache_flush() returns
Ah, you're doing it in 2/4. Can we please have each patch be
a self-contained unit? It should be possible to apply 1/4 and
have a resulting kernel that works properly without having to
apply the rest of your patches.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 5:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404020657.GA8520@gondor.apana.org.au>
Herbert Xu wrote:
> On Sat, Apr 03, 2010 at 11:19:04PM +0300, Timo Teräs wrote:
>> If someone is then removing a net driver, we still execute
>> flush on the 'device down' hook, and all stale bundles
>> get flushed.
>
> Not if the bundle belongs to a policy recently deleted.
>
>> But yes, this means that xfrm_policy struct can now be held
>> allocated up to ten extra minutes. But it's only memory that
>> it's holding, not any extra refs. And it's still reclaimable
>> by the GC.
>
> You also hold down the bundle xdst's along with it, which can
> hold netdev references preventing modules from being unloaded.
>
>> If this feels troublesome, we could add asynchronous flush
>> request that would be called on policy removal. Or even stick
>> to the synchronous one.
>
> How about change xfrm_flush_bundles to flush bundles from the
> cache instead of xfrm_policy?
For the common case:
1. Policy deleted; policy->walk.dead set, policy->genid incremented
2. NETDEV_DOWN hook called, calls flow_cache_flush()
3. flow_cache_flush enumerates all policy and bundle refs
in it's cache
4. for each bundle xfrm_bundle_check_fce() is called, which
calls stale_bundle()
5. all bundles using stale policy, fail that check because
xdst->policy_genid != xdst->pols[0]->genid
(checked in xfrm_bundle_ok)
6. flow cache calls entry's ->delete which is dst_free for bundles
7. flow_cache_flush() returns
flow_cache_flush really frees the bundles in it on flush.
But now that I look my code again. Your statement is true for
per-socket bundles. They would not get deleted in this case.
I'll change NETDEV_DOWN to call garbage collect instead of
flow cache flush which will then also free the per-socket bundles.
^ permalink raw reply
* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 2:06 UTC (permalink / raw)
To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB7A2B8.4040405@iki.fi>
On Sat, Apr 03, 2010 at 11:19:04PM +0300, Timo Teräs wrote:
>
> If someone is then removing a net driver, we still execute
> flush on the 'device down' hook, and all stale bundles
> get flushed.
Not if the bundle belongs to a policy recently deleted.
> But yes, this means that xfrm_policy struct can now be held
> allocated up to ten extra minutes. But it's only memory that
> it's holding, not any extra refs. And it's still reclaimable
> by the GC.
You also hold down the bundle xdst's along with it, which can
hold netdev references preventing modules from being unloaded.
> If this feels troublesome, we could add asynchronous flush
> request that would be called on policy removal. Or even stick
> to the synchronous one.
How about change xfrm_flush_bundles to flush bundles from the
cache instead of xfrm_policy?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* [PATCH 3/3] IPv6: Generic TTL Security Mechanism (unified version)
From: Stephen Hemminger @ 2010-04-03 23:21 UTC (permalink / raw)
To: davem, Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard; +Cc: netdev
In-Reply-To: <20100403232103.923025940@vyatta.com>
[-- Attachment #1: gtsm-ipv6.diff --]
[-- Type: text/plain, Size: 3126 bytes --]
This patch is one alternative IPv6 support for RFC5082 Generalized TTL
Security Mechanism.
This version takes a simplest (but least pure) approach.
It uses the same socket option for IPv6 as IPv4 because
the TCP code has to deal with mapped addresses already.
With this method, the server doesn't have to deal with both IPv4 and IPv6
socket options. But the client still does have to handle the different
options.
On client:
int ttl = 255;
getaddrinfo(argv[1], argv[2], &hint, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET) {
setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
} else if (rp->ai_family == AF_INET6) {
setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
&ttl, sizeof(ttl)))
}
if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
...
On server:
unsigned char minttl = 255 - maxhops;
getaddrinfo(NULL, port, &hints, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
break
..
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
net/ipv4/tcp_ipv4.c | 15 +++++++++++----
net/ipv6/tcp_ipv6.c | 10 ++++++++++
2 files changed, 21 insertions(+), 4 deletions(-)
--- a/net/ipv6/tcp_ipv6.c 2010-04-02 21:19:39.692013672 -0700
+++ b/net/ipv6/tcp_ipv6.c 2010-04-03 15:55:43.778224848 -0700
@@ -349,6 +349,11 @@ static void tcp_v6_err(struct sk_buff *s
if (sk->sk_state == TCP_CLOSE)
goto out;
+ if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto out;
+ }
+
tp = tcp_sk(sk);
seq = ntohl(th->seq);
if (sk->sk_state != TCP_LISTEN &&
@@ -1717,6 +1722,11 @@ process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
+ if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+ }
+
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
--- a/net/ipv4/tcp_ipv4.c 2010-04-02 21:19:39.682014278 -0700
+++ b/net/ipv4/tcp_ipv4.c 2010-04-02 21:20:25.571077252 -0700
@@ -1660,10 +1660,14 @@ process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
- if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
- }
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+ if (skb->protocol == htons(ETH_P_IPV6)) {
+ if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl)
+ goto min_ttl_discard;
+ } else
+#endif
+ if (iph->ttl < inet_sk(sk)->min_ttl)
+ goto min_ttl_discard;
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
@@ -1716,6 +1720,9 @@ discard_it:
kfree_skb(skb);
return 0;
+min_ttl_discard:
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+
discard_and_relse:
sock_put(sk);
goto discard_it;
--
^ permalink raw reply
* [PATCH 2/3] IPv6: Generic TTL Security Mechanism (alternate version)
From: Stephen Hemminger @ 2010-04-03 23:21 UTC (permalink / raw)
To: davem, Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard; +Cc: netdev
In-Reply-To: <20100403232103.923025940@vyatta.com>
[-- Attachment #1: gtsm-ipv6-ipv4-1a.diff --]
[-- Type: text/plain, Size: 4615 bytes --]
This patch adds IPv6 support for RFC5082 Generalized TTL
Security Mechanism.
Very similar to original proposed code; the IPV6 and IPV4
socket options are seperate, but the setting the IPV6 min hopcount
also sets IPV4 min ttl. This makes for less possible errors by the
programmer when using IPV4 mapped addresses.
With this method, the server does have to deal with both IPv4 and IPv6
socket options and the client has to handle the different for each
family.
On client:
int ttl = 255;
getaddrinfo(argv[1], argv[2], &hint, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET) {
setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
} else if (rp->ai_family == AF_INET6) {
setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
&ttl, sizeof(ttl)))
}
if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
...
On server:
int minttl = 255 - maxhops;
getaddrinfo(NULL, port, &hints, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET6)
setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
&minttl, sizeof(minttl));
else
setsockopt(s, IPPROTO_IP, IP_MINTTL,
&minttl, sizeof(minttl));
if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
break
..
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
include/linux/in6.h | 3 +++
include/linux/ipv6.h | 1 +
net/ipv6/ipv6_sockglue.c | 13 +++++++++++++
net/ipv6/tcp_ipv6.c | 14 +++++++++++++-
4 files changed, 30 insertions(+), 1 deletion(-)
--- a/net/ipv6/tcp_ipv6.c 2010-04-03 16:13:50.450038495 -0700
+++ b/net/ipv6/tcp_ipv6.c 2010-04-03 16:14:58.579101059 -0700
@@ -349,6 +349,11 @@ static void tcp_v6_err(struct sk_buff *s
if (sk->sk_state == TCP_CLOSE)
goto out;
+ if (ipv6_hdr(skb)->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto out;
+ }
+
tp = tcp_sk(sk);
seq = ntohl(th->seq);
if (sk->sk_state != TCP_LISTEN &&
@@ -1675,6 +1680,7 @@ ipv6_pktoptions:
static int tcp_v6_rcv(struct sk_buff *skb)
{
struct tcphdr *th;
+ struct ipv6hdr *hdr;
struct sock *sk;
int ret;
struct net *net = dev_net(skb->dev);
@@ -1701,12 +1707,13 @@ static int tcp_v6_rcv(struct sk_buff *sk
goto bad_packet;
th = tcp_hdr(skb);
+ hdr = ipv6_hdr(skb);
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->flags = ipv6_get_dsfield(ipv6_hdr(skb));
+ TCP_SKB_CB(skb)->flags = ipv6_get_dsfield(hdr);
TCP_SKB_CB(skb)->sacked = 0;
sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
@@ -1717,6 +1724,11 @@ process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+ }
+
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
--- a/include/linux/in6.h 2010-04-03 16:13:50.460038505 -0700
+++ b/include/linux/in6.h 2010-04-03 16:14:58.579101059 -0700
@@ -265,6 +265,9 @@ struct in6_flowlabel_req {
#define IPV6_PREFER_SRC_CGA 0x0008
#define IPV6_PREFER_SRC_NONCGA 0x0800
+/* RFC5082: Generalized Ttl Security Mechanism */
+#define IPV6_MINHOPCOUNT 73
+
/*
* Multicast Routing:
* see include/linux/mroute6.h.
--- a/include/linux/ipv6.h 2010-04-03 16:13:50.470038394 -0700
+++ b/include/linux/ipv6.h 2010-04-03 16:14:58.579101059 -0700
@@ -348,6 +348,7 @@ struct ipv6_pinfo {
* 010: prefer public address
* 100: prefer care-of address
*/
+ __u8 min_hopcount;
__u8 tclass;
__u32 dst_cookie;
--- a/net/ipv6/ipv6_sockglue.c 2010-04-03 16:13:50.440037810 -0700
+++ b/net/ipv6/ipv6_sockglue.c 2010-04-03 16:16:45.973137844 -0700
@@ -766,6 +766,15 @@ pref_skip_coa:
break;
}
+ case IPV6_MINHOPCOUNT:
+ if (optlen < sizeof(int))
+ goto e_inval;
+ if (val < 0 || val > 255)
+ goto e_inval;
+ np->min_hopcount = val;
+ inet_sk(sk)->min_ttl = val;
+ retv = 0;
+ break;
}
release_sock(sk);
@@ -1114,6 +1123,10 @@ static int do_ipv6_getsockopt(struct soc
val |= IPV6_PREFER_SRC_HOME;
break;
+ case IPV6_MINHOPCOUNT:
+ val = np->min_hopcount;
+ break;
+
default:
return -ENOPROTOOPT;
}
--
^ permalink raw reply
* [PATCH 1/3] IPv6: Generic TTL Security Mechanism (original version)
From: Stephen Hemminger @ 2010-04-03 23:21 UTC (permalink / raw)
To: davem, Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard; +Cc: netdev
In-Reply-To: <20100403232103.923025940@vyatta.com>
[-- Attachment #1: gtsm-ipv6-ipv4-1.diff --]
[-- Type: text/plain, Size: 4384 bytes --]
This patch adds IPv6 support for RFC5082 Generalized TTL
Security Mechanism.
The original proposed code; the IPV6 and IPV4 socket options are seperate.
With this method, the server does have to deal with both IPv4 and IPv6
socket options and the client has to handle the different for each
family.
On client:
int ttl = 255;
getaddrinfo(argv[1], argv[2], &hint, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET) {
setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
} else if (rp->ai_family == AF_INET6) {
setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
&ttl, sizeof(ttl)))
}
if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
...
On server:
int minttl = 255 - maxhops;
getaddrinfo(NULL, port, &hints, &result);
for (rp = result; rp != NULL; rp = rp->ai_next) {
s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
if (s < 0) continue;
if (rp->ai_family == AF_INET6)
setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
&minttl, sizeof(minttl));
setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
break
..
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
include/linux/in6.h | 3 +++
include/linux/ipv6.h | 1 +
net/ipv6/ipv6_sockglue.c | 12 ++++++++++++
net/ipv6/tcp_ipv6.c | 14 +++++++++++++-
4 files changed, 29 insertions(+), 1 deletion(-)
--- a/net/ipv6/tcp_ipv6.c 2010-04-03 16:04:01.159412921 -0700
+++ b/net/ipv6/tcp_ipv6.c 2010-04-03 16:05:15.749413056 -0700
@@ -349,6 +349,11 @@ static void tcp_v6_err(struct sk_buff *s
if (sk->sk_state == TCP_CLOSE)
goto out;
+ if (ipv6_hdr(skb)->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto out;
+ }
+
tp = tcp_sk(sk);
seq = ntohl(th->seq);
if (sk->sk_state != TCP_LISTEN &&
@@ -1675,6 +1680,7 @@ ipv6_pktoptions:
static int tcp_v6_rcv(struct sk_buff *skb)
{
struct tcphdr *th;
+ struct ipv6hdr *hdr;
struct sock *sk;
int ret;
struct net *net = dev_net(skb->dev);
@@ -1701,12 +1707,13 @@ static int tcp_v6_rcv(struct sk_buff *sk
goto bad_packet;
th = tcp_hdr(skb);
+ hdr = ipv6_hdr(skb);
TCP_SKB_CB(skb)->seq = ntohl(th->seq);
TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
skb->len - th->doff*4);
TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->flags = ipv6_get_dsfield(ipv6_hdr(skb));
+ TCP_SKB_CB(skb)->flags = ipv6_get_dsfield(hdr);
TCP_SKB_CB(skb)->sacked = 0;
sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
@@ -1717,6 +1724,11 @@ process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+ }
+
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
--- a/include/linux/in6.h 2010-04-02 21:16:07.461701802 -0700
+++ b/include/linux/in6.h 2010-04-03 16:05:15.749413056 -0700
@@ -265,6 +265,9 @@ struct in6_flowlabel_req {
#define IPV6_PREFER_SRC_CGA 0x0008
#define IPV6_PREFER_SRC_NONCGA 0x0800
+/* RFC5082: Generalized Ttl Security Mechanism */
+#define IPV6_MINHOPCOUNT 73
+
/*
* Multicast Routing:
* see include/linux/mroute6.h.
--- a/include/linux/ipv6.h 2010-04-02 21:16:04.122638936 -0700
+++ b/include/linux/ipv6.h 2010-04-03 16:05:15.769412669 -0700
@@ -348,6 +348,7 @@ struct ipv6_pinfo {
* 010: prefer public address
* 100: prefer care-of address
*/
+ __u8 min_hopcount;
__u8 tclass;
__u32 dst_cookie;
--- a/net/ipv6/ipv6_sockglue.c 2010-04-02 21:15:26.071389733 -0700
+++ b/net/ipv6/ipv6_sockglue.c 2010-04-03 16:05:15.789412706 -0700
@@ -766,6 +766,14 @@ pref_skip_coa:
break;
}
+ case IPV6_MINHOPCOUNT:
+ if (optlen < sizeof(int))
+ goto e_inval;
+ if (val < 0 || val > 255)
+ goto e_inval;
+ np->min_hopcount = val;
+ retv = 0;
+ break;
}
release_sock(sk);
@@ -1114,6 +1122,10 @@ static int do_ipv6_getsockopt(struct soc
val |= IPV6_PREFER_SRC_HOME;
break;
+ case IPV6_MINHOPCOUNT:
+ val = np->min_hopcount;
+ break;
+
default:
return -ENOPROTOOPT;
}
--
^ permalink raw reply
* [PATCH 0/3] GTSM support for IPv6 three alternatives
From: Stephen Hemminger @ 2010-04-03 23:21 UTC (permalink / raw)
To: davem, Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard; +Cc: netdev
I have tested three alternative versions of this patch and can't
reach a diffinitive conclusion. I favor the unified version (simplest API),
but need more opinions. It would not be good if Linux implements
one option and BSD choose the other! That is why the original
BSD developer and the quagga patch author are receiving this.
In case it is not obvious, only one of these choices should be
applied!
--
^ permalink raw reply
* Re: [PATCH net-next-2.6] icmp: Account for ICMP out errors
From: David Miller @ 2010-04-03 22:09 UTC (permalink / raw)
To: eric.dumazet; +Cc: afleming, netdev
In-Reply-To: <1270192174.1936.33.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Apr 2010 09:09:34 +0200
> [PATCH net-next-2.6] icmp: Account for ICMP out errors
>
> When ip_append() fails because of socket limit or memory shortage,
> increment ICMP_MIB_OUTERRORS counter, so that "netstat -s" can report
> these errors.
>
> LANG=C netstat -s | grep "ICMP messages failed"
> 0 ICMP messages failed
>
> For IPV6, implement ICMP6_MIB_OUTERRORS counter as well.
>
> # grep Icmp6OutErrors /proc/net/dev_snmp6/*
> /proc/net/dev_snmp6/eth0:Icmp6OutErrors 0
> /proc/net/dev_snmp6/lo:Icmp6OutErrors 0
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox