* Re: dhcp client packet sniffing...
From: David Miller @ 2010-04-08 11:01 UTC (permalink / raw)
To: nhorman; +Cc: netdev, herbert
In-Reply-To: <20100408105951.GA16868@hmsreliant.think-freely.org>
From: Neil Horman <nhorman@tuxdriver.com>
Date: Thu, 8 Apr 2010 06:59:51 -0400
> Dave, sorry to be late to the discussion, but can you refesh us on
> why dhcp has to sniff every packet?
It wants to see DHCP packets from the server.
And it can't use ipv4 RAW sockets due to some limitations wrt. getting
at the link layer headers when using that interfaces.
^ permalink raw reply
* Re: dhcp client packet sniffing...
From: Neil Horman @ 2010-04-08 10:59 UTC (permalink / raw)
To: David Miller; +Cc: netdev, herbert
In-Reply-To: <20100408.035049.177640912.davem@davemloft.net>
On Thu, Apr 08, 2010 at 03:50:49AM -0700, David Miller wrote:
>
> This is an old topic, but looking at traces tonight I was reminded
> about it.
>
> dhcp clients sniff every packet in the system, the reason it does this
> and the things we can do to make it not have to do so have been
> discussed before. Actually, I don't remember where we got with
> that and if we were able to make it such that the dhcp client
> doesn't have to do this any more. Herbert?
>
Dave, sorry to be late to the discussion, but can you refesh us on why dhcp has
to sniff every packet?
Regards
Neil
^ permalink raw reply
* dhcp client packet sniffing...
From: David Miller @ 2010-04-08 10:50 UTC (permalink / raw)
To: netdev; +Cc: herbert
This is an old topic, but looking at traces tonight I was reminded
about it.
dhcp clients sniff every packet in the system, the reason it does this
and the things we can do to make it not have to do so have been
discussed before. Actually, I don't remember where we got with
that and if we were able to make it such that the dhcp client
doesn't have to do this any more. Herbert?
But, in any event, the fact of the matter is that currently it still
does on many machines.
This means every packet in the machine gets sniffed.
The DHCP client at least installs a socket filter that only accepts
the packets that the DHCP client is actually interested in.
The problem is that we clone the SKB and do some other operations
before running the socket filter.
I was thinking, what if we simply move the sk_filter() call up to
dev_queue_xmit_nit()? And if sk_filter() rejects we don't even need
to clone the packet.
^ permalink raw reply
* FEC driver: rcv is not +last
From: Matthias Kaehlcke @ 2010-04-08 10:40 UTC (permalink / raw)
To: netdev; +Cc: Sascha Hauer
hi,
i have problems with the FEC on a i.MX25 3-Stack board. the kernel is
v2.6.34-rc2 plus the following patch:
http://patchwork.ozlabs.org/patch/41235/
the following traces are generated at boot time:
FEC Ethernet Driver
fec: PHY @ 0x1, ID 0x20005ce1 -- unknown PHY!
...
eth0: config: auto-negotiation on, 100FDX, 100HDX, 10FDX, 10HDX.
...
FEC ENET: rcv is not +last
FEC ENET: rcv is not +last
FEC ENET: rcv is not +last
FEC ENET: rcv is not +last
...
the PHY of the board is a DP83840, which is not supported by the
driver. could this be the problem? i tried to make the kernel think
the DP83840 is a DP83848, which is supported, but the behaviour is the
same except the 'unknown PHY' warning.
any idea what could be wrong?
--
Matthias Kaehlcke
Embedded Linux Developer
Barcelona
You can't separate peace from freedom because no
one can be at peace unless he has his freedom
(Malcolm X)
.''`.
using free software / Debian GNU/Linux | http://debian.org : :' :
`. `'`
gpg --keyserver pgp.mit.edu --recv-keys 47D8E5D4 `-
^ permalink raw reply
* Re: IP/UDP encapsulation
From: ZioPRoTo (Saverio Proto) @ 2010-04-08 10:39 UTC (permalink / raw)
To: Eric Dumazet
Cc: Mark Smith, Gustavo F. Padovan, netdev, marco bonola,
Behling Mario, L. Aaron Kaplan
In-Reply-To: <1270722573.2215.47.camel@edumazet-laptop>
>> I'm a bit confused. How can tunnelling IP in UDP in IP be faster than IP in IP?
>>
>
> Maybe the 'gateway' doesnt handle IPIP at all ;)
Yes that's the point. IP in UDP is more supported. It has much an
easier way when your traffic travels over the Internet, and maybe you
have to pass some NAT. Some NAT will not handle at all IP in IP
packets.
IP in IP has of course less overhead but you can experience problems
and many network setups. This is why at Freifunk we are thinking of
developing this kernel module.
Regards
Saverio Proto
^ permalink raw reply
* Re: IP/UDP encapsulation
From: Eric Dumazet @ 2010-04-08 10:29 UTC (permalink / raw)
To: Mark Smith
Cc: Gustavo F. Padovan, netdev, marco bonola,
ZioPRoTo (Saverio Proto), Behling Mario, L. Aaron Kaplan
In-Reply-To: <20100408191831.08cd8d7b@opy.nosense.org>
Le jeudi 08 avril 2010 à 19:18 +0930, Mark Smith a écrit :
> On Thu, 8 Apr 2010 04:42:47 -0300
> "Gustavo F. Padovan" <gustavo@padovan.org> wrote:
>
> > Hi,
> >
> > I'm looking for some advice on that work. The Freifunk organization is
> > planning work on the IP/UDP encapsulation kernel module as a GSoC
> > project. The idea is to create a IP-in-UDP tunnel like we do for
> > IP-in-IP or IP-in-GRE tunnels. The only way to do that today is to use
> > some VPN software.
> >
> > The module will export its virtual interface through sockets and will
> > have support for the standard syscalls like the others encapsulation
> > modules.
> >
> > It will improve the performance of mesh networks that will we be able
> > to use IP-in-UDP rather than IP-in-IP.
>
> I'm a bit confused. How can tunnelling IP in UDP in IP be faster than IP in IP?
>
Maybe the 'gateway' doesnt handle IPIP at all ;)
Until 2.6.32, IPIP tunnels were not so scalable then UDP (RCU enabled)
ipip_rcv() was hitting a global rwlock
git describe 8f95dd63a2ab6fe7243c4f0bd2c3266e3a5525ab
v2.6.32-rc3-468-g8f95dd6
^ permalink raw reply
* Re: [PATCH v3 0/4] xfrm: add x86 CONFIG_COMPAT support
From: David Miller @ 2010-04-08 9:54 UTC (permalink / raw)
To: kaber; +Cc: fw, netdev, johannes
In-Reply-To: <4BBDA58B.8090301@trash.net>
From: Patrick McHardy <kaber@trash.net>
Date: Thu, 08 Apr 2010 11:44:43 +0200
> Either the kernel or the userspace programs have to be updated
> either way.
The only case in which we only have to change one side is if we add
the full set of compat support to the kernel.
If we take any other option (new XFRM numbers and new datastructures,
or only convert sendmsg() to do compat translations), it requires both
the kernel and userspace to change.
And the currently existing 32-bit binaries don't work on 64-bit
kernels because of something that cannot be classified any other way
than as being a kernel bug.
^ permalink raw reply
* Re: IP/UDP encapsulation
From: Mark Smith @ 2010-04-08 9:48 UTC (permalink / raw)
To: Gustavo F. Padovan
Cc: netdev, marco bonola, ZioPRoTo (Saverio Proto), Behling Mario,
L. Aaron Kaplan
In-Reply-To: <20100408074247.GA19798@vigoh>
On Thu, 8 Apr 2010 04:42:47 -0300
"Gustavo F. Padovan" <gustavo@padovan.org> wrote:
> Hi,
>
> I'm looking for some advice on that work. The Freifunk organization is
> planning work on the IP/UDP encapsulation kernel module as a GSoC
> project. The idea is to create a IP-in-UDP tunnel like we do for
> IP-in-IP or IP-in-GRE tunnels. The only way to do that today is to use
> some VPN software.
>
> The module will export its virtual interface through sockets and will
> have support for the standard syscalls like the others encapsulation
> modules.
>
> It will improve the performance of mesh networks that will we be able
> to use IP-in-UDP rather than IP-in-IP.
I'm a bit confused. How can tunnelling IP in UDP in IP be faster than IP in IP?
> So, instead of push all data to
> local gateway into the mesh all the data can be tunneled to a faster
> server and from there to the Internet. With all the data exiting with
> the same IP address (the fast server IP). That will improve bandwidth,
> especially for upload.
>
> Is such module acceptable for merge into the Linux Kernel?
>
> Any comments or suggestions to the module architecture and
> implementation? If you want more information about the module I can
> provide that.
>
> Regards,
>
> --
> Gustavo F. Padovan
> http://padovan.org
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v3 0/4] xfrm: add x86 CONFIG_COMPAT support
From: Patrick McHardy @ 2010-04-08 9:44 UTC (permalink / raw)
To: David Miller; +Cc: fw, netdev, johannes
In-Reply-To: <20100407.164842.54065324.davem@davemloft.net>
David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Wed, 07 Apr 2010 15:45:51 +0200
>
>> Florian Westphal wrote:
>>> David Miller <davem@davemloft.net> wrote:
>>>> From: Florian Westphal <fw@strlen.de>
>>>> Date: Tue, 6 Apr 2010 00:27:07 +0200
>>> [..]
>>>
>>>>> I sent a patch that solved this by adding a sys_compat_write syscall
>>>>> and a ->compat_aio_write() to struct file_operations to the
>>>>> vfs mailing list, but that patch was ignored by the vfs people,
>>>>> and the x86 folks did not exactly like the idea either.
>>>>>
>>>>> So this leaves three alternatives:
>>>>> 1 - drop the whole idea and keep the current status.
>>>>> 2 - Add new structure definitions (with new numbering) that would work
>>>>> everywhere, keep the old ones for backwards compatibility (This
>>>>> was suggested by Arnd Bergmann).
>> Given that there is only a quite small number of users of this
>> interface, that would in my opinion be the best way.
>
> Can you explain that line of reasoning?
>
> It's not that there are only "3 or 4 tools" using these interfaces,
> it's the fact that 32-bit binaries of those tools are on millions and
> millions of systems out there.
Yeah, but they're currently not working if used on 64 bit hosts,
so I don't think its unreasonable to consider just the number of
tools that need to be fixed. Either the kernel or the userspace
programs have to be updated either way.
^ permalink raw reply
* Re: [PATCH] tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb
From: Joe Perches @ 2010-04-08 9:20 UTC (permalink / raw)
To: David Miller; +Cc: netdev, herbert
In-Reply-To: <20100408.012617.98660541.davem@davemloft.net>
On Thu, 2010-04-08 at 01:26 -0700, David Miller wrote:
> Fix this by setting skb->ip_summed in the common non-data packet
> constructor. It already is setting skb->csum to zero.
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f181b78..00afbb0 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -349,6 +349,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
> */
> static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
> {
> + skb->ip_summed = CHECKSUM_PARTIAL;
> skb->csum = 0;
>
> TCP_SKB_CB(skb)->flags = flags;
There might be trivial value in using the
struct layout order for the sets avoiding
crossing cachelines.
from:
static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
{
skb->ip_summed = CHECKSUM_PARTIAL;
skb->csum = 0;
TCP_SKB_CB(skb)->flags = flags;
TCP_SKB_CB(skb)->sacked = 0;
skb_shinfo(skb)->gso_segs = 1;
skb_shinfo(skb)->gso_size = 0;
skb_shinfo(skb)->gso_type = 0;
TCP_SKB_CB(skb)->seq = seq;
if (flags & (TCPCB_FLAG_SYN | TCPCB_FLAG_FIN))
seq++;
TCP_SKB_CB(skb)->end_seq = seq;
}
to:
static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
{
skb->ip_summed = CHECKSUM_PARTIAL;
skb->csum = 0;
TCP_SKB_CB(skb)->seq = seq;
if (flags & (TCPCB_FLAG_SYN | TCPCB_FLAG_FIN))
seq++;
TCP_SKB_CB(skb)->end_seq = seq;
TCP_SKB_CB(skb)->sacked = 0;
TCP_SKB_CB(skb)->flags = flags;
skb_shinfo(skb)->gso_size = 0;
skb_shinfo(skb)->gso_segs = 1;
skb_shinfo(skb)->gso_type = 0;
}
^ permalink raw reply
* Re:[PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
From: xiaohui.xin @ 2010-04-08 9:07 UTC (permalink / raw)
To: mst; +Cc: netdev, kvm, linux-kernel, jdike, Xin Xiaohui
In-Reply-To: <20100407081800.GC9550@redhat.com>
From: Xin Xiaohui <xiaohui.xin@intel.com>
---
Michael,
This is a small patch for the write logging issue with async queue.
I have made a __vhost_get_vq_desc() func which may compute the log
info with any valid buffer index. The __vhost_get_vq_desc() is
coming from the code in vq_get_vq_desc().
And I use it to recompute the log info when logging is enabled.
Thanks
Xiaohui
drivers/vhost/net.c | 27 ++++++++---
drivers/vhost/vhost.c | 115 ++++++++++++++++++++++++++++---------------------
drivers/vhost/vhost.h | 5 ++
3 files changed, 90 insertions(+), 57 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2aafd90..00a45ef 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -115,7 +115,8 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
struct kiocb *iocb = NULL;
struct vhost_log *vq_log = NULL;
int rx_total_len = 0;
- int log, size;
+ unsigned int head, log, in, out;
+ int size;
if (vq->link_state != VHOST_VQ_LINK_ASYNC)
return;
@@ -130,14 +131,25 @@ static void handle_async_rx_events_notify(struct vhost_net *net,
iocb->ki_pos, iocb->ki_nbytes);
log = (int)iocb->ki_user_data;
size = iocb->ki_nbytes;
+ head = iocb->ki_pos;
rx_total_len += iocb->ki_nbytes;
if (iocb->ki_dtor)
iocb->ki_dtor(iocb);
kmem_cache_free(net->cache, iocb);
- if (unlikely(vq_log))
+ /* when log is enabled, recomputing the log info is needed,
+ * since these buffers are in async queue, and may not get
+ * the log info before.
+ */
+ if (unlikely(vq_log)) {
+ if (!log)
+ __vhost_get_vq_desc(&net->dev, vq, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in, vq_log,
+ &log, head);
vhost_log_write(vq, vq_log, log, size);
+ }
if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
@@ -313,14 +325,13 @@ static void handle_rx(struct vhost_net *net)
vhost_disable_notify(vq);
hdr_size = vq->hdr_size;
- /* In async cases, for write logging, the simple way is to get
- * the log info always, and really logging is decided later.
- * Thus, when logging enabled, we can get log, and when logging
- * disabled, we can get log disabled accordingly.
+ /* In async cases, when write log is enabled, in case the submitted
+ * buffers did not get log info before the log enabling, so we'd
+ * better recompute the log info when needed. We do this in
+ * handle_async_rx_events_notify().
*/
- vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) |
- (vq->link_state == VHOST_VQ_LINK_ASYNC) ?
+ vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
handle_async_rx_events_notify(net, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 97233d5..53dab80 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -715,66 +715,21 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
return 0;
}
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head)
{
struct vring_desc desc;
- unsigned int i, head, found = 0;
- u16 last_avail_idx;
+ unsigned int i = head, found = 0;
int ret;
- /* Check it isn't doing very strange things with descriptor numbers. */
- last_avail_idx = vq->last_avail_idx;
- if (get_user(vq->avail_idx, &vq->avail->idx)) {
- vq_err(vq, "Failed to access avail idx at %p\n",
- &vq->avail->idx);
- return vq->num;
- }
-
- if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
- vq_err(vq, "Guest moved used index from %u to %u",
- last_avail_idx, vq->avail_idx);
- return vq->num;
- }
-
- /* If there's nothing new since last we looked, return invalid. */
- if (vq->avail_idx == last_avail_idx)
- return vq->num;
-
- /* Only get avail ring entries after they have been exposed by guest. */
- rmb();
-
- /* Grab the next descriptor number they're advertising, and increment
- * the index we've seen. */
- if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
- vq_err(vq, "Failed to read head: idx %d address %p\n",
- last_avail_idx,
- &vq->avail->ring[last_avail_idx % vq->num]);
- return vq->num;
- }
-
- /* If their number is silly, that's an error. */
- if (head >= vq->num) {
- vq_err(vq, "Guest says index %u > %u is available",
- head, vq->num);
- return vq->num;
- }
-
/* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
if (unlikely(log))
*log_num = 0;
- i = head;
do {
unsigned iov_count = *in_num + *out_num;
if (i >= vq->num) {
@@ -833,8 +788,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
*out_num += ret;
}
} while ((i = next_desc(&desc)) != -1);
+ return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ struct vring_desc desc;
+ unsigned int i, head, found = 0;
+ u16 last_avail_idx;
+ unsigned int ret;
+
+ /* Check it isn't doing very strange things with descriptor numbers. */
+ last_avail_idx = vq->last_avail_idx;
+ if (get_user(vq->avail_idx, &vq->avail->idx)) {
+ vq_err(vq, "Failed to access avail idx at %p\n",
+ &vq->avail->idx);
+ return vq->num;
+ }
+
+ if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+ vq_err(vq, "Guest moved used index from %u to %u",
+ last_avail_idx, vq->avail_idx);
+ return vq->num;
+ }
+
+ /* If there's nothing new since last we looked, return invalid. */
+ if (vq->avail_idx == last_avail_idx)
+ return vq->num;
+
+ /* Only get avail ring entries after they have been exposed by guest. */
+ rmb();
+
+ /* Grab the next descriptor number they're advertising, and increment
+ * the index we've seen. */
+ if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+ vq_err(vq, "Failed to read head: idx %d address %p\n",
+ last_avail_idx,
+ &vq->avail->ring[last_avail_idx % vq->num]);
+ return vq->num;
+ }
+
+ /* If their number is silly, that's an error. */
+ if (head >= vq->num) {
+ vq_err(vq, "Guest says index %u > %u is available",
+ head, vq->num);
+ return vq->num;
+ }
+
+ ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, head);
/* On success, increment avail index. */
+ if (ret == vq->num)
+ return ret;
vq->last_avail_idx++;
return head;
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index cffe39a..a74a6d4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -132,6 +132,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head);
void vhost_discard_vq_desc(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
--
1.5.4.4
^ permalink raw reply related
* Re: [net-next-2.6 PATCH 4/4] vxge: Version update.
From: David Miller @ 2010-04-08 8:49 UTC (permalink / raw)
To: Sreenivasa.Honnur; +Cc: netdev, support
In-Reply-To: <Pine.GSO.4.10.11004080444280.365-100000@guinness>
From: Sreenivasa Honnur <Sreenivasa.Honnur@neterion.com>
Date: Thu, 8 Apr 2010 04:45:07 -0400 (EDT)
> - Version update.
>
> Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
> Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
Also applied, thanks.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 3/4] vxge: Pass correct number of VFs value to pci_sriov_enable().
From: David Miller @ 2010-04-08 8:49 UTC (permalink / raw)
To: Sreenivasa.Honnur; +Cc: netdev, support
In-Reply-To: <Pine.GSO.4.10.11004080443390.365-100000@guinness>
From: Sreenivasa Honnur <Sreenivasa.Honnur@neterion.com>
Date: Thu, 8 Apr 2010 04:44:26 -0400 (EDT)
> - max_config_dev loadable parameter is set to 0xFF by default. Pass correct
> number of VFs value to pci_sriov_enable() if max_config_dev is set to its
> default value.
>
> Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
> Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 2/4] vxge: Allow driver load for all enumerated pci functions.
From: David Miller @ 2010-04-08 8:48 UTC (permalink / raw)
To: Sreenivasa.Honnur; +Cc: netdev, support
In-Reply-To: <Pine.GSO.4.10.11004080442170.365-100000@guinness>
From: Sreenivasa Honnur <Sreenivasa.Honnur@neterion.com>
Date: Thu, 8 Apr 2010 04:43:37 -0400 (EDT)
> - Allow all instances of the driver be loaded when multiple pci functions are
> enumerated. The max_config_dev driver loadable option limits the driver
> load instances if required. The X3100's function configuration of single/multi
> function, SR and MR IOV allows the user to select the number of pci functions.
>
> Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
> Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
Applied.
^ permalink raw reply
* [net-next-2.6 PATCH 4/4] vxge: Version update.
From: Sreenivasa Honnur @ 2010-04-08 8:45 UTC (permalink / raw)
To: davem; +Cc: netdev, support
- Version update.
Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
---
diff -urpN patch3/drivers/net/vxge/vxge-version.h patch4/drivers/net/vxge/vxge-version.h
--- patch3/drivers/net/vxge/vxge-version.h 2010-04-01 12:13:17.000000000 +0530
+++ patch4/drivers/net/vxge/vxge-version.h 2010-04-07 11:39:50.000000000 +0530
@@ -17,7 +17,7 @@
#define VXGE_VERSION_MAJOR "2"
#define VXGE_VERSION_MINOR "0"
-#define VXGE_VERSION_FIX "7"
-#define VXGE_VERSION_BUILD "20144"
+#define VXGE_VERSION_FIX "8"
+#define VXGE_VERSION_BUILD "20182"
#define VXGE_VERSION_FOR "k"
#endif
^ permalink raw reply
* Re: net-next-2.6 PATCH 1/4] vxge: Fix a possible memory leak in vxge_hw_device_initialize().
From: David Miller @ 2010-04-08 8:44 UTC (permalink / raw)
To: Sreenivasa.Honnur; +Cc: netdev, support
In-Reply-To: <Pine.GSO.4.10.11004080441240.365-100000@guinness>
From: Sreenivasa Honnur <Sreenivasa.Honnur@neterion.com>
Date: Thu, 8 Apr 2010 04:42:02 -0400 (EDT)
> - Fix a possible memory leak in vxge_hw_device_initialize(). Free hldev if
> vxge_hw_device_reg_addr_get() fails.
>
> Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
> Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
Applied, thanks.
^ permalink raw reply
* [net-next-2.6 PATCH 3/4] vxge: Pass correct number of VFs value to pci_sriov_enable().
From: Sreenivasa Honnur @ 2010-04-08 8:44 UTC (permalink / raw)
To: davem; +Cc: netdev, support
- max_config_dev loadable parameter is set to 0xFF by default. Pass correct
number of VFs value to pci_sriov_enable() if max_config_dev is set to its
default value.
Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
---
diff -urpN patch2/drivers/net/vxge/vxge-config.h patch3/drivers/net/vxge/vxge-config.h
--- patch2/drivers/net/vxge/vxge-config.h 2010-04-01 12:06:55.000000000 +0530
+++ patch3/drivers/net/vxge/vxge-config.h 2010-04-01 14:20:19.000000000 +0530
@@ -764,10 +764,18 @@ struct vxge_hw_device_hw_info {
#define VXGE_HW_SR_VH_VIRTUAL_FUNCTION 6
#define VXGE_HW_VH_NORMAL_FUNCTION 7
u64 function_mode;
-#define VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION 0
-#define VXGE_HW_FUNCTION_MODE_SINGLE_FUNCTION 1
+#define VXGE_HW_FUNCTION_MODE_SINGLE_FUNCTION 0
+#define VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION 1
#define VXGE_HW_FUNCTION_MODE_SRIOV 2
#define VXGE_HW_FUNCTION_MODE_MRIOV 3
+#define VXGE_HW_FUNCTION_MODE_MRIOV_8 4
+#define VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION_17 5
+#define VXGE_HW_FUNCTION_MODE_SRIOV_8 6
+#define VXGE_HW_FUNCTION_MODE_SRIOV_4 7
+#define VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION_2 8
+#define VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION_4 9
+#define VXGE_HW_FUNCTION_MODE_MRIOV_4 10
+
u32 func_id;
u64 vpath_mask;
struct vxge_hw_device_version fw_version;
@@ -2265,4 +2273,6 @@ enum vxge_hw_status vxge_hw_vpath_rts_rt
struct vxge_hw_rth_hash_types *hash_type,
u16 bucket_size);
+enum vxge_hw_status
+__vxge_hw_device_is_privilaged(u32 host_type, u32 func_id);
#endif
diff -urpN patch2/drivers/net/vxge/vxge-main.c patch3/drivers/net/vxge/vxge-main.c
--- patch2/drivers/net/vxge/vxge-main.c 2010-04-01 12:12:50.000000000 +0530
+++ patch3/drivers/net/vxge/vxge-main.c 2010-04-01 14:33:30.000000000 +0530
@@ -3965,6 +3965,36 @@ static void vxge_io_resume(struct pci_de
netif_device_attach(netdev);
}
+static inline u32 vxge_get_num_vfs(u64 function_mode)
+{
+ u32 num_functions = 0;
+
+ switch (function_mode) {
+ case VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION:
+ case VXGE_HW_FUNCTION_MODE_SRIOV_8:
+ num_functions = 8;
+ break;
+ case VXGE_HW_FUNCTION_MODE_SINGLE_FUNCTION:
+ num_functions = 1;
+ break;
+ case VXGE_HW_FUNCTION_MODE_SRIOV:
+ case VXGE_HW_FUNCTION_MODE_MRIOV:
+ case VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION_17:
+ num_functions = 17;
+ break;
+ case VXGE_HW_FUNCTION_MODE_SRIOV_4:
+ num_functions = 4;
+ break;
+ case VXGE_HW_FUNCTION_MODE_MULTI_FUNCTION_2:
+ num_functions = 2;
+ break;
+ case VXGE_HW_FUNCTION_MODE_MRIOV_8:
+ num_functions = 8; /* TODO */
+ break;
+ }
+ return num_functions;
+}
+
/**
* vxge_probe
* @pdev : structure containing the PCI related information of the device.
@@ -3992,14 +4022,19 @@ vxge_probe(struct pci_dev *pdev, const s
u8 *macaddr;
struct vxge_mac_addrs *entry;
static int bus = -1, device = -1;
+ u32 host_type;
u8 new_device = 0;
+ enum vxge_hw_status is_privileged;
+ u32 function_mode;
+ u32 num_vfs = 0;
vxge_debug_entryexit(VXGE_TRACE, "%s:%d", __func__, __LINE__);
attr.pdev = pdev;
- if (bus != pdev->bus->number)
- new_device = 1;
- if (device != PCI_SLOT(pdev->devfn))
+ /* In SRIOV-17 mode, functions of the same adapter
+ * can be deployed on different buses */
+ if ((!pdev->is_virtfn) && ((bus != pdev->bus->number) ||
+ (device != PCI_SLOT(pdev->devfn))))
new_device = 1;
bus = pdev->bus->number;
@@ -4133,6 +4168,11 @@ vxge_probe(struct pci_dev *pdev, const s
"%s:%d Vpath mask = %llx", __func__, __LINE__,
(unsigned long long)vpath_mask);
+ function_mode = ll_config.device_hw_info.function_mode;
+ host_type = ll_config.device_hw_info.host_type;
+ is_privileged = __vxge_hw_device_is_privilaged(host_type,
+ ll_config.device_hw_info.func_id);
+
/* Check how many vpaths are available */
for (i = 0; i < VXGE_HW_MAX_VIRTUAL_PATHS; i++) {
if (!((vpath_mask) & vxge_mBIT(i)))
@@ -4140,14 +4180,18 @@ vxge_probe(struct pci_dev *pdev, const s
max_vpath_supported++;
}
+ if (new_device)
+ num_vfs = vxge_get_num_vfs(function_mode) - 1;
+
/* Enable SRIOV mode, if firmware has SRIOV support and if it is a PF */
- if ((VXGE_HW_FUNCTION_MODE_SRIOV ==
- ll_config.device_hw_info.function_mode) &&
- (max_config_dev > 1) && (pdev->is_physfn)) {
- ret = pci_enable_sriov(pdev, max_config_dev - 1);
- if (ret)
- vxge_debug_ll_config(VXGE_ERR,
- "Failed to enable SRIOV: %d\n", ret);
+ if (is_sriov(function_mode) && (max_config_dev > 1) &&
+ (ll_config.intr_type != INTA) &&
+ (is_privileged == VXGE_HW_OK)) {
+ ret = pci_enable_sriov(pdev, ((max_config_dev - 1) < num_vfs)
+ ? (max_config_dev - 1) : num_vfs);
+ if (ret)
+ vxge_debug_ll_config(VXGE_ERR,
+ "Failed in enabling SRIOV mode: %d\n", ret);
}
/*
diff -urpN patch2/drivers/net/vxge/vxge-main.h patch3/drivers/net/vxge/vxge-main.h
--- patch2/drivers/net/vxge/vxge-main.h 2010-04-01 12:06:55.000000000 +0530
+++ patch3/drivers/net/vxge/vxge-main.h 2010-04-01 13:52:56.000000000 +0530
@@ -90,6 +90,11 @@
#define VXGE_LL_MAX_FRAME_SIZE(dev) ((dev)->mtu + VXGE_HW_MAC_HEADER_MAX_SIZE)
+#define is_sriov(function_mode) \
+ ((function_mode == VXGE_HW_FUNCTION_MODE_SRIOV) || \
+ (function_mode == VXGE_HW_FUNCTION_MODE_SRIOV_8) || \
+ (function_mode == VXGE_HW_FUNCTION_MODE_SRIOV_4))
+
enum vxge_reset_event {
/* reset events */
VXGE_LL_VPATH_RESET = 0,
^ permalink raw reply
* [net-next-2.6 PATCH 2/4] vxge: Allow driver load for all enumerated pci functions.
From: Sreenivasa Honnur @ 2010-04-08 8:43 UTC (permalink / raw)
To: davem; +Cc: netdev, support
- Allow all instances of the driver be loaded when multiple pci functions are
enumerated. The max_config_dev driver loadable option limits the driver
load instances if required. The X3100's function configuration of single/multi
function, SR and MR IOV allows the user to select the number of pci functions.
Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
---
diff -urpN patch1/drivers/net/vxge/vxge-main.c patch2/drivers/net/vxge/vxge-main.c
--- patch1/drivers/net/vxge/vxge-main.c 2010-04-01 12:04:00.000000000 +0530
+++ patch2/drivers/net/vxge/vxge-main.c 2010-04-01 12:12:50.000000000 +0530
@@ -4016,9 +4016,11 @@ vxge_probe(struct pci_dev *pdev, const s
driver_config->total_dev_cnt);
driver_config->config_dev_cnt = 0;
driver_config->total_dev_cnt = 0;
- driver_config->g_no_cpus = 0;
}
-
+ /* Now making the CPU based no of vpath calculation
+ * applicable for individual functions as well.
+ */
+ driver_config->g_no_cpus = 0;
driver_config->vpath_per_dev = max_config_vpath;
driver_config->total_dev_cnt++;
^ permalink raw reply
* net-next-2.6 PATCH 1/4] vxge: Fix a possible memory leak in vxge_hw_device_initialize().
From: Sreenivasa Honnur @ 2010-04-08 8:42 UTC (permalink / raw)
To: davem; +Cc: netdev, support
- Fix a possible memory leak in vxge_hw_device_initialize(). Free hldev if
vxge_hw_device_reg_addr_get() fails.
Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@exar.com>
Signed-off-by: Ramkrishna Vepa <ram.vepa@exar.com>
---
diff -urpN orig/drivers/net/vxge/vxge-config.c patch1/drivers/net/vxge/vxge-config.c
--- orig/drivers/net/vxge/vxge-config.c 2010-04-01 12:03:33.000000000 +0530
+++ patch1/drivers/net/vxge/vxge-config.c 2010-04-01 12:05:55.000000000 +0530
@@ -634,8 +634,10 @@ vxge_hw_device_initialize(
__vxge_hw_device_pci_e_init(hldev);
status = __vxge_hw_device_reg_addr_get(hldev);
- if (status != VXGE_HW_OK)
+ if (status != VXGE_HW_OK) {
+ vfree(hldev);
goto exit;
+ }
__vxge_hw_device_id_get(hldev);
__vxge_hw_device_host_info_get(hldev);
^ permalink raw reply
* [PATCH v2] packet: support for TX time stamps on RAW sockets
From: Richard Cochran @ 2010-04-08 8:41 UTC (permalink / raw)
To: netdev
Enable the SO_TIMESTAMPING socket infrastructure for raw packet sockets.
We introduce PACKET_TX_TIMESTAMP for the control message cmsg_type.
Similar support for UDP and CAN sockets was added in commit
51f31cabe3ce5345b51e4a4f82138b38c4d5dc91
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
include/linux/if_packet.h | 1 +
net/packet/af_packet.c | 61 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 61 insertions(+), 1 deletions(-)
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index aa57a5f..6ac23ef 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -47,6 +47,7 @@ struct sockaddr_ll {
#define PACKET_TX_RING 13
#define PACKET_LOSS 14
#define PACKET_VNET_HDR 15
+#define PACKET_TX_TIMESTAMP 16
struct tpacket_stats {
unsigned int tp_packets;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b0f037c..1199b20 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -81,6 +81,7 @@
#include <linux/mutex.h>
#include <linux/if_vlan.h>
#include <linux/virtio_net.h>
+#include <linux/errqueue.h>
#ifdef CONFIG_INET
#include <net/inet_common.h>
@@ -314,6 +315,8 @@ static inline struct packet_sock *pkt_sk(struct sock *sk)
static void packet_sock_destruct(struct sock *sk)
{
+ skb_queue_purge(&sk->sk_error_queue);
+
WARN_ON(atomic_read(&sk->sk_rmem_alloc));
WARN_ON(atomic_read(&sk->sk_wmem_alloc));
@@ -482,6 +485,9 @@ retry:
skb->dev = dev;
skb->priority = sk->sk_priority;
skb->mark = sk->sk_mark;
+ err = sock_tx_timestamp(msg, sk, skb_tx(skb));
+ if (err < 0)
+ goto out_unlock;
dev_queue_xmit(skb);
rcu_read_unlock();
@@ -1187,6 +1193,9 @@ static int packet_snd(struct socket *sock,
err = skb_copy_datagram_from_iovec(skb, offset, msg->msg_iov, 0, len);
if (err)
goto out_free;
+ err = sock_tx_timestamp(msg, sk, skb_tx(skb));
+ if (err < 0)
+ goto out_free;
skb->protocol = proto;
skb->dev = dev;
@@ -1486,6 +1495,51 @@ out:
return err;
}
+static int packet_recv_error(struct sock *sk, struct msghdr *msg, int len)
+{
+ struct sock_exterr_skb *serr;
+ struct sk_buff *skb, *skb2;
+ int copied, err;
+
+ err = -EAGAIN;
+ skb = skb_dequeue(&sk->sk_error_queue);
+ if (skb == NULL)
+ goto out;
+
+ copied = skb->len;
+ if (copied > len) {
+ msg->msg_flags |= MSG_TRUNC;
+ copied = len;
+ }
+ err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+ if (err)
+ goto out_free_skb;
+
+ sock_recv_timestamp(msg, sk, skb);
+
+ serr = SKB_EXT_ERR(skb);
+ put_cmsg(msg, SOL_PACKET, PACKET_TX_TIMESTAMP,
+ sizeof(serr->ee), &serr->ee);
+
+ msg->msg_flags |= MSG_ERRQUEUE;
+ err = copied;
+
+ /* Reset and regenerate socket error */
+ spin_lock_bh(&sk->sk_error_queue.lock);
+ sk->sk_err = 0;
+ if ((skb2 = skb_peek(&sk->sk_error_queue)) != NULL) {
+ sk->sk_err = SKB_EXT_ERR(skb2)->ee.ee_errno;
+ spin_unlock_bh(&sk->sk_error_queue.lock);
+ sk->sk_error_report(sk);
+ } else
+ spin_unlock_bh(&sk->sk_error_queue.lock);
+
+out_free_skb:
+ kfree_skb(skb);
+out:
+ return err;
+}
+
/*
* Pull a packet from our receive queue and hand it to the user.
* If necessary we block.
@@ -1501,7 +1555,7 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
int vnet_hdr_len = 0;
err = -EINVAL;
- if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT))
+ if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
goto out;
#if 0
@@ -1510,6 +1564,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock,
return -ENODEV;
#endif
+ if (flags & MSG_ERRQUEUE) {
+ err = packet_recv_error(sk, msg, len);
+ goto out;
+ }
+
/*
* Call the generic datagram receiver. This handles all sorts
* of horrible races and re-entrancy so we can forget about it
--
1.6.0.4
^ permalink raw reply related
* Re: linux-next: powerpc boot failure
From: Timo Teräs @ 2010-04-08 8:40 UTC (permalink / raw)
To: Stephen Rothwell; +Cc: David Miller, netdev, linux-next, LKML
In-Reply-To: <20100408174549.2f45ceea.sfr@canb.auug.org.au>
Stephen Rothwell wrote:
> On Thu, 08 Apr 2010 10:29:49 +0300 Timo Teräs <timo.teras@iki.fi> wrote:
>> You don't probably have any xfrm policies then. And that code should not
>> really get executed.
>>
>> Some of the changes touch globally visible structs, and inline functions.
>> Was this a clean rebuild? And did you update all kernel modules, also in
>> the initramfs?
>
> Yes, the build is started from scratch and the kernel and modules are
> updated (this is our automated build and test system).
>
> I have attached the config in case that is of use.
It looks like my new code uses xfrm_pols_put assuming it always does the
proper thing. But seems like it's doing funny stuff if CONFIG_XFRM_SUB_POLICY
is not set, which is your case.
Can you try if this helps?
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 625dd61..cccb049 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -735,19 +735,12 @@ static inline void xfrm_pol_put(struct xfrm_policy *policy
xfrm_policy_destroy(policy);
}
-#ifdef CONFIG_XFRM_SUB_POLICY
static inline void xfrm_pols_put(struct xfrm_policy **pols, int npols)
{
int i;
for (i = npols - 1; i >= 0; --i)
xfrm_pol_put(pols[i]);
}
-#else
-static inline void xfrm_pols_put(struct xfrm_policy **pols, int npols)
-{
- xfrm_pol_put(pols[0]);
-}
-#endif
extern void __xfrm_state_destroy(struct xfrm_state *);
^ permalink raw reply related
* Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
From: Michael S. Tsirkin @ 2010-04-08 8:35 UTC (permalink / raw)
To: David Stevens; +Cc: kvm, kvm-owner, netdev, rusty, virtualization
In-Reply-To: <OF09358937.789A275B-ON882576FE.0067E654-882576FE.007405B5@us.ibm.com>
On Wed, Apr 07, 2010 at 02:07:18PM -0700, David Stevens wrote:
> kvm-owner@vger.kernel.org wrote on 04/07/2010 11:09:30 AM:
>
> > On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:
> > > >
> > > > Thanks!
> > > > There's some whitespace damage, are you sending with your new
> > > > sendmail setup? It seems to have worked for qemu patches ...
> > >
> > > Yes, I saw some line wraps in what I received, but I checked
> > > the original draft to be sure and they weren't there. Possibly from
> > > the relay; Sigh.
> > >
> > >
> > > > > @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
> > > > > /* TODO: Check specific error and bomb out unless ENOBUFS?
> */
> > > > > err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > > > if (unlikely(err < 0)) {
> > > > > - vhost_discard_vq_desc(vq);
> > > > > - tx_poll_start(net, sock);
> > > > > + if (err == -EAGAIN) {
> > > > > + vhost_discard_desc(vq, 1);
> > > > > + tx_poll_start(net, sock);
> > > > > + } else {
> > > > > + vq_err(vq, "sendmsg: errno %d\n", -err);
> > > > > + /* drop packet; do not discard/resend */
> > > > > + vhost_add_used_and_signal(&net->dev, vq, head,
> > > > > + 0);
> > > >
> > > > vhost does not currently has a consistent error handling strategy:
> > > > if we drop packets, need to think which other errors should cause
> > > > packet drops. I prefer to just call vq_err for now,
> > > > and have us look at handling segfaults etc in a consistent way
> > > > separately.
> > >
> > > I had to add this to avoid an infinite loop when I wrote a bad
> > > packet on the socket. I agree error handling needs a better look,
> > > but retrying a bad packet continuously while dumping in the log
> > > is what it was doing when I hit an error before this code. Isn't
> > > this better, at least until a second look?
> > >
> >
> > Hmm, what do you mean 'continuously'? Don't we only try again
> > on next kick?
>
> If the packet is corrupt (in my case, a missing vnet header
> during testing), every send will fail and we never make progress.
> I had thousands of error messages in the log (for the same packet)
> before I added this code.
Hmm, we do not want a buggy guest to be able to fill
host logs. This is only if debugging is enabled though, right?
We can also rate-limit the errors.
> If the problem is with the packet,
> retrying the same one as the original code does will never recover.
> This isn't required for mergeable rx buffer support, so I
> can certainly remove it from this patch, but I think the original
> error handling doesn't handle a single corrupted packet very
> gracefully.
>
An approach I considered was to have qemu poll vq_err fd
and stop the device when an error is seen. My concern with
dropping a tx packet is that it would make debugging
very difficult.
> > > > > @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
> > > > > vq_log = unlikely(vhost_has_feature(&net->dev,
> VHOST_F_LOG_ALL)) ?
> > > > > vq->log : NULL;
> > > > >
> > > > > - for (;;) {
> > > > > - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > > > - ARRAY_SIZE(vq->iov),
> > > > > - &out, &in,
> > > > > - vq_log, &log);
> > > > > + while ((datalen = vhost_head_len(sock->sk))) {
> > > > > + headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
> > > > > + vq_log, &log);
> > > >
> > > > This looks like a bug, I think we need to pass
> > > > datalen + header size to vhost_get_desc_n.
> > > > Not sure how we know the header size that backend will use though.
> > > > Maybe just look at our features.
> > >
> > > Yes; we have hdr_size, so I can add it here. It'll be 0 for
> > > the cases where the backend and guest both have vnet header (either
> > > the regular or larger mergeable buffers one), but should be added
> > > in for future raw socket support.
> >
> > So hdr_size is the wrong thing to add then.
> > We need to add non-zero value for tap now.
>
> datalen includes the vnet_hdr in the tap case, so we don't need
> a non-zero hdr_size. The socket data has the entire packet and vnet_hdr
> and that length is what we're getting from vhost_head_len().
I only see vhost_head_len returning skb->len. You are sure skb->len
includes vnet_hdr for tap rx?
> >
> > > >
> > > > > /* OK, now we need to know about added descriptors. */
> > > > > - if (head == vq->num) {
> > > > > - if (unlikely(vhost_enable_notify(vq))) {
> > > > > + if (!headcount) {
> > > > > + if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
> > > > > /* They have slipped one in as we were
> > > > > * doing that: check again. */
> > > > > vhost_disable_notify(vq);
> > > > > + retries++;
> > > > > continue;
> > > > > }
> > > >
> > > > Hmm. The reason we have the code at all, as the comment says, is
> because
> > > > guest could have added more buffers between the time we read last
> index
> > > > and the time we enabled notification. So if we just break like this
> > > > the race still exists. We could remember the
> > > > last head value we observed, and have vhost_enable_notify check
> > > > against this value?
> > >
> > > This is to prevent a spin loop in the case where we have some
> > > buffers available, but not enough for the current packet (ie, this
> > > is the replacement code for the "rxmaxheadcount" business). If they
> > > actually added something new, retrying once should see it, but what
> > > vhost_enable_notify() returns non-zero on is not "new buffers" but
> > > rather "not empty". In the case mentioned, we aren't empty, so
> > > vhost_enable_notify() returns nonzero every time, but the guest hasn't
> > > given us enough buffers to proceed, so we continuously retry; this
> > > code breaks the spin loop until we've really got new buffers from
> > > the guest.
> >
> > What about the race I point out above? It seems to potentially
> > cause a deadlock.
>
> It certainly handles a single race, since the code is identical
> when retries==0. I was assuming that a real update would always get us
> enough buffers, so could not happen twice in a row. The false case of
> having 1 buffer when we need 3, say, would return nonzero every time,
> so this code would detect that and break that loop; never hang if a
> real guest update gives us a full ring.
> If we think we've exhausted the ring and a guest update gives us
> a couple buffers, but not the complete ring (which is what it does in
> practice), then we'd still have a race. In that case, we should be
> comparing avail_idx with itself across multiple calls, rather than
> last_avail_idx (which is only updated when we post a new packet).
> I'll look at this some more. With the guest I have, this code
> will always work, but I don't know that the guest is required to fill
> the ring, which is what this assumes.
I think it is legal for a guest to add buffers one by one.
My suggesting for handling this is to cache last value of available
index we have seen so that vhost_enable_notify returns
true if it changed meanwhile.
Care needs to be taken when index is changed with an ioctl.
> >
> > > >
> > > > Need to think about it.
> > > >
> > > > Another concern here is that on retries vhost_get_desc_n
> > > > is doing extra work, rescanning the same descriptor
> > > > again and again. Not sure how common this is, might be
> > > > worthwhile to add a TODO to consider this at least.
> > >
> > > I had a printk in there to test the code and with the
> > > retries counter, it happens when we fill the ring (once,
> > > because of the retries checks), and then proceeds as
> > > desired when the guest gives us more buffers. Without the
> > > check, it spews until we hear from the guest.
> >
> > I don't understand whether what you write above means 'yes this happens
> > and is worth optimizing for' or 'this happens very rarely
> > and is not worth optimizing for'.
>
> I'm saying "no",
OK then.
> even with high load, we don't hit the
> retries==1 very often with the new code. It only happens when the ring
> is completely full. In that case, with the old code, we would spin until
> we
> hear from the guest (tens of thousands of times); with the new code,
> we hit it once when the ring is full and wait for the guest to give
> us more buffers; then we proceed.
> With the new code in a test run of tens of millions of packets,
> I got maybe 5 or 6 times where we exhausted the ring and waited, but
> with the old code, we did enable/disable tens of thousands of times
> because the ring wasn't completely empty and we were spinning until
> we got new buffers from the guest.
>
I presume under old code you mean some previous version of
merg. buffers patch? I don't see where current upstream would spin
waiting for guest (because a single descriptor is always
enough to fit a packet in), if we have this must fix ASAP.
> > > > > - vhost_discard_vq_desc(vq);
> > > > > + vhost_discard_desc(vq, headcount);
> > > > > break;
> > > > > }
> > > > > /* TODO: Should check and handle checksum. */
> > > > > + if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
> > > > > + struct virtio_net_hdr_mrg_rxbuf *vhdr =
> > > > > + (struct virtio_net_hdr_mrg_rxbuf *)
> > > > > + vq->iov[0].iov_base;
> > > > > + /* add num_buffers */
> > > > > + if (vhost_has_feature(&net->dev,
> > > > > + VHOST_NET_F_VIRTIO_NET_HDR))
> > > > > + hdr.num_buffers = headcount;
> > > >
> > > > Why is the above necessary?
> > >
> > > This adds mergeable buffer support for the raw case.
> >
> > So my suggestion is to have a copy of the header
> > and then same code will fill in the field for
> > both raw and tap.
>
> The recvmsg() has written the vnethdr into the user
> buffer in the tap case. We don't have an in-kernel copy of it at
> all (hdr_size == 0) in vhost. In the raw case, recvmsg isn't writing it
> to the user buffer and then we do have the local copy in hdr.
> So the code updates num_buffer (only) in user space in
> the tap case, and does the necessary copy of the entire header
> in the raw case. I don't think a double copy of the entire vnet_hdr
> on every packet is a good idea in the tap case,
Right. But what I suggested only copies the num_buffer.
> and the simple
> assignment with the existing copy_to_user() of the header is
> cheaper than double-copying num_buffers in the raw case.
Here I disagree. I think a single write to a buffer that
we know is in cache will be cheaper than a branch.
It's also less codelines :)
> So, I
> do think this code is best. It could use hdr_size instead of
> the (in-line) feature-bit check, but there are no unnecessary
> copies as-is, and I think there would be if we don't use the
> two separate ways of updating num_buffers.
A 2 byte copy is almost free.
> > >
> > > >
> > > > > + vq_err(vq, "tiny buffers < %d unsupported",
> > > > > + vq->iov[0].iov_len);
> > > > > + vhost_discard_desc(vq, headcount);
> > > > > + break;
> > > >
> > > > Problem here is that recvmsg might modify iov.
> > > > This is why I think we need to use vq->hdr here.
> > >
> > > rcvmsg() can never modify the iovec;
> >
> >
> > Look at af_packet code for an example where it does.
> > The pointer is non-const, it's OK to modify.
> > tap used to modify iovec as well, the fact it doesn't
> > now is a side-effect of reusing same code for
> > recvmsg and aio read.
>
> OK, even assuming it can, the check is done after
> the recvmsg -- it can't change between the length check
> and the put_user() of num_buffer, so I'm not sure what
> your concern is here. Are you saying that vq->iov may
> be trashed so that it no longer points to the ring buffer?
recvmsg updates at least the length.
> What I need is to write the num_buffer value at
> offset 10 in the userspace ring buffer after the recvmsg().
> If the iovec has been modified, I'm using the modified.
> If you're saying that it may be garbage, then your suggestion
> is that I save the pointer to offset 10 of the ring buffer
> before the call to recvmsg so I can update it after?
Essentially, yes. Upstream code (move_iovec_hdr) assumes that header
size we might need to save is same as size we hide from backend,
but here it is different. If we generalize move_iovec_hdr to something like
the below (warning: completely untested, likely has integer overflow
or other problems), then I think it will do what we want,
so that memcpy_toiovecend will work:
/* Copy len bytes, and pop the first pop_len bytes from iovec.
* Arguments must satisfy pop_len <= len.
* Return number of segments used. */
static int move_iovec_hdr(struct iovec *from, struct iovec *to,
size_t len, size_t pop_len, int iov_count)
{
int seg = 0;
size_t size;
while (len && seg < iov_count) {
size = min(from->iov_len, len);
to->iov_base = from->iov_base;
to->iov_len = size;
if (pop_len > 0) {
from->iov_len -= size;
from->iov_base += size;
pop_len -= size;
}
len -= size;
++from;
++to;
++seg;
}
return seg;
}
> > > To use vq->hdr, we'd have to copy
> > > it in from user space, modify it, and then copy it back
> > > in to user space, on every packet. In the normal tap case,
> > > hdr_size is 0 and we read it directly from the socket to
> > > user space. It is already correct for mergeable buffers,
> > > except for the num_buffers value added here.
> >
> >
> > I don't understand what you write above, sorry.
> > We have iov, all we need is store the first address+length
> > in the hdr field.
>
> Sorry, in that discussion, I was confusing "hdr" with vq->hdr.
> In the tap case, hdr_size is 0 and we have nothing in vq->hdr.
> As discussed before, if you mean updating a local copy of the
> header, we don't have a local copy of the header -- recvmsg()
> has written it directly to the user buffer. To get a local
> copy, we'd need to either copy_from_user() out of the ring or
> get it from recvmsg() by changing the iovec and then later
> do an extra copy of it to get it in the user buffer where we
> need it.
Yes but we only need to update 2 bytes in userspace memory.
Don't touch anything else.
> >
> > > >
> > > > How about using
> > > > memcpy_toiovecend(vq->hdr, &headcount,
> > > > offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers),
> > > > sizeof headcount);
> > > >
> > > > this way we also do not make any assumptions about layout.
> > >
> > > Because this overwrites the valid vnet header we got from
> > > the tap socket with a local copy that isn't correct.
> >
> > It does? I think that this only writes out 2 bytes at an offset.
>
> Oh, sorry, I misread. vq->hdr doesn't have anything in it in
> the tap case, but something like this may eliminate the need to check
> iov_len > sizeof(*vhdr); will investigate.
>
>
> +-DLS
>
^ permalink raw reply
* [PATCH] tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb
From: David Miller @ 2010-04-08 8:26 UTC (permalink / raw)
To: netdev; +Cc: herbert
Back in commit 04a0551c87363f100b04d28d7a15a632b70e18e7
("loopback: Drop obsolete ip_summed setting") we stopped
setting CHECKSUM_UNNECESSARY in the loopback xmit.
This is because such a setting was a lie since it implies that the
checksum field of the packet is properly filled in.
Instead what happens normally is that CHECKSUM_PARTIAL is set and
skb->csum is calculated as needed.
But this was only happening for TCP data packets (via the
skb->ip_summed assignment done in tcp_sendmsg()). It doesn't
happen for non-data packets like ACKs etc.
Fix this by setting skb->ip_summed in the common non-data packet
constructor. It already is setting skb->csum to zero.
But this reminds us that we still have things like ip_output.c's
ip_dev_loopback_xmit() which sets skb->ip_summed to the value
CHECKSUM_UNNECESSARY, which Herbert's patch teaches us is not
valid. So we'll have to address that at some point too.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f181b78..00afbb0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -349,6 +349,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
*/
static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
{
+ skb->ip_summed = CHECKSUM_PARTIAL;
skb->csum = 0;
TCP_SKB_CB(skb)->flags = flags;
^ permalink raw reply related
* Re: hackbench regression due to commit 9dfc6e68bfe6e
From: Eric Dumazet @ 2010-04-08 8:09 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Christoph Lameter, Pekka Enberg, netdev, Tejun Heo, alex.shi,
linux-kernel@vger.kernel.org, Ma, Ling, Chen, Tim C,
Andrew Morton, Ingo Molnar
In-Reply-To: <1270713272.2215.6.camel@edumazet-laptop>
Le jeudi 08 avril 2010 à 09:54 +0200, Eric Dumazet a écrit :
> Le jeudi 08 avril 2010 à 15:54 +0800, Zhang, Yanmin a écrit :
>
> > If there are 2 nodes in the machine, processes on node 0 will contact MCH of
> > node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
> > a power-saving mode when all the cpus of node 1 are free. So the transactions
> > from MCH 1 to MCH 0 has a larger latency.
> >
>
> Hmm, thanks for the hint, I will investigate this.
Oh well,
perf timechart record &
Instant crash
Call Trace:
perf_trace_sched_switch+0xd5/0x120
schedule+0x6b5/0x860
retint_careful+0xd/0x21
RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422
^ permalink raw reply
* Re: IP/UDP encapsulation
From: Eric Dumazet @ 2010-04-08 7:58 UTC (permalink / raw)
To: Gustavo F. Padovan
Cc: netdev, marco bonola, ZioPRoTo (Saverio Proto), Behling Mario,
L. Aaron Kaplan
In-Reply-To: <20100408074247.GA19798@vigoh>
Le jeudi 08 avril 2010 à 04:42 -0300, Gustavo F. Padovan a écrit :
> Hi,
>
> I'm looking for some advice on that work. The Freifunk organization is
> planning work on the IP/UDP encapsulation kernel module as a GSoC
> project. The idea is to create a IP-in-UDP tunnel like we do for
> IP-in-IP or IP-in-GRE tunnels. The only way to do that today is to use
> some VPN software.
>
> The module will export its virtual interface through sockets and will
> have support for the standard syscalls like the others encapsulation
> modules.
>
> It will improve the performance of mesh networks that will we be able
> to use IP-in-UDP rather than IP-in-IP. So, instead of push all data to
> local gateway into the mesh all the data can be tunneled to a faster
> server and from there to the Internet. With all the data exiting with
> the same IP address (the fast server IP). That will improve bandwidth,
> especially for upload.
>
> Is such module acceptable for merge into the Linux Kernel?
>
> Any comments or suggestions to the module architecture and
> implementation? If you want more information about the module I can
> provide that.
>
> Regards,
>
My suggestion would be to take a look at l2tp, which uses udp
encapsulation.
net-next-2.6
net/l2tp/l2tp_core.c:1108: (udp_sk(sk))->encap_type = 0;
net/l2tp/l2tp_core.c:1304: enum l2tp_encap_type encap = L2TP_ENCAPTYPE_UDP;
net/l2tp/l2tp_core.c:1383: udp_sk(sk)->encap_type = UDP_ENCAP_L2TPINUDP;
net/l2tp/l2tp_core.h:148: enum l2tp_encap_type encap;
net/l2tp/l2tp_core.h:171: enum l2tp_encap_type encap;
net/ipv4/udp.c:1299: if (up->encap_type) {
net/ipv4/udp.c:1660: up->encap_type = val;
net/ipv4/udp.c:1747: val = up->encap_type;
include/linux/l2tp.h:81: L2TP_ATTR_ENCAP_TYPE, /* u16, enum l2tp_encap_type */
include/linux/l2tp.h:146:enum l2tp_encap_type {
include/linux/udp.h:63: __u16 encap_type; /* Is this an Encapsulation socket? */
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox