* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: Victor Mataré @ 2011-05-27 23:58 UTC (permalink / raw)
To: Julian Anastasov
Cc: David Miller, akpm, netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <alpine.LFD.2.00.1105270814360.1520@ja.ssi.bg>
On Friday, 27.05.2011 07:27:23 Julian Anastasov wrote:
>
> Hello,
>
> On Thu, 26 May 2011, Victor Mataré wrote:
>
> > Examining the host which now has 137.226.164.2 (used to have 137.226.164.13):
> >
> > # ip addr show dev eth0
> > 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
> > link/ether 00:e0:81:41:1f:e4 brd ff:ff:ff:ff:ff:ff
> > inet 137.226.164.2/24 brd 137.226.164.255 scope global eth0
> > inet 192.168.23.2/24 brd 137.226.164.255 scope global eth0:0
> > [...]
> >
> > Sorry, got confused with all the swapping. I'm *not* keeping the old address around, it's completely *gone*, from both ifconfig and ip. But still it's being used as arp src address. That's what this bug is about. Sorry for the confusion.
>
> It looks strange. Can you confirm the following things:
>
> - the kernel version
This host runs 2.6.36-hardened-r9. I'm not sure which vanilla release that's based on, but it's patched with grsec and PAX. However another host which exhibits the exact same behaviour runs 2.6.29-gentoo-r5. This one does not have hardened or grsec, but gentoo patches, so I'd assume this is neither a version- nor a patch-specific problem.
>
> - the order of 'ip' command used to add and change IPs on this box
ok - starting situation was 2 IPs: 137.226.164.13/24 (eth0) and 192.168.23.13/24 (eth0:0)
then I did "ifconfig eth0 137.226.164.2 netmask 255.255.255.0"
I'm not exactly sure what happened then, but the result was that "ip addr show dev eth0" still showed the old IP address on eth0, while ifconfig didn't. ifconfig was misbehaving in some way, which is why I checked the situation with the ip tool. Then I used ip to configure everything as intended, and now I have the situation described in this bug. Note that the server has been in production use for a week now despite that.
>
> - output of 'ip route list table local' after IPs are changed and
> before starting arping
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 192.168.23.0 dev eth0 proto kernel scope link src 192.168.23.2
local 192.168.23.2 dev eth0 proto kernel scope host src 192.168.23.2
local 137.226.164.2 dev eth0 proto kernel scope host src 137.226.164.2
local 137.226.164.13 dev eth0 proto kernel scope host src 137.226.164.13
broadcast 192.168.23.255 dev eth0 proto kernel scope link src 192.168.23.2
broadcast 137.226.164.255 dev eth0 proto kernel scope link src 137.226.164.2
broadcast 137.226.164.255 dev eth0 proto kernel scope link src 192.168.23.2
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
I guess that entry "local 137.226.164.13" shouldn't be there? But shouldn't that be removed automatically when I delete the IP from eth0?
>
> - output of 'strace arping', I assume it is using getsockname
> after UDP connect
# strace arping 137.226.164.13
[...]
socket(PF_PACKET, SOCK_DGRAM, 0) = 3
setuid(0) = 0
ioctl(3, SIOCGIFINDEX, {ifr_name="eth0", ifr_index=4}) = 0
ioctl(3, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
setsockopt(4, SOL_SOCKET, SO_BINDTODEVICE, "eth0\0", 5) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("137.226.164.13")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(44125), sin_addr=inet_addr("137.226.164.13")}, [16]) = 0
close(4) = 0
bind(3, {sa_family=AF_PACKET, proto=0x806, if4, pkttype=PACKET_HOST, addr(0)={0, }, 128) = 0
getsockname(3, {sa_family=AF_PACKET, proto=0x806, if4, pkttype=PACKET_HOST, addr(6)={1, 00e081411fe4}, [18]) = 0
[...] no reply [...]
compare that with:
# strace arping 137.226.164.3
[...]
socket(PF_PACKET, SOCK_DGRAM, 0) = 3
setuid(0) = 0
ioctl(3, SIOCGIFINDEX, {ifr_name="eth0", ifr_index=4}) = 0
ioctl(3, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
setsockopt(4, SOL_SOCKET, SO_BINDTODEVICE, "eth0\0", 5) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("137.226.164.3")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(45467), sin_addr=inet_addr("137.226.164.2")}, [16]) = 0
close(4) = 0
bind(3, {sa_family=AF_PACKET, proto=0x806, if4, pkttype=PACKET_HOST, addr(0)={0, }, 128) = 0
getsockname(3, {sa_family=AF_PACKET, proto=0x806, if4, pkttype=PACKET_HOST, addr(6)={1, 00e081411fe4}, [18]) = 0
[...] reply [...]
So that's the change in source address, and I guess it's due to the table above? Then this is more like a bug in the "ip" utility?
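The address lookup visible in the strace output (a UDP socket that is connected and then queried with getsockname) can be sketched roughly as below. This is a minimal illustration, not arping's actual source: the name probe_source_addr is hypothetical, and the SO_BINDTODEVICE and SO_DONTROUTE options seen in the trace are left out because they need a named interface and, for SO_BINDTODEVICE, root privileges.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel which source address it would pick
 * for traffic to dst. Connecting a UDP socket sends no packet; it only
 * runs the route lookup, and getsockname() then reports the chosen source.
 * Returns 0 on success, -1 on error.
 */
static int probe_source_addr(const char *dst, char *out, socklen_t outlen)
{
	struct sockaddr_in peer, self;
	socklen_t len = sizeof(self);
	int rc = -1;
	int fd;

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(1025);	/* same port as in the trace */
	if (inet_pton(AF_INET, dst, &peer.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&self, &len) == 0 &&
	    inet_ntop(AF_INET, &self.sin_addr, out, outlen) != NULL)
		rc = 0;

	close(fd);
	return rc;
}
```

If a "local" route for the target still exists, as in the table above, this lookup reports the target address itself as the source, which matches the first trace.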
>
> - any reason to use broadcast 137.226.164.255 for all addresses?
Nope, none at all. I didn't see that because I thought ifconfig and ip use sensible defaults. Well...
So thanks, looks like you're pointing in the right direction.
Victor
^ permalink raw reply
* [PATCH] Use unsigned variables for packet lengths in ip[6]_queue.
From: Dave Jones @ 2011-05-28 0:36 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20110419.204105.68144653.davem@davemloft.net>
On Tue, Apr 19, 2011 at 08:41:05PM -0700, David Miller wrote:
> From: Dave Jones <davej@redhat.com>
> Date: Tue, 19 Apr 2011 21:42:22 -0400
>
> > Not catastrophic, but ipqueue seems to be too trusting of what it gets
> > passed from userspace, and passes it on down to the page allocator,
> > where it will spew warnings if the page order is too high.
> >
> > __ipq_rcv_skb has several checks for lengths too small, but doesn't
> > seem to have any for oversized ones. I'm not sure what the maximum
> > we should check for is. I'll code up a diff if anyone has any ideas
> > on a sane maximum.
>
> Maybe the thing to do is to simply pass __GFP_NOWARN to nlmsg_new()
> in netlink_ack()?
>
> Anyone else have a better idea?
So I went back to this today, and found something that doesn't look right.
After adding some instrumentation, and re-running my tests, I found that
the reason we were blowing up with enormous allocations was that we
were passing down nlmsglen values like -1061109568
Is there any reason for that to be signed ?
The nlmsg_len entry of nlmsghdr is a u32, so I'm assuming this is a bug.
With the patch below, I haven't been able to reproduce the problem, but
I don't know if I've inadvertently broken some other behaviour somewhere
deeper in netlink where this is valid.
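For illustration, the value -1061109568 is what the bit pattern 0xc0c0c0c0 looks like when stored in a signed 32-bit int. The sketch below shows how a "too small" / "longer than the skb" check can let such a value through with signed variables and reject it with unsigned ones. It is an assumption about the general shape of the check, not a copy of ip_queue.c; HDR_LEN and both function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN ((size_t)16)	/* stand-in for sizeof(struct nlmsghdr) */

/* Length check with signed variables. Returns 1 if the message is accepted. */
static int len_ok_signed(uint32_t wire_len, int skblen)
{
	/*
	 * Out-of-range conversion to int is implementation-defined; common
	 * compilers wrap, so 0xc0c0c0c0 becomes -1061109568.
	 */
	int nlmsglen = (int)wire_len;

	/*
	 * The cast mirrors the implicit int -> size_t conversion performed
	 * when comparing against sizeof(): a negative value becomes huge,
	 * so the "too small" test does not fire.
	 */
	if ((size_t)nlmsglen < HDR_LEN)
		return 0;
	/* Signed comparison: a positive skblen is never below a negative length. */
	if (skblen < nlmsglen)
		return 0;
	return 1;
}

/* The same check with unsigned variables, as in the patch below. */
static int len_ok_unsigned(uint32_t wire_len, unsigned int skblen)
{
	unsigned int nlmsglen = wire_len;

	if (nlmsglen < HDR_LEN)
		return 0;
	if (skblen < nlmsglen)
		return 0;
	return 1;
}
```

With a 64-byte skb, the signed variant accepts a claimed length of 0xc0c0c0c0 while the unsigned variant rejects it; both accept an ordinary length such as 32.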
Dave
--
Netlink message lengths can't be negative, so use unsigned variables.
Signed-off-by: Dave Jones <davej@redhat.com>
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
index d2c1311..f7f9bd7 100644
--- a/net/ipv4/netfilter/ip_queue.c
+++ b/net/ipv4/netfilter/ip_queue.c
@@ -402,7 +402,8 @@ ipq_dev_drop(int ifindex)
static inline void
__ipq_rcv_skb(struct sk_buff *skb)
{
- int status, type, pid, flags, nlmsglen, skblen;
+ int status, type, pid, flags;
+ unsigned int nlmsglen, skblen;
struct nlmsghdr *nlh;
skblen = skb->len;
diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
index 413ab07..065fe40 100644
--- a/net/ipv6/netfilter/ip6_queue.c
+++ b/net/ipv6/netfilter/ip6_queue.c
@@ -403,7 +403,8 @@ ipq_dev_drop(int ifindex)
static inline void
__ipq_rcv_skb(struct sk_buff *skb)
{
- int status, type, pid, flags, nlmsglen, skblen;
+ int status, type, pid, flags;
+ unsigned int nlmsglen, skblen;
struct nlmsghdr *nlh;
skblen = skb->len;
^ permalink raw reply related
* RE: [PATCH 2/6] unicore32: add pkunity-v3 mac/net driver (umal)
From: Guan Xuetao @ 2011-05-28 2:52 UTC (permalink / raw)
To: 'Arnd Bergmann'; +Cc: linux-kernel, linux-arch, greg, netdev
In-Reply-To: <201105271119.41323.arnd@arndb.de>
> -----Original Message-----
> From: Arnd Bergmann [mailto:arnd@arndb.de]
> Sent: Friday, May 27, 2011 5:20 PM
> To: GuanXuetao
> Cc: linux-kernel@vger.kernel.org; linux-arch@vger.kernel.org; greg@kroah.com; netdev@vger.kernel.org
> Subject: Re: [PATCH 2/6] unicore32: add pkunity-v3 mac/net driver (umal)
>
> On Thursday 26 May 2011, GuanXuetao wrote:
> > From: Guan Xuetao <gxt@mprc.pku.edu.cn>
> >
> > Signed-off-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
> > ---
> > MAINTAINERS | 1 +
> > arch/unicore32/configs/debug_defconfig | 2 +-
> > drivers/net/Kconfig | 5 +
> > drivers/net/Makefile | 1 +
> > drivers/net/mac-puv3.c | 1942 ++++++++++++++++++++++++++++++++
> > 5 files changed, 1950 insertions(+), 1 deletions(-)
> > create mode 100644 drivers/net/mac-puv3.c
>
> I also have a few comments after looking through the driver.
>
> > +
> > +/**********************************************************************
> > + * Globals
> > + ********************************************************************* */
>
> Regular commenting style would be
>
> /*
> * Globals
> */
>
> > +/**********************************************************************
> > + * Prototypes
> > + ********************************************************************* */
> > +static int umal_mii_reset(struct mii_bus *bus);
> > +static int umal_mii_read(struct mii_bus *bus, int phyaddr, int regidx);
> > +static int umal_mii_write(struct mii_bus *bus, int phyaddr, int regidx,
> > + u16 val);
> > +static int umal_mii_probe(struct net_device *dev);
> > +
> > +static void umaldma_initctx(struct umaldma *d, struct umal_softc *s,
> > + int rxtx, int maxdescr);
> > +static void umaldma_uninitctx(struct umaldma *d);
> > +static void umaldma_channel_start(struct umaldma *d, int rxtx);
> > +static void umaldma_channel_stop(struct umaldma *d);
> > +static int umaldma_add_rcvbuffer(struct umal_softc *sc, struct umaldma *d,
> > + struct sk_buff *m);
> > +static int umaldma_add_txbuffer(struct umaldma *d, struct sk_buff *m);
> > +static void umaldma_emptyring(struct umaldma *d);
> > +static void umaldma_fillring(struct umal_softc *sc, struct umaldma *d);
> > +static int umaldma_rx_process(struct umal_softc *sc, struct umaldma *d,
> > + int work_to_do, int poll);
> > +static void umaldma_tx_process(struct umal_softc *sc, struct umaldma *d,
> > + int poll);
> > +
> > +static int umal_initctx(struct umal_softc *s);
> > +static void umal_uninitctx(struct umal_softc *s);
> > +static void umal_channel_start(struct umal_softc *s);
> > +static void umal_channel_stop(struct umal_softc *s);
> > +static enum umal_state umal_set_channel_state(struct umal_softc *,
> > + enum umal_state);
> > +
> > +static int umal_init(struct platform_device *pldev, long long base);
> > +static int umal_open(struct net_device *dev);
> > +static int umal_close(struct net_device *dev);
> > +static int umal_mii_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
> > +static irqreturn_t umal_intr(int irq, void *dev_instance);
> > +static void umal_clr_intr(struct net_device *dev);
> > +static int umal_start_tx(struct sk_buff *skb, struct net_device *dev);
> > +static void umal_tx_timeout(struct net_device *dev);
> > +static void umal_set_rx_mode(struct net_device *dev);
> > +static void umal_promiscuous_mode(struct umal_softc *sc, int onoff);
> > +static void umal_setmulti(struct umal_softc *sc);
> > +static int umal_set_speed(struct umal_softc *s, enum umal_speed speed);
> > +static int umal_set_duplex(struct umal_softc *s, enum umal_duplex duplex,
> > + enum umal_fc fc);
> > +static int umal_change_mtu(struct net_device *_dev, int new_mtu);
> > +static void umal_miipoll(struct net_device *dev);
>
> Best avoid all these prototypes. Instead, reorder the functions in the
> driver so you don't need them. That is the order in which reviewers expect
> them.
>
> > +/**********************************************************************
> > + * UMAL_MII_RESET(bus)
> > + *
> > + * Reset MII bus.
> > + *
> > + * Input parameters:
> > + * bus - MDIO bus handle
> > + *
> > + * Return value:
> > + * 0 if ok
> > + ********************************************************************* */
>
> For extended function documentation, use the kerneldoc style, e.g.
>
> /**
> * umal_mii_reset - reset MII bus
> *
> * @bus: MDIO bus handle
> *
> * Returns 0
> */
>
> See also Documentation/kernel-doc-nano-HOWTO.txt
>
> > +/**********************************************************************
> > + * UMALDMA_RX_PROCESS(sc,d,work_to_do,poll)
> > + *
> > + * Process "completed" receive buffers on the specified DMA channel.
> > + *
> > + * Input parameters:
> > + * sc - softc structure
> > + * d - DMA channel context
> > + * work_to_do - no. of packets to process before enabling interrupt
> > + * again (for NAPI)
> > + * poll - 1: using polling (for NAPI)
> > + *
> > + * Return value:
> > + * nothing
> > + ********************************************************************* */
> > +static int umaldma_rx_process(struct umal_softc *sc, struct umaldma *d,
> > + int work_to_do, int poll)
>
> It seems that you tried to convert the driver to NAPI but did not succeed,
> as this function is only called from the interrupt handler directly.
>
> There is usually a significant performance win from using NAPI, so you
> should better try again. If you had problems doing that, please ask
> on netdev.
>
> > +
> > +#ifdef CONFIG_CMDLINE_FORCE
> > + eaddr[0] = 0x00;
> > + eaddr[1] = 0x25;
> > + eaddr[2] = 0x9b;
> > + eaddr[3] = 0xff;
> > + eaddr[4] = 0x00;
> > + eaddr[5] = 0x00;
> > +#endif
> > +
> > + for (i = 0; i < 6; i++)
> > + dev->dev_addr[i] = eaddr[i];
>
> You can use random_ether_addr() to generate a working unique MAC address
> if the hardware does not provide one.
>
> Arnd
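As a rough userspace illustration of what a random but valid station address looks like: the first octet must have the multicast bit cleared and the locally administered bit set. The function name make_random_mac is hypothetical and rand() is only a stand-in for the kernel's random source; this is a sketch of the bit rules, not the kernel helper itself.

```c
#include <stdlib.h>

/* Fill addr with a random unicast, locally administered MAC address. */
static void make_random_mac(unsigned char addr[6])
{
	int i;

	for (i = 0; i < 6; i++)
		addr[i] = (unsigned char)(rand() & 0xff);

	addr[0] &= 0xfe;	/* clear the multicast (group) bit */
	addr[0] |= 0x02;	/* set the locally administered bit */
}
```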
Thanks Arnd.
I will redo this driver.
Guan Xuetao
^ permalink raw reply
* Re: [PATCH 2/2] net: make dev_disable_lro use physical device if passed a vlan dev (v2)
From: Ben Hutchings @ 2011-05-28 3:11 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <1306261869-7276-3-git-send-email-nhorman@tuxdriver.com>
On Tue, 2011-05-24 at 14:31 -0400, Neil Horman wrote:
> If the device passed into dev_disable_lro is a vlan, then repoint the dev
> pointer so that we actually modify the underlying physical device.
[...]
Thanks Neil, this looks good.
There seems to be a slightly weird bug remaining, in that NETIF_F_LRO
may remain set on the VLAN device. But that's really cosmetic and
should go away in 2.6.40 along with the old ethtool operations for
offload setting.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-05-28 5:41 UTC (permalink / raw)
To: Arun Sharma
Cc: David Miller, Maximilian Engelhardt, linux-kernel, netdev,
StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Ingo Molnar, Thomas Gleixner
In-Reply-To: <20110527211419.GA6793@dev1756.snc6.facebook.com>
On Friday, 27 May 2011 at 14:14 -0700, Arun Sharma wrote:
> The attached works for me for x86_64. Cc'ing Ingo/Thomas for comment.
>
> -Arun
>
> atomic: Refactor atomic_add_unless
>
> Commit 686a7e3 (inetpeer: fix race in unused_list manipulations)
> in net-2.6 added a atomic_add_unless_return() variant that tries
> to detect 0->1 transitions of an atomic reference count.
>
> This sounds like a generic functionality that could be expressed
> in terms of an __atomic_add_unless() that returned the old value
> instead of a bool.
>
> Signed-off-by: Arun Sharma <asharma@fb.com>
> ---
> arch/x86/include/asm/atomic.h | 22 ++++++++++++++++++----
> 1 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
> index 952a826..bbdbffe 100644
> --- a/arch/x86/include/asm/atomic.h
> +++ b/arch/x86/include/asm/atomic.h
> @@ -221,15 +221,15 @@ static inline int atomic_xchg(atomic_t *v, int new)
> }
>
> /**
> - * atomic_add_unless - add unless the number is already a given value
> + * __atomic_add_unless - add unless the number is already a given value
> * @v: pointer of type atomic_t
> * @a: the amount to add to v...
> * @u: ...unless v is equal to u.
> *
> * Atomically adds @a to @v, so long as @v was not already @u.
> - * Returns non-zero if @v was not @u, and zero otherwise.
> + * Returns the old value of v
> */
> -static inline int atomic_add_unless(atomic_t *v, int a, int u)
> +static inline int __atomic_add_unless(atomic_t *v, int a, int u)
> {
> int c, old;
> c = atomic_read(v);
> @@ -241,7 +241,21 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u)
> break;
> c = old;
> }
> - return c != (u);
> + return c;
> +}
> +
> +/**
> + * atomic_add_unless - add unless the number is already a given value
> + * @v: pointer of type atomic_t
> + * @a: the amount to add to v...
> + * @u: ...unless v is equal to u.
> + *
> + * Atomically adds @a to @v, so long as @v was not already @u.
> + * Returns non-zero if @v was not @u, and zero otherwise.
> + */
> +static inline int atomic_add_unless(atomic_t *v, int a, int u)
> +{
> + return __atomic_add_unless(v, a, u) != u;
> }
>
> #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
As I said, atomic_add_unless() has several implementations in various
arches. You must take care of all, not only x86.
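For readers without an x86 tree at hand, the refactoring under discussion can be sketched portably with C11 atomics. This is only an illustration of the idea (a helper that returns the old value, plus a boolean wrapper built on it); the names add_unless_old and add_unless are hypothetical, and this is not any architecture's kernel implementation.

```c
#include <stdatomic.h>

/*
 * Add a to *v unless *v equals u. Returns the value *v held before the
 * operation, whether or not the addition happened.
 */
static int add_unless_old(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u) {
		/* On failure, c is refreshed with the current value and we retry. */
		if (atomic_compare_exchange_weak(v, &c, c + a))
			break;
	}
	return c;
}

/* Boolean form: non-zero if the addition was performed, zero otherwise. */
static int add_unless(atomic_int *v, int a, int u)
{
	return add_unless_old(v, a, u) != u;
}
```

A caller that needs to spot a particular transition can compare the return value of add_unless_old() with the old value it is interested in, instead of needing a separate variant.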
^ permalink raw reply
* Re: [PATCH 1/2 v2] af-packet: Use existing netdev reference for bound sockets.
From: Eric Dumazet @ 2011-05-28 6:20 UTC (permalink / raw)
To: Ben Greear; +Cc: David Miller, netdev
In-Reply-To: <4DE00711.6070000@candelatech.com>
On Friday, 27 May 2011 at 13:18 -0700, Ben Greear wrote:
> On 05/27/2011 01:15 PM, David Miller wrote:
> > From: Eric Dumazet<eric.dumazet@gmail.com>
> > Date: Fri, 27 May 2011 22:08:41 +0200
> >
> >> On Thursday, 26 May 2011 at 21:11 -0700, Ben Greear wrote:
> >>> On 05/26/2011 08:42 PM, Eric Dumazet wrote:
> >>>> On Thursday, 26 May 2011 at 16:55 -0700, greearb@candelatech.com wrote:
> >>>
> >>>>> out_free:
> >>>>> kfree_skb(skb);
> >>>>> out_unlock:
> >>>>> - if (dev)
> >>>>> + if (dev&& need_rls_dev)
> >>>>> dev_put(dev);
> >>>>> out:
> >>>>> return err;
> >>>>
> >>>> Hmmm, I wonder why you want this Ben.
> >>>>
> >>>> IMHO this is buggy, because we can sleep in this function.
> >>>>
> >>>> We must take a ref on device (it's really cheap these days, now we have a
> >>>> percpu device refcnt)
> >>>
> >>> Why must you take the reference? And if we must, why isn't the
> >>> current code that assigns the prot_hook.dev without taking a
> >>> reference OK?
> >>>
> >>
> >> If we sleep, device can disappear under us.
> >>
> >> The only way to not take a reference is to hold rcu_read_lock(), but
> >> you're not allowed to sleep under rcu_read_lock().
> >
> > You still have not addressed Ben's point.
> >
> > Why is it ok for the po->prot_hook.dev handling to not take a
> > reference? It's been doing this forever. Ben is just borrowing this
> > behavior for his uses.
> >
> > After some more research I think it happens to be OK because
> > ->prot_hook.dev is used _only_ for pointer comparisons, it is never
> > actually dereferenced or used in any other way. Probably, we should
> > just use ->ifindex for this.
>
> It's easy enough to add a dev_hold() when I assign the skb instead
> of looking it up in my patch, but perhaps it would be cleaner over all to
> just hold a ref on the prot_hook.dev when it is originally assigned?
Problem is : if packet_notifier(NETDEV_DOWN|UNREGISTER) is run while we
sleep, what happens then ?
Normally, if we sleep a long time in tpacket_snd() after device ref
increment, and before dev_queue_xmit(), the unregister process can enter
the infamous msleep(250) loop in netdev_wait_allrefs(), but at least we
don't crash.
But if you don't take the reference, we can crash in dev_queue_xmit()
when dereferencing the freed netdev structure.
Please check commit 1a35ca80c1db7 (packet: dont call sleeping functions
while holding rcu_read_lock()) for reference on possible problems.
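The argument can be illustrated with a toy reference count. Everything here is hypothetical (struct fake_dev, fake_dev_hold, fake_dev_put, send_path) and heavily simplified: the real unregister path waits for references to drain rather than releasing on the last put, and the "sleep" is simulated by dropping the registration reference inline.

```c
#include <stdatomic.h>

struct fake_dev {
	atomic_int refcnt;
	int released;		/* set once the last reference is dropped */
};

static void fake_dev_hold(struct fake_dev *d)
{
	atomic_fetch_add(&d->refcnt, 1);
}

static void fake_dev_put(struct fake_dev *d)
{
	/* atomic_fetch_sub returns the previous value; 1 means we were last. */
	if (atomic_fetch_sub(&d->refcnt, 1) == 1)
		d->released = 1;
}

/* Returns 1 if the device is still usable when the send path resumes. */
static int send_path(struct fake_dev *d, int take_ref)
{
	int alive;

	if (take_ref)
		fake_dev_hold(d);

	/* Stand-in for "we sleep and the device is unregistered meanwhile". */
	fake_dev_put(d);

	alive = !d->released;

	if (take_ref)
		fake_dev_put(d);
	return alive;
}
```

With take_ref set, the device survives until the send path drops its own reference; without it, the device is already released when the send path resumes.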
Thanks !
^ permalink raw reply
* Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Michał Mirosław @ 2011-05-28 7:35 UTC (permalink / raw)
To: Ben Hutchings; +Cc: David Miller, netdev, xen-devel
In-Reply-To: <1306538755.2759.31.camel@bwh-desktop>
On Sat, May 28, 2011 at 12:25:55AM +0100, Ben Hutchings wrote:
> On Fri, 2011-05-27 at 18:34 +0200, Michał Mirosław wrote:
> > On Fri, May 27, 2011 at 04:45:50PM +0100, Ben Hutchings wrote:
> > > On Fri, 2011-05-27 at 17:28 +0200, Michał Mirosław wrote:
> [...]
> > > > (note: ETHTOOL_S{SG,...} are not ever going away)
> > > > - causes NETIF_F_* to be an ABI
> > > If feature flag numbers are not stable then what is the point of
> > > /sys/class/net/<name>/features? Also, I'm not aware that features have
> > > ever been renumbered in the past.
> > Since no NETIF_F_* were exported earlier, I assume /sys/class/net/*/features
> > is a debugging aid. What is it used for besides that?
> xen-api <https://github.com/xen-org/xen-api> uses it in
> scripts/InterfaceReconfigureVswitch.py. Though it doesn't seem to be
> used for a particularly good reason...
Looks like it should use ETHTOOL_GFLAGS instead for netdev_has_vlan_accel().
Best Regards,
Michał Mirosław
[added Cc: xen-devel]
^ permalink raw reply
* Re: Section conflict compile failures in net
From: Michał Mirosław @ 2011-05-28 9:05 UTC (permalink / raw)
To: James Bottomley; +Cc: netdev, David Miller
In-Reply-To: <1306537477.12244.13.camel@mulgrave.site>
On Fri, May 27, 2011 at 06:04:37PM -0500, James Bottomley wrote:
> On Fri, 2011-05-27 at 10:07 +0200, Michał Mirosław wrote:
> > On Thu, May 26, 2011 at 04:39:53PM -0500, James Bottomley wrote:
> > > I'm now getting a ton of errors like this in git head:
> > >
> > > CC [M] drivers/net/3c59x.o
> > > CC [M] drivers/net/hp100.o
> > > CC [M] drivers/net/ne3210.o
> > > CC [M] drivers/net/3c509.o
> > > CC [M] drivers/net/depca.o
> > > drivers/net/ne3210.c:83: error: irq_map causes a section type conflict
> > > drivers/net/ne3210.c:85: error: shmem_map causes a section type conflict
> > > drivers/net/ne3210.c:89: error: ifmap_val causes a section type conflict
> > > drivers/net/ne3210.c:319: error: ne3210_ids causes a section type conflict
> > > make[2]: *** [drivers/net/ne3210.o] Error 1
> > > make[2]: *** Waiting for unfinished jobs....
> > > drivers/net/hp100.c:198: error: hp100_eisa_tbl causes a section type conflict
> > > drivers/net/hp100.c:211: error: hp100_pci_tbl causes a section type conflict
> > > make[2]: *** [drivers/net/hp100.o] Error 1
> > > drivers/net/depca.c:544: error: de1xx_irq causes a section type conflict
> > > drivers/net/depca.c:545: error: de2xx_irq causes a section type conflict
> > > drivers/net/depca.c:546: error: de422_irq causes a section type conflict
> > [...]
> >
> > Those three are only used in depca_hw_init() marked __devinit. What compiler
> > [flags] do you use to build this?
>
> It's a standard debian one.
>
> jejb@ion> hppa64-linux-gnu-gcc -v
> Using built-in specs.
> Target: hppa64-linux-gnu
> Configured with: ../src/configure --enable-languages=c --prefix=/usr
> --libexecdir=/usr/lib --disable-shared --disable-nls --disable-threads
> --disable-libffi --disable-libgomp --disable-libmudflap --disable-libssp
> --with-as=/usr/bin/hppa64-linux-gnu-as
> --with-ld=/usr/bin/hppa64-linux-gnu-ld
> --includedir=/usr/hppa64-linux-gnu/include --host=hppa-linux-gnu
> --build=hppa-linux-gnu --target=hppa64-linux-gnu
> Thread model: single
> gcc version 4.2.4 (Debian 4.2.4-6)
>
> the problem is definitely the depca_irq[i] in the loop ... replace that
> with a constant and the error goes away.
Looks like an arch-specific problem. Build test passes for x86 allyesconfig.
Best Regards,
Michał Mirosław
^ permalink raw reply
* Re: [Bugme-new] [Bug 35992] New: Regression: oops when using a bridge interface with tg3
From: Bernd Zeimetz @ 2011-05-28 8:58 UTC (permalink / raw)
To: Andrew Morton
Cc: netdev, bugzilla-daemon, bugme-daemon, Stephen Hemminger, bridge
In-Reply-To: <20110527152158.0453899d.akpm@linux-foundation.org>
On 05/28/2011 12:21 AM, Andrew Morton wrote:
> 2.6.38->2.6.39 regression, appears to be bridge-related. There's a
> partial screencap of the oops linked below.
The oops happens as soon as I add the lxc veth interface to the bridge; before
that, everything works as expected.
A new, full screenshot is attached
https://bugzilla.kernel.org/attachment.cgi?id=59732
(For yet unknown reasons I can't access the serial port redirection via ssh and
I won't be able to attach a real serial port cable before Monday, so I'm afraid
you have to work with the screenshot for now.)
> Bernd, it would be helpful if you could set the screen to more rows
> (50?) and then retake that photo. Documentation/svga.txt might help
> out. Thanks.
(the kernel says that vga=... is deprecated, guess that documentation is outdated?)
Thanks and cheers,
Bernd
--
Bernd Zeimetz Debian GNU/Linux Developer
http://bzed.de http://www.debian.org
GPG Fingerprint: ECA1 E3F2 8E11 2432 D485 DD95 EB36 171A 6FF9 435F
^ permalink raw reply
* Re: [Xen-devel] Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Ian Campbell @ 2011-05-28 10:07 UTC (permalink / raw)
To: Michał Mirosław
Cc: dev-yBygre7rU0TnMu66kgdUjQ,
xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xen-api-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR, Ben Hutchings,
David Miller
In-Reply-To: <20110528073525.GA19033-CoA6ZxLDdyEEUmgCuDUIdw@public.gmane.org>
On Sat, 2011-05-28 at 08:35 +0100, Michał Mirosław wrote:
> On Sat, May 28, 2011 at 12:25:55AM +0100, Ben Hutchings wrote:
> > On Fri, 2011-05-27 at 18:34 +0200, Michał Mirosław wrote:
> > > On Fri, May 27, 2011 at 04:45:50PM +0100, Ben Hutchings wrote:
> > > > On Fri, 2011-05-27 at 17:28 +0200, Michał Mirosław wrote:
> > [...]
> > > > > (note: ETHTOOL_S{SG,...} are not ever going away)
> > > > > - causes NETIF_F_* to be an ABI
> > > > If feature flag numbers are not stable then what is the point of
> > > > /sys/class/net/<name>/features? Also, I'm not aware that features have
> > > > ever been renumbered in the past.
> > > Since no NETIF_F_* were exported earlier, I assume /sys/class/net/*/features
> > > is a debugging aid. What is it used for besides that?
> > xen-api <https://github.com/xen-org/xen-api> uses it in
> > scripts/InterfaceReconfigureVswitch.py. Though it doesn't seem to be
> > used for a particularly good reason...
>
> Looks like it should use ETHTOOL_GFLAGS instead for netdev_has_vlan_accel().
>
> Best Regards,
> Michał Mirosław
>
> [added Cc: xen-devel]
added Cc: xen-api list and dev@openvswitch as well.
Complete thread is at
http://thread.gmane.org/gmane.linux.network/195552/focus=197019
Ian.
^ permalink raw reply
* Re: [Bugme-new] [Bug 35992] New: Regression: oops when using a bridge interface with tg3
From: Eric Dumazet @ 2011-05-28 10:13 UTC (permalink / raw)
To: Bernd Zeimetz
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Stephen Hemminger, bridge
In-Reply-To: <4DE0B91E.1080407@debian.org>
On Saturday, 28 May 2011 at 10:58 +0200, Bernd Zeimetz wrote:
> On 05/28/2011 12:21 AM, Andrew Morton wrote:
> > 2.6.38->2.6.39 regression, appears to be bridge-related. There's a
> > partial screencap of the oops linked below.
>
> The oops happens as soon as I add the lxc veth interface to the bridge, before
> everything works as expected.
>
>
> A new, full screenshot is attached
> https://bugzilla.kernel.org/attachment.cgi?id=59732
>
>
> (For yet unknown reasons I can't access the serial port redirection via ssh and
> I won't be able to attach a real serial port cable before Monday, so I'm afraid
> you have to work with the screenshot for now.)
>
>
> > Bernd, it would be helpful if you could set the screen to more rows
> > (50?) and then retake that photo. Documentation/svga.txt might help
> > out. Thanks.
>
> (the kernel says that vga=... is deprecated, guess that documentation is outdated?)
>
OK, this sounds like an already fixed bug.
(commit : 33eb9873a283a bridge: initialize fake_rtable metrics)
Could you try latest linux-2.6 tree ?
By the way, if the panic still happens, could you try netconsole?
Here I just add
"netconsole=4444@192.168.20.108/eth1,4444@192.168.20.112/00:1e:0b:ec:c3:e4" to my boot param
192.168.20.108 is my ip addr,
192.168.20.112 the ip addr of "remote machine",
00:1e:0b:ec:c3:e4 the mac addr of "remote machine"
On "remote machine" I start : netcat -l -u -p 4444 </dev/null
Thanks
^ permalink raw reply
* Re: [Bug 35992] Regression: oops when using a bridge interface with tg3
From: Bernd Zeimetz @ 2011-05-28 11:35 UTC (permalink / raw)
To: Eric Dumazet, Andrew Morton, netdev
Cc: bugzilla-daemon, bugme-daemon, bzed, Stephen Hemminger, bridge
In-Reply-To: <201105281013.p4SADfPS024949@demeter1.kernel.org>
Hi,
> OK, this sounds like an already fixed bug.
>
> (commit : 33eb9873a283a bridge: initialize fake_rtable metrics)
>
> Could you try latest linux-2.6 tree ?
I've picked the commit into 2.6.39 and it fixed the issue, thanks for the pointer.
Could we please get that included in 2.6.39.1?
Thanks,
Bernd
--
Bernd Zeimetz Debian GNU/Linux Developer
http://bzed.de http://www.debian.org
GPG Fingerprint: ECA1 E3F2 8E11 2432 D485 DD95 EB36 171A 6FF9 435F
^ permalink raw reply
* Re: [PATCH 1/2 v2] af-packet: Use existing netdev reference for bound sockets.
From: Ben Greear @ 2011-05-28 17:01 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1306563630.2533.25.camel@edumazet-laptop>
On 05/27/2011 11:20 PM, Eric Dumazet wrote:
> On Friday, 27 May 2011 at 13:18 -0700, Ben Greear wrote:
>> On 05/27/2011 01:15 PM, David Miller wrote:
>>> From: Eric Dumazet<eric.dumazet@gmail.com>
>>> Date: Fri, 27 May 2011 22:08:41 +0200
>>>
>>>> On Thursday, 26 May 2011 at 21:11 -0700, Ben Greear wrote:
>>>>> On 05/26/2011 08:42 PM, Eric Dumazet wrote:
>>>>>> On Thursday, 26 May 2011 at 16:55 -0700, greearb@candelatech.com wrote:
>>>>>
>>>>>>> out_free:
>>>>>>> kfree_skb(skb);
>>>>>>> out_unlock:
>>>>>>> - if (dev)
>>>>>>> + if (dev&& need_rls_dev)
>>>>>>> dev_put(dev);
>>>>>>> out:
>>>>>>> return err;
>>>>>>
>>>>>> Hmmm, I wonder why you want this Ben.
>>>>>>
>>>>>> IMHO this is buggy, because we can sleep in this function.
>>>>>>
>>>>>> We must take a ref on device (it's really cheap these days, now we have a
>>>>>> percpu device refcnt)
>>>>>
>>>>> Why must you take the reference? And if we must, why isn't the
>>>>> current code that assigns the prot_hook.dev without taking a
>>>>> reference OK?
>>>>>
>>>>
>>>> If we sleep, device can disappear under us.
>>>>
>>>> The only way to not take a reference is to hold rcu_read_lock(), but
>>>> you're not allowed to sleep under rcu_read_lock().
>>>
>>> You still have not addressed Ben's point.
>>>
>>> Why is it ok for the po->prot_hook.dev handling to not take a
>>> reference? It's been doing this forever. Ben is just borrowing this
>>> behavior for his uses.
>>>
>>> After some more research I think it happens to be OK because
>>> ->prot_hook.dev is used _only_ for pointer comparisons, it is never
>>> actually dereferenced or used in any other way. Probably, we should
>>> just use ->ifindex for this.
>>
>> It's easy enough to add a dev_hold() when I assign the skb instead
>> of looking it up in my patch, but perhaps it would be cleaner over all to
>> just hold a ref on the prot_hook.dev when it is originally assigned?
>
>
> Problem is : if packet_notifier(NETDEV_DOWN|UNREGISTER) is run while we
> sleep, what happens then ?
>
> Normally, if we sleep a long time in tpacket_snd() after device ref
> increment, and before dev_queue_xmit(), the unregister process can enter
> the infamous msleep(250) loop in netdev_wait_allrefs(), but at least we
> don't crash.
>
> But if you don't take the reference, we can crash in dev_queue_xmit()
> when dereferencing the freed netdev structure.
>
> Please check commit 1a35ca80c1db7 (packet: dont call sleeping functions
> while holding rcu_read_lock()) for reference on possible problems.
I'll create a new patch to hold ref on the prot_hook.dev when it's assigned,
and then layer the 'existing netdev reference' patch on top of that. Might
be a day or two...
Thanks,
Ben
>
> Thanks !
>
>
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
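
The hold/put rule discussed in this thread can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct fake_dev`, `fake_dev_hold()` and `fake_dev_put()` are hypothetical stand-ins, not the kernel's `struct net_device`, `dev_hold()` or `dev_put()`, and teardown is modelled by setting a flag rather than freeing memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted device; not the kernel type. */
struct fake_dev {
    atomic_int refcnt;   /* number of outstanding holders */
    bool freed;          /* set on teardown instead of calling free() */
};

/* Take a reference before any operation that may sleep. */
static void fake_dev_hold(struct fake_dev *d)
{
    atomic_fetch_add(&d->refcnt, 1);
}

/* Drop one reference; whoever drops the last one tears the object down. */
static void fake_dev_put(struct fake_dev *d)
{
    int old = atomic_fetch_sub(&d->refcnt, 1);

    assert(old > 0);     /* a put without a matching hold is a bug */
    if (old == 1)
        d->freed = true;
}
```

In this model a sender that called `fake_dev_hold()` before sleeping keeps the object alive even if the registration reference is dropped in the meantime; the object is only torn down by the final `fake_dev_put()`.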
^ permalink raw reply
* Re: [Xen-devel] Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Jesse Gross @ 2011-05-28 17:31 UTC (permalink / raw)
To: Ian Campbell
Cc: dev-yBygre7rU0TnMu66kgdUjQ,
xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xen-api-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR, Ben Hutchings,
Michał Mirosław, David Miller
In-Reply-To: <1306577228.23577.17.camel-ztPmHsLffjjnO4AKDKe2m+kiAK3p4hvP@public.gmane.org>
2011/5/28 Ian Campbell <Ian.Campbell@citrix.com>:
> On Sat, 2011-05-28 at 08:35 +0100, Michał Mirosław wrote:
>> On Sat, May 28, 2011 at 12:25:55AM +0100, Ben Hutchings wrote:
>> > On Fri, 2011-05-27 at 18:34 +0200, Michał Mirosław wrote:
>> > > On Fri, May 27, 2011 at 04:45:50PM +0100, Ben Hutchings wrote:
>> > > > On Fri, 2011-05-27 at 17:28 +0200, Michał Mirosław wrote:
>> > [...]
>> > > > > (note: ETHTOOL_S{SG,...} are not ever going away)
>> > > > > - causes NETIF_F_* to be an ABI
>> > > > If feature flag numbers are not stable then what is the point of
>> > > > /sys/class/net/<name>/features? Also, I'm not aware that features have
>> > > > ever been renumbered in the past.
>> > > Since no NETIF_F_* were exported earlier, I assume /sys/class/net/*/features
>> > > is a debugging aid. What is it used for besides that?
>> > xen-api <https://github.com/xen-org/xen-api> uses it in
>> > scripts/InterfaceReconfigureVswitch.py. Though it doesn't seem to be
>> > used for a particularly good reason...
>>
>> Looks like it should use ETHTOOL_GFLAGS instead for netdev_has_vlan_accel().
ETHTOOL_GFLAGS didn't expose the vlan acceleration flags until 2.6.37,
which is why /sys/class/net was used instead.
^ permalink raw reply
* [TRIVIAL PATCH next 00/15] treewide: Convert vmalloc/memset to vzalloc
From: Joe Perches @ 2011-05-28 17:36 UTC (permalink / raw)
To: linux-atm-general, netdev, drbd-user, dm-devel, linux-raid,
linux-mtd, linux-s
Cc: linux-s390, linux-kernel, linux-media, devel, xfs
Resubmittal of patches from November 2010 and a few new ones.
Joe Perches (15):
s390: Convert vmalloc/memset to vzalloc
x86: Convert vmalloc/memset to vzalloc
atm: Convert vmalloc/memset to vzalloc
drbd: Convert vmalloc/memset to vzalloc
char: Convert vmalloc/memset to vzalloc
isdn: Convert vmalloc/memset to vzalloc
md: Convert vmalloc/memset to vzalloc
media: Convert vmalloc/memset to vzalloc
mtd: Convert vmalloc/memset to vzalloc
scsi: Convert vmalloc/memset to vzalloc
staging: Convert vmalloc/memset to vzalloc
video: Convert vmalloc/memset to vzalloc
fs: Convert vmalloc/memset to vzalloc
mm: Convert vmalloc/memset to vzalloc
net: Convert vmalloc/memset to vzalloc
arch/s390/hypfs/hypfs_diag.c | 3 +--
arch/x86/mm/pageattr-test.c | 3 +--
drivers/atm/idt77252.c | 11 ++++++-----
drivers/atm/lanai.c | 3 +--
drivers/block/drbd/drbd_bitmap.c | 5 ++---
drivers/char/agp/backend.c | 3 +--
drivers/char/raw.c | 3 +--
drivers/isdn/i4l/isdn_common.c | 4 ++--
drivers/isdn/mISDN/dsp_core.c | 3 +--
drivers/isdn/mISDN/l1oip_codec.c | 6 ++----
drivers/md/dm-log.c | 3 +--
drivers/md/dm-snap-persistent.c | 3 +--
drivers/md/dm-table.c | 4 +---
drivers/media/video/videobuf2-dma-sg.c | 8 ++------
drivers/mtd/mtdswap.c | 3 +--
drivers/s390/cio/blacklist.c | 3 +--
drivers/scsi/bfa/bfad.c | 3 +--
drivers/scsi/bfa/bfad_debugfs.c | 8 ++------
drivers/scsi/cxgbi/libcxgbi.h | 6 ++----
drivers/scsi/qla2xxx/qla_attr.c | 6 ++----
drivers/scsi/qla2xxx/qla_bsg.c | 3 +--
drivers/scsi/scsi_debug.c | 7 ++-----
drivers/staging/rts_pstor/ms.c | 3 +--
drivers/staging/rts_pstor/rtsx_chip.c | 6 ++----
drivers/video/arcfb.c | 5 ++---
drivers/video/broadsheetfb.c | 4 +---
drivers/video/hecubafb.c | 5 ++---
drivers/video/metronomefb.c | 4 +---
drivers/video/xen-fbfront.c | 3 +--
fs/coda/coda_linux.h | 5 ++---
fs/reiserfs/journal.c | 9 +++------
fs/reiserfs/resize.c | 4 +---
fs/xfs/linux-2.6/kmem.h | 7 +------
mm/page_cgroup.c | 3 +--
net/netfilter/x_tables.c | 5 ++---
net/rds/ib_cm.c | 6 ++----
36 files changed, 57 insertions(+), 113 deletions(-)
--
1.7.5.rc3.dirty
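
The conversion pattern in this series can be sketched in userspace C. This is an analogy only: `malloc()`/`memset()`/`calloc()` stand in for the kernel's `vmalloc()`/`memset()`/`vzalloc()`, and the helper names `alloc_then_clear()`, `alloc_zeroed()` and `all_zero()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Before: allocate, check, then clear in a separate step. */
static void *alloc_then_clear(size_t size)
{
    void *p = malloc(size);

    if (!p)
        return NULL;
    memset(p, 0, size);
    return p;
}

/* After: one zeroing allocation; no separate memset() to forget or misplace. */
static void *alloc_zeroed(size_t size)
{
    return calloc(1, size);
}

/* Helper for checking the result: returns 1 if every byte is zero. */
static int all_zero(const void *p, size_t size)
{
    const unsigned char *b = p;

    assert(p != NULL);
    for (size_t i = 0; i < size; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

Both helpers return memory with the same contents; the second form removes one call and one place where the size expression has to be repeated.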
^ permalink raw reply
* [TRIVIAL PATCH next 03/15] atm: Convert vmalloc/memset to vzalloc
From: Joe Perches @ 2011-05-28 17:36 UTC (permalink / raw)
To: Chas Williams, Jiri Kosina; +Cc: linux-atm-general, netdev, linux-kernel
In-Reply-To: <cover.1306603968.git.joe@perches.com>
Signed-off-by: Joe Perches <joe@perches.com>
---
drivers/atm/idt77252.c | 11 ++++++-----
drivers/atm/lanai.c | 3 +--
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 1f8d724..8d7b663 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -3415,27 +3415,28 @@ init_card(struct atm_dev *dev)
size = sizeof(struct vc_map *) * card->tct_size;
IPRINTK("%s: allocate %d byte for VC map.\n", card->name, size);
- if (NULL == (card->vcs = vmalloc(size))) {
+ card->vcs = vzalloc(size);
+ if (!card->vcs) {
printk("%s: memory allocation failure.\n", card->name);
deinit_card(card);
return -1;
}
- memset(card->vcs, 0, size);
size = sizeof(struct vc_map *) * card->scd_size;
IPRINTK("%s: allocate %d byte for SCD to VC mapping.\n",
card->name, size);
- if (NULL == (card->scd2vc = vmalloc(size))) {
+ card->scd2vc = vzalloc(size);
+ if (!card->scd2vc) {
printk("%s: memory allocation failure.\n", card->name);
deinit_card(card);
return -1;
}
- memset(card->scd2vc, 0, size);
size = sizeof(struct tst_info) * (card->tst_size - 2);
IPRINTK("%s: allocate %d byte for TST to VC mapping.\n",
card->name, size);
- if (NULL == (card->soft_tst = vmalloc(size))) {
+ card->soft_tst = vmalloc(size);
+ if (!card->soft_tst) {
printk("%s: memory allocation failure.\n", card->name);
deinit_card(card);
return -1;
diff --git a/drivers/atm/lanai.c b/drivers/atm/lanai.c
index 4e8ba56..be57a14 100644
--- a/drivers/atm/lanai.c
+++ b/drivers/atm/lanai.c
@@ -1457,10 +1457,9 @@ static int __devinit vcc_table_allocate(struct lanai_dev *lanai)
return (lanai->vccs == NULL) ? -ENOMEM : 0;
#else
int bytes = (lanai->num_vci) * sizeof(struct lanai_vcc *);
- lanai->vccs = (struct lanai_vcc **) vmalloc(bytes);
+ lanai->vccs = vzalloc(bytes);
if (unlikely(lanai->vccs == NULL))
return -ENOMEM;
- memset(lanai->vccs, 0, bytes);
return 0;
#endif
}
--
1.7.5.rc3.dirty
^ permalink raw reply related
* [TRIVIAL PATCH next 15/15] net: Convert vmalloc/memset to vzalloc
From: Joe Perches @ 2011-05-28 17:36 UTC (permalink / raw)
To: Patrick McHardy, Andy Grover, Jiri Kosina
Cc: David S. Miller, netfilter-devel, netfilter, coreteam, netdev,
linux-kernel, rds-devel
In-Reply-To: <cover.1306603968.git.joe@perches.com>
Signed-off-by: Joe Perches <joe@perches.com>
---
net/netfilter/x_tables.c | 5 ++---
net/rds/ib_cm.c | 6 ++----
2 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index b0869fe..71441b9 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -776,12 +776,11 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
size = sizeof(void **) * nr_cpu_ids;
if (size > PAGE_SIZE)
- i->jumpstack = vmalloc(size);
+ i->jumpstack = vzalloc(size);
else
- i->jumpstack = kmalloc(size, GFP_KERNEL);
+ i->jumpstack = kzalloc(size, GFP_KERNEL);
if (i->jumpstack == NULL)
return -ENOMEM;
- memset(i->jumpstack, 0, size);
i->stacksize *= xt_jumpstack_multiplier;
size = sizeof(void *) * i->stacksize;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index fd453dd..6ecaf78 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -374,23 +374,21 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
goto out;
}
- ic->i_sends = vmalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
+ ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
ibdev_to_node(dev));
if (!ic->i_sends) {
ret = -ENOMEM;
rdsdebug("send allocation failed\n");
goto out;
}
- memset(ic->i_sends, 0, ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
- ic->i_recvs = vmalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
+ ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
ibdev_to_node(dev));
if (!ic->i_recvs) {
ret = -ENOMEM;
rdsdebug("recv allocation failed\n");
goto out;
}
- memset(ic->i_recvs, 0, ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work));
rds_ib_recv_init_ack(ic);
--
1.7.5.rc3.dirty
^ permalink raw reply related
* [TRIVIAL PATCH next 06/15] isdn: Convert vmalloc/memset to vzalloc
From: Joe Perches @ 2011-05-28 17:36 UTC (permalink / raw)
To: Jiri Kosina; +Cc: Karsten Keil, netdev, linux-kernel
In-Reply-To: <cover.1306603968.git.joe@perches.com>
Signed-off-by: Joe Perches <joe@perches.com>
---
drivers/isdn/i4l/isdn_common.c | 4 ++--
drivers/isdn/mISDN/dsp_core.c | 3 +--
drivers/isdn/mISDN/l1oip_codec.c | 6 ++----
3 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/drivers/isdn/i4l/isdn_common.c b/drivers/isdn/i4l/isdn_common.c
index 6ed82ad..6ddb795e 100644
--- a/drivers/isdn/i4l/isdn_common.c
+++ b/drivers/isdn/i4l/isdn_common.c
@@ -2308,11 +2308,11 @@ static int __init isdn_init(void)
int i;
char tmprev[50];
- if (!(dev = vmalloc(sizeof(isdn_dev)))) {
+ dev = vzalloc(sizeof(isdn_dev));
+ if (!dev) {
printk(KERN_WARNING "isdn: Could not allocate device-struct.\n");
return -EIO;
}
- memset((char *) dev, 0, sizeof(isdn_dev));
init_timer(&dev->timer);
dev->timer.function = isdn_timer_funct;
spin_lock_init(&dev->lock);
diff --git a/drivers/isdn/mISDN/dsp_core.c b/drivers/isdn/mISDN/dsp_core.c
index 2877291..0c41553 100644
--- a/drivers/isdn/mISDN/dsp_core.c
+++ b/drivers/isdn/mISDN/dsp_core.c
@@ -1052,12 +1052,11 @@ dspcreate(struct channel_req *crq)
if (crq->protocol != ISDN_P_B_L2DSP
&& crq->protocol != ISDN_P_B_L2DSPHDLC)
return -EPROTONOSUPPORT;
- ndsp = vmalloc(sizeof(struct dsp));
+ ndsp = vzalloc(sizeof(struct dsp));
if (!ndsp) {
printk(KERN_ERR "%s: vmalloc struct dsp failed\n", __func__);
return -ENOMEM;
}
- memset(ndsp, 0, sizeof(struct dsp));
if (dsp_debug & DEBUG_DSP_CTRL)
printk(KERN_DEBUG "%s: creating new dsp instance\n", __func__);
diff --git a/drivers/isdn/mISDN/l1oip_codec.c b/drivers/isdn/mISDN/l1oip_codec.c
index bbfd1b8..5a89972 100644
--- a/drivers/isdn/mISDN/l1oip_codec.c
+++ b/drivers/isdn/mISDN/l1oip_codec.c
@@ -330,14 +330,12 @@ l1oip_4bit_alloc(int ulaw)
return 0;
/* alloc conversion tables */
- table_com = vmalloc(65536);
- table_dec = vmalloc(512);
+ table_com = vzalloc(65536);
+ table_dec = vzalloc(512);
if (!table_com || !table_dec) {
l1oip_4bit_free();
return -ENOMEM;
}
- memset(table_com, 0, 65536);
- memset(table_dec, 0, 512);
/* generate compression table */
i1 = 0;
while (i1 < 256) {
--
1.7.5.rc3.dirty
^ permalink raw reply related
* Re: Kernel crash after using new Intel NIC (igb)
From: Ingo Molnar @ 2011-05-28 18:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Arun Sharma, David Miller, Maximilian Engelhardt, linux-kernel,
netdev, StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Thomas Gleixner
In-Reply-To: <1306561285.2533.9.camel@edumazet-laptop>
* Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +static inline int atomic_add_unless(atomic_t *v, int a, int u)
> > +{
> > + return __atomic_add_unless(v, a, u) != u;
> > }
> >
> > #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
>
> As I said, atomic_add_unless() has several implementations in
> various arches. You must take care of all, not only x86.
It's a bit sad to see local hacks to generic facilities committed
upstream like that.
Arun: the x86 bits look good at first sight.
Thanks,
Ingo
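
For readers unfamiliar with the primitive under discussion, here is a minimal userspace sketch of the "add unless" pattern using C11 atomics. It assumes the semantics shown in the quoted hunk (the inner helper returns the old value, the wrapper compares it with `u`); the `my_` names are hypothetical and this is not any architecture's kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Add a to *v unless *v == u. Returns the value observed before the add. */
static int my_atomic_add_unless_old(atomic_int *v, int a, int u)
{
    int c = atomic_load(v);

    while (c != u) {
        /* On failure c is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &c, c + a))
            break;
    }
    return c;
}

/* Returns non-zero if the add happened, zero if *v was equal to u. */
static int my_atomic_add_unless(atomic_int *v, int a, int u)
{
    return my_atomic_add_unless_old(v, a, u) != u;
}

/* Take a reference only if the counter has not already dropped to zero. */
static int my_atomic_inc_not_zero(atomic_int *v)
{
    return my_atomic_add_unless(v, 1, 0);
}
```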
^ permalink raw reply
* Re: [PATCH V6 0/4 net-next] macvtap/vhost TX zero-copy support
From: Shirley Ma @ 2011-05-28 18:07 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
kvm, linux-kernel
In-Reply-To: <1306441885.5180.52.camel@localhost.localdomain>
On Thu, 2011-05-26 at 13:31 -0700, Shirley Ma wrote:
> On Thu, 2011-05-26 at 23:28 +0300, Michael S. Tsirkin wrote:
> > On Thu, May 26, 2011 at 01:00:20PM -0700, Shirley Ma wrote:
> > > 3. Add sleep in vhost shutting down instead of busy-wait for outstanding
> > > DMAs.
> >
> > I still think this is not much better. We need to use a
> > completion structure and wait on it instead.
> > If this gets blocked, conceivably a tx watchdog can fire and save us
> > from blocking forever :)
>
> Ok, I can add a completion structure here.
The code here doesn't block forever during shutdown; it releases all
outstanding userspace buffers anyway. See the shutdown case in
vhost_zerocopy_signal_used().
Thanks
Shirley
^ permalink raw reply
* Re: [Bugme-new] [Bug 35862] New: arp requests from wrong src IP
From: Julian Anastasov @ 2011-05-28 19:10 UTC (permalink / raw)
To: Victor Mataré
Cc: David Miller, akpm, netdev, bugzilla-daemon, bugme-daemon
In-Reply-To: <201105280158.58927.matare@lih.rwth-aachen.de>
Hello,
On Sat, 28 May 2011, Victor Mataré wrote:
> ok - starting situation was 2 IPs: 137.226.164.13/24 (eth0) and 192.168.23.13/24 (eth0:0)
> then I did "ifconfig eth0 137.226.164.2 netmask 255.255.255.0"
> I'm not exactly sure what happened then, but the result was that "ip addr show dev eth0" showed that eth0 still had the old IP address, while ifconfig didn't. Ifconfig was misbehaving in some way, which is why I checked the situation with the ip tool. Then I used ip to configure everything as intended and now I have the situation described in this bug. Note that the server has been in production use for a week now despite that.
>
> >
> > - output of 'ip route list table local' after IPs are changed and
> > before starting arping
>
> broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
> broadcast 192.168.23.0 dev eth0 proto kernel scope link src 192.168.23.2
> local 192.168.23.2 dev eth0 proto kernel scope host src 192.168.23.2
> local 137.226.164.2 dev eth0 proto kernel scope host src 137.226.164.2
> local 137.226.164.13 dev eth0 proto kernel scope host src 137.226.164.13
> broadcast 192.168.23.255 dev eth0 proto kernel scope link src 192.168.23.2
> broadcast 137.226.164.255 dev eth0 proto kernel scope link src 137.226.164.2
> broadcast 137.226.164.255 dev eth0 proto kernel scope link src 192.168.23.2
> broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
> local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
> local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
>
> I guess that entry "local 137.226.164.13" shouldn't be there? But shouldn't that be removed automatically when I delete the IP from eth0?
Yes, this problem looks like what we fixed recently:
http://marc.info/?l=linux-netdev&m=129848300922970&w=2
http://marc.info/?l=linux-netdev&m=130048961407666&w=2
http://marc.info/?l=linux-netdev&m=130057251901164&w=2
It can happen only when 137.226.164.13 is added more than
once with different subnet masks at the same time,
e.g. /32 and /24.
To understand what really happens in your setup
we should try commands that reproduce the problem, e.g.
on some unused device such as eth1 or dummy0. The first
link has a test script as an example. Leaving such stale
routes behind should be reproducible.
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* [PATCH V7 0/4 net-next] macvtap/vhost TX zero-copy support
From: Shirley Ma @ 2011-05-28 19:14 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
This patchset adds support for TX zero-copy between the guest and host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (it reduced CPU usage by about
50% for a single-stream test on the host, while bandwidth at 4K message
size increased by about 50%). The patchset is based on the previous
submission and comments from the community regarding when/how to release
guest kernel buffers. This is the simplest approach I can think of
after comparing it with several other solutions.
This patchset has integrated V3 review comments from community:
1. Add more comments on how to use device ZEROCOPY flag;
2. Change device ZEROCOPY to available bit 31
3. Fix skb header linear allocation when virtio_net GSO is not enabled
It has integrated V4 review comments from MST and Sridhar:
1. In vhost, using socket poll wake up for outstanding DMAs
2. Add detailed comments for vhost_zerocopy_signal_used call
3. Add sleep in vhost shutting down instead of busy-wait for outstanding
DMAs.
4. Copy small packets, don't do zero-copy callback in macvtap, mark their
DMA as done in vhost
5. Change zerocopy to bool in macvtap.
It has integrated V5 review comments from MST and
Michał Mirosław <mirqus@gmail.com>
1. Prevent userspace apps from holding skb userspace buffers by copying
userspace buffers to kernel in skb_clone, skb_copy, pskb_copy,
pskb_expand_head.
2. It also uses the HIGHDMA and SG feature bits to enable ZEROCOPY, which
removes the dependency on a new feature bit; we can add it later when a
new feature bit is available.
It has integrated V6 review comments from Eric Dumazet.
1. Moving ubuf_info object from skb to caller, just use one pointer in
skb_share_info to point ubuf_info object.
2. Change the zero-copy size from 256 bytes to PAGE_SIZE (4K) because of
the small message size performance issue.
3. During vhost shutting down, release outstanding userspace buffers w/o
waiting for lower device DMAs done if any. Do we really care about the
possible wrong data being sent on the wire during shutting down?
This patchset includes:
1/4: Add a new sock zero-copy flag, SOCK_ZEROCOPY;
2/4: Add a new struct ubuf_info, referenced from skb_shared_info, for the
userspace buffer release callback. The callback runs when the lower device
has finished DMA for that skb, that is, when the last reference is gone.
Whenever skb_clone, skb_copy, pskb_copy or pskb_expand_head is called
(e.g. by tcpdump or filtering), the userspace buffers are copied into the
kernel, because we don't want userspace apps to hold userspace buffers
too long.
3/4: Add vhost zero-copy callback in vhost when skb last refcnt is gone;
add vhost_zerocopy_signal_used to notify guest to release TX skb
buffers.
4/4: Add macvtap zero-copy in lower device when sending packet is
greater than PAGE_SIZE.
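
The copy-versus-zero-copy decision described above can be sketched as a small predicate. This is an illustration under assumptions, not code from the patchset: `sketch_use_zerocopy()` and `SKETCH_PAGE_SIZE` are hypothetical names, a 4K page is assumed, and the threshold is taken as "greater than or equal to PAGE_SIZE" as worded in the macvtap patch description (the cover letter says "greater than").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4K pages */

/*
 * Decide whether a packet should take the zero-copy path.
 * Small packets are copied; large ones are mapped from userspace,
 * but only if the lower device advertises both SG and HIGHDMA.
 */
static bool sketch_use_zerocopy(size_t len, bool lower_sg, bool lower_highdma)
{
    if (!lower_sg || !lower_highdma)
        return false;
    return len >= SKETCH_PAGE_SIZE;
}
```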
The patchset is built against linux-2.6.39. It has passed a
netperf/netserver multiple-stream stress test, a tcpdump
suspended test, and a dynamic SG change test.
Single TCP_STREAM 120-second test results on 2.6.39-rc3 over an ixgbe
10Gb NIC:
Message     BW (Mb/s)   qemu-kvm (NumCPU)   vhost-net (NumCPU)   PerfTop irq/s
4K          7408.57     92.1%               22.6%                1229
4K(Orig)    4913.17     118.1%              84.1%                2086
8K          9129.90     89.3%               23.3%                1141
8K(Orig)    7094.55     115.9%              84.7%                2157
16K         9178.81     89.1%               23.3%                1139
16K(Orig)   8927.1      118.7%              83.4%                2262
64K         9171.43     88.4%               24.9%                1253
64K(Orig)   9085.85     115.9%              82.4%                2229
For message sizes less than or equal to 2K, there is a known KVM guest TX
overrun issue. With this zero-copy patch the issue becomes more severe:
guest io_exits have tripled, so the performance is not good.
Once the TX overrun problem has been addressed, I will retest the small
message size performance.
drivers/net/macvtap.c | 131 ++++++++++++++++++++++++++++++++++++++++++++----
drivers/vhost/net.c | 45 ++++++++++++++++-
drivers/vhost/vhost.c | 51 +++++++++++++++++++
drivers/vhost/vhost.h | 15 ++++++
include/linux/skbuff.h | 25 +++++++++
include/net/sock.h | 1 +
net/core/skbuff.c | 83 ++++++++++++++++++++++++++++++-
7 files changed, 338 insertions(+), 13 deletions(-)
^ permalink raw reply
* [PATCH V7 1/4 net-next] sock.h: Add a new sock zero-copy flag
From: Shirley Ma @ 2011-05-28 19:15 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
include/net/sock.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index f2046e4..2229bd1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -563,6 +563,7 @@ enum sock_flags {
SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
SOCK_FASYNC, /* fasync() active */
SOCK_RXQ_OVFL,
+ SOCK_ZEROCOPY, /* buffers from userspace */
};
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
^ permalink raw reply related
* [PATCH V7 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Shirley Ma @ 2011-05-28 19:23 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
This patch adds userspace buffer support in skb shared info. A new
struct ubuf_info is needed to maintain the userspace buffer
argument and index; a callback is used to notify userspace to release
the buffers once the lower device has done DMA (the last reference to
that skb is gone).
If any userspace app references these userspace buffers,
then the buffers will be copied into the kernel. This way we
can prevent userspace apps from holding these userspace buffers too long.
One userspace buffer info pointer is added to skb_shared_info. Would it
be safe to use destructor_arg instead? From the comments, destructor_arg
is used by the destructor, and destructor calls don't wait until the last
reference to that skb is gone.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
include/linux/skbuff.h | 24 ++++++++++++++++
net/core/skbuff.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e8b78ce..37a2cb4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,17 @@ enum {
SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
};
+/*
+ * The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the skb last reference should be 0 when calling this.
+ * The desc is used to track userspace buffer index.
+ */
+struct ubuf_info {
+ void (*callback)(void *);
+ void *arg;
+ unsigned long desc;
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -203,6 +214,9 @@ struct skb_shared_info {
struct sk_buff *frag_list;
struct skb_shared_hwtstamps hwtstamps;
+ /* DMA mapping from/to userspace buffers info */
+ void *ubuf_arg;
+
/*
* Warning : all fields before dataref are cleared in __alloc_skb()
*/
@@ -2261,5 +2275,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
}
bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+/*
+ * skb *uarg - is the buffer from userspace
+ * @skb: buffer to check
+ */
+static inline int skb_ubuf(const struct sk_buff *skb)
+{
+ return skb_shinfo(skb)->ubuf_arg != NULL;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..115c5ca 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -329,6 +329,19 @@ static void skb_release_data(struct sk_buff *skb)
put_page(skb_shinfo(skb)->frags[i].page);
}
+ /*
+ * If skb buf is from userspace, we need to notify the caller
+ * the lower device DMA has done;
+ */
+ if (skb_ubuf(skb)) {
+ struct ubuf_info *uarg;
+
+ uarg = (struct ubuf_info *)skb_shinfo(skb)->ubuf_arg;
+
+ if (uarg->callback)
+ uarg->callback(uarg);
+ }
+
if (skb_has_frag_list(skb))
skb_drop_fraglist(skb);
@@ -481,6 +494,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
if (irqs_disabled())
return false;
+ if (skb_ubuf(skb))
+ return false;
+
if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
return false;
@@ -573,6 +589,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
atomic_set(&n->users, 1);
atomic_inc(&(skb_shinfo(skb)->dataref));
+ skb_shinfo(skb)->ubuf_arg = NULL;
skb->cloned = 1;
return n;
@@ -596,6 +613,51 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
}
EXPORT_SYMBOL_GPL(skb_morph);
+/* skb frags copy userspace buffers to kernel */
+static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+ int i;
+ int num_frags = skb_shinfo(skb)->nr_frags;
+ struct page *page, *head = NULL;
+ struct ubuf_info *uarg = skb_shinfo(skb)->ubuf_arg;
+
+ for (i = 0; i < num_frags; i++) {
+ u8 *vaddr;
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+ page = alloc_page(GFP_ATOMIC);
+ if (!page) {
+ while (head) {
+ put_page(head);
+ head = (struct page *)head->private;
+ }
+ return -ENOMEM;
+ }
+ vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
+ memcpy(page_address(page),
+ vaddr + f->page_offset, f->size);
+ kunmap_skb_frag(vaddr);
+ page->private = (unsigned long)head;
+ head = page;
+ }
+
+ /* skb frags release userspace buffers */
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ put_page(skb_shinfo(skb)->frags[i].page);
+
+ uarg->callback(uarg);
+ skb_shinfo(skb)->ubuf_arg = NULL;
+
+ /* skb frags point to kernel buffers */
+ for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
+ skb_shinfo(skb)->frags[i - 1].page_offset = 0;
+ skb_shinfo(skb)->frags[i - 1].page = head;
+ head = (struct page *)head->private;
+ }
+ return 0;
+}
+
+
/**
* skb_clone - duplicate an sk_buff
* @skb: buffer to clone
@@ -614,6 +676,11 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
{
struct sk_buff *n;
+ if (skb_ubuf(skb)) {
+ if (skb_copy_ubufs(skb, gfp_mask))
+ return NULL;
+ }
+
n = skb + 1;
if (skb->fclone == SKB_FCLONE_ORIG &&
n->fclone == SKB_FCLONE_UNAVAILABLE) {
@@ -731,6 +798,12 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
if (skb_shinfo(skb)->nr_frags) {
int i;
+ if (skb_ubuf(skb)) {
+ if (skb_copy_ubufs(skb, gfp_mask)) {
+ kfree(n);
+ goto out;
+ }
+ }
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
get_page(skb_shinfo(n)->frags[i].page);
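
The page bookkeeping in skb_copy_ubufs() above (push each new page onto a list through its private field, then fill the frag slots from last to first) can be modelled in userspace. This is a sketch only: `struct sketch_page`, `sketch_push_all()` and `sketch_assign_reverse()` are hypothetical names, and `uintptr_t` is used for the link where the kernel code uses `unsigned long`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page with a private link field. */
struct sketch_page {
    uintptr_t link;   /* previously pushed page, or 0 for the first one */
    int payload;      /* marks which fragment this page is a copy of */
};

/* Push pages[0..n-1] in order; the returned head is the last page pushed. */
static struct sketch_page *sketch_push_all(struct sketch_page *pages, int n)
{
    struct sketch_page *head = NULL;

    for (int i = 0; i < n; i++) {
        pages[i].link = (uintptr_t)head;
        head = &pages[i];
    }
    return head;
}

/* Pop from the head while filling slots from n-1 down to 0. */
static void sketch_assign_reverse(struct sketch_page *head,
                                  struct sketch_page **slots, int n)
{
    for (int i = n; i > 0; i--) {
        assert(head != NULL);
        slots[i - 1] = head;
        head = (struct sketch_page *)head->link;
    }
}
```

Because the list is last-in first-out and the slots are filled in reverse, slot i ends up holding the page that was allocated for fragment i, so fragment order is preserved.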
^ permalink raw reply related
* [PATCH V7 3/4]macvtap: macvtap TX zero-copy support
From: Shirley Ma @ 2011-05-28 19:25 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
Macvtap enables zero-copy only when the buffer size is greater than or
equal to PAGE_SIZE.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/net/macvtap.c | 131 ++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 120 insertions(+), 11 deletions(-)
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..a0fe315 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN 256
static struct class *macvtap_class;
static struct cdev macvtap_cdev;
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+ /*
+ * so far only KVM virtio_net uses macvtap, enable zero copy between
+ * guest kernel and host kernel when lower device supports zerocopy
+ */
+ if (vlan) {
+ if ((vlan->lowerdev->features & NETIF_F_HIGHDMA) &&
+ (vlan->lowerdev->features & NETIF_F_SG))
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (count && (offset >= from->iov_len)) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (count && (copy > 0)) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ skb->data_len += len;
+ skb->len += len;
+ skb->truesize += len;
+ atomic_add(len, &skb->sk->sk_wmem_alloc);
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -515,16 +601,18 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int copylen;
+ bool zerocopy = false;
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;
@@ -552,12 +640,30 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;
- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
- noblock, &err);
+ if (m && m->msg_control)
+ zerocopy = true;
+
+ if (zerocopy) {
+ /* There are 256 bytes to be copied in skb, so there is enough
+ * room for skb expand head in case it is used.
+ * The rest buffer is mapped from userspace.
+ */
+ copylen = vnet_hdr.hdr_len;
+ if (!copylen)
+ copylen = GOODCOPY_LEN;
+ } else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+ vnet_hdr.hdr_len, noblock, &err);
if (!skb)
goto err;
- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
if (err)
goto err_kfree;
@@ -573,13 +679,16 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
rcu_read_lock_bh();
vlan = rcu_dereference_bh(q->vlan);
+ /* copy skb_ubuf_info for callback when skb has no error */
+ if (zerocopy)
+ skb_shinfo(skb)->ubuf_arg = m->msg_control;
if (vlan)
macvlan_start_xmit(skb, vlan->dev);
else
kfree_skb(skb);
rcu_read_unlock_bh();
- return count;
+ return total_len;
err_kfree:
kfree_skb(skb);
@@ -601,8 +710,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;
- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}
@@ -815,7 +924,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}
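
The page arithmetic used in zerocopy_sg_from_iovec() above (how many pages a user buffer spans, and how much of it fits in the first page) can be checked with a small standalone sketch. The `SKETCH_` constants and function names are hypothetical, and a 4K page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                              /* assumption: 4K pages */
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Number of pages touched by the byte range [base, base + len). */
static uintptr_t sketch_pages_spanned(uintptr_t base, uintptr_t len)
{
    uintptr_t off = base & ~SKETCH_PAGE_MASK;   /* offset inside first page */

    /* Round (off + len) up to a whole number of pages. */
    return (off + len + ~SKETCH_PAGE_MASK) >> SKETCH_PAGE_SHIFT;
}

/* Bytes of the range that fit in the first page it touches. */
static uintptr_t sketch_first_frag_size(uintptr_t base, uintptr_t len)
{
    uintptr_t off = base & ~SKETCH_PAGE_MASK;
    uintptr_t room = SKETCH_PAGE_SIZE - off;

    return len < room ? len : room;
}
```

A page-aligned 4096-byte buffer occupies one page, while the same length starting halfway into a page spans two, which is why the fragment count can exceed len / PAGE_SIZE.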
^ permalink raw reply related