* Re: [RFC PATCH net-next 2/3] virtio_net: Introduce one dummy function virtnet_filter_rfs()
From: Tom Herbert @ 2014-01-15 17:54 UTC (permalink / raw)
To: Zhi Yong Wu; +Cc: Linux Netdev List, Eric Dumazet, David Miller, Zhi Yong Wu
In-Reply-To: <1389795654-28381-3-git-send-email-zwu.kernel@gmail.com>
Zhi, this is promising work! I can't wait to see how this impacts
network virtualization performance :-)
On Wed, Jan 15, 2014 at 6:20 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
> drivers/net/virtio_net.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b17240..046421c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1295,6 +1295,14 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
> return 0;
> }
>
> +#ifdef CONFIG_RFS_ACCEL
> +static int virtnet_filter_rfs(struct net_device *net_dev,
> + const struct sk_buff *skb, u16 rxq_index, u32 flow_id)
> +{
Does this need to be filled out with more stuff?
> + return 0;
> +}
> +#endif /* CONFIG_RFS_ACCEL */
> +
> static const struct net_device_ops virtnet_netdev = {
> .ndo_open = virtnet_open,
> .ndo_stop = virtnet_close,
> @@ -1309,6 +1317,9 @@ static const struct net_device_ops virtnet_netdev = {
> #ifdef CONFIG_NET_POLL_CONTROLLER
> .ndo_poll_controller = virtnet_netpoll,
> #endif
> +#ifdef CONFIG_RFS_ACCEL
> + .ndo_rx_flow_steer = virtnet_filter_rfs,
> +#endif
> };
>
> static void virtnet_config_changed_work(struct work_struct *work)
> --
> 1.7.6.5
>
* Re: [PATCH v2 2/2] Documentation: Document the cephroot functionality
From: Randy Dunlap @ 2014-01-15 18:00 UTC (permalink / raw)
To: mark.doffman
Cc: ceph-devel, Rob Taylor, sage, netdev, linux-kernel, linux-nfs
In-Reply-To: <29b3bd9700e23df2aee095df8c34a15a62f57d27.1389806186.git.mark.doffman@codethink.co.uk>
On 01/15/2014 09:26 AM, mark.doffman@codethink.co.uk wrote:
> From: Rob Taylor <rob.taylor@codethink.co.uk>
>
> Document using the cephfs as a root device, its purpose,
> functionality and use.
>
> Signed-off-by: Mark Doffman <mark.doffman@codethink.co.uk>
> Signed-off-by: Rob Taylor <rob.taylor@codethink.co.uk>
> Reviewed-by: Ian Molton <ian.molton@codethink.co.uk>
> ---
> Documentation/filesystems/{ => ceph}/ceph.txt | 0
> Documentation/filesystems/ceph/cephroot.txt | 86 +++++++++++++++++++++++++++
> 2 files changed, 86 insertions(+)
> rename Documentation/filesystems/{ => ceph}/ceph.txt (100%)
> create mode 100644 Documentation/filesystems/ceph/cephroot.txt
>
> diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph/ceph.txt
> similarity index 100%
> rename from Documentation/filesystems/ceph.txt
> rename to Documentation/filesystems/ceph/ceph.txt
> diff --git a/Documentation/filesystems/ceph/cephroot.txt b/Documentation/filesystems/ceph/cephroot.txt
> new file mode 100644
> index 0000000..deda4f0
> --- /dev/null
> +++ b/Documentation/filesystems/ceph/cephroot.txt
> @@ -0,0 +1,86 @@
> +Mounting the root filesystem via Ceph (cephroot)
> +===============================================
> +
> +Written 2013 by Rob Taylor <rob.taylor@codethink.co.uk>
> +
> +derived from nfsroot.txt:
> +
> +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
> +Updated 2006 by Horms <horms@verge.net.au>
> +
> +
> +
> +In order to use a diskless system, such as an X-terminal or printer server
> +for example, it is necessary for the root filesystem to be present on a
> +non-disk device. This may be an initramfs (see Documentation/filesystems/
> +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt), a
> +filesystem mounted via NFS or a filesystem mounted via Ceph. The following
> +text describes on how to use Ceph for the root filesystem.
> +
> +For the rest of this text 'client' means the diskless system, and 'server'
> +means the Ceph server.
> +
> +
> +1.) Enabling cephroot capabilities
> + -----------------------------
> +
> +In order to use cephroot, CEPH_FS needs to be selected as
> +built-in during configuration. Once this has been selected, the cephroot
> +option will become available, which should also be selected.
> +
> +In the networking options, kernel level autoconfiguration can be selected,
> +along with the types of autoconfiguration to support. Selecting all of
> +DHCP, BOOTP and RARP is safe.
> +
> +
> +2.) Kernel command line
> + -------------------
> +
> +When the kernel has been loaded by a boot loader (see below) it needs to be
> +told what root fs device to use. And in the case of cephroot, where to find
use, and
> +both the server and the name of the directory on the server to mount as root.
> +This can be established using the following kernel command line parameters:
> +
> +root=/dev/ceph
> +
> +This is necessary to enable the pseudo-Ceph-device. Note that it's not a
> +real device but just a synonym to tell the kernel to use Ceph instead of
> +a real device.
> +
> +If cephroot is not specified, it is expected that that a valid mount will be
drop duplicate: that
> +found via DHCP option 17, Root Path [1]
> +
> +cephroot=<monaddrs>:/[<subdir>],<ceph-opts>
> +
> + <monaddrs> Monitor addresses separated by commas. Each takes the form
> + host[:port]. If the port is not specified, the Ceph default
> + of 6789 is assumed.
> +
> + <subdir> A subdirectory subdir may be specified if a subset of the file
> + system is to be mounted
mounted.
> +
> + <ceph-opts> Standard Ceph options. All options are separated by commas.
> + See Documentation/filesystems/ceph/ceph.txt for options and
> + their defaults.
> +
> +4.) References
> + ----------
> +
> +[1] http://tools.ietf.org/html/rfc2132
> +
> +5.) Credits
> + -------
> +
> + cephroot was derived from nfsroot by Rob Taylor <rob.taylor@codethink.co.uk>
> + and Mark Doffman <mark.doffman@codethink.co.uk>
> +
> + The nfsroot code in the kernel and the RARP support have been written
> + by Gero Kuhlmann <gero@gkminix.han.de>.
> +
> + The rest of the IP layer autoconfiguration code has been written
> + by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
> +
> + In order to write the initial version of nfsroot I would like to thank
> + Jens-Uwe Mager <jum@anubis.han.de> for his help.
>
--
~Randy
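The cephroot= syntax documented in the quoted patch can be illustrated with a small user-space parser. This is only a sketch under stated assumptions: the struct, the function names and the fixed buffer sizes are hypothetical, the <ceph-opts> part is treated as optional, and the kernel patch may parse the string differently. Only the 6789 default port and the "<monaddrs>:/[<subdir>],<ceph-opts>" layout come from the quoted text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CEPH_DEFAULT_MON_PORT 6789	/* default stated in the quoted document */

/* Hypothetical container for the three parts of a cephroot= value. */
struct cephroot_spec {
	char monaddrs[128];	/* comma-separated host[:port] list */
	char subdir[128];	/* path on the server, always starts with '/' */
	char opts[128];		/* remaining comma-separated Ceph options */
};

/* Split "<monaddrs>:/[<subdir>],<ceph-opts>". Returns 0 on success, -1 on error. */
static int parse_cephroot(const char *arg, struct cephroot_spec *out)
{
	const char *sep, *path, *comma;
	size_t n;

	assert(arg && out);

	/* The first ":/" ends the monitor list; a host:port colon is followed by a digit. */
	sep = strstr(arg, ":/");
	if (!sep || sep == arg)
		return -1;
	n = (size_t)(sep - arg);
	if (n >= sizeof(out->monaddrs))
		return -1;
	memcpy(out->monaddrs, arg, n);
	out->monaddrs[n] = '\0';

	/* The path runs from the '/' up to the next comma, if there is one. */
	path = sep + 1;
	comma = strchr(path, ',');
	n = comma ? (size_t)(comma - path) : strlen(path);
	if (n >= sizeof(out->subdir))
		return -1;
	memcpy(out->subdir, path, n);
	out->subdir[n] = '\0';

	/* Everything after that comma is passed through as options (assumed optional). */
	out->opts[0] = '\0';
	if (comma) {
		if (strlen(comma + 1) >= sizeof(out->opts))
			return -1;
		strcpy(out->opts, comma + 1);
	}
	return 0;
}

/* Port of a single "host[:port]" monitor entry, falling back to the default. */
static int mon_port(const char *mon)
{
	const char *colon = strrchr(mon, ':');

	return colon ? atoi(colon + 1) : CEPH_DEFAULT_MON_PORT;
}
```

With the made-up value "10.0.0.1:6789,10.0.0.2:/rootfs,name=admin", the sketch yields the monitor list "10.0.0.1:6789,10.0.0.2", the subdirectory "/rootfs" and the options "name=admin"; mon_port("10.0.0.2") falls back to 6789.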
* Re: [PATCH net-next] netfilter: remove double colon
From: Denis Kirjanov @ 2014-01-15 18:05 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Pablo Neira Ayuso, David S. Miller, netdev, netfilter-devel
In-Reply-To: <20140115081250.56958f9a@nehalam.linuxnetplumber.net>
You did miss the "---" after SOB ;)
On 1/15/14, Stephen Hemminger <stephen@networkplumber.org> wrote:
> This is C not shell script
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>
> --- a/net/ipv4/netfilter.c 2013-12-31 17:45:31.993942921 -0800
> +++ b/net/ipv4/netfilter.c 2014-01-15 08:10:49.793785943 -0800
> @@ -61,7 +61,7 @@ int ip_route_me_harder(struct sk_buff *s
> skb_dst_set(skb, NULL);
> dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
> if (IS_ERR(dst))
> - return PTR_ERR(dst);;
> + return PTR_ERR(dst);
> skb_dst_set(skb, dst);
> }
> #endif
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
From: Andrew Vagin @ 2014-01-15 18:08 UTC (permalink / raw)
To: Florian Westphal
Cc: Andrey Vagin, netfilter-devel, Eric Dumazet, netfilter, netdev,
linux-kernel, vvs, Cyrill Gorcunov, Vasiliy Averin
In-Reply-To: <20140114185329.GB28205@breakpoint.cc>
On Tue, Jan 14, 2014 at 07:53:29PM +0100, Florian Westphal wrote:
> Andrey Vagin <avagin@openvz.org> wrote:
> > ----
> > Eric and Florian, could you look at this patch. When you say,
> > that it looks good, I will ask the user to validate it.
> > I can't reorder these actions, because it's reproduced on a real host
> > with real users. Thanks.
> > ----
> >
> > nf_conntrack_free can't be called for a conntract with non-zero ref-counter,
> > because it can race with nf_conntrack_find_get().
>
> Indeed.
>
> > A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
> > ref-conunter says that this conntrack is used now. So when we release a
> > conntrack with non-zero counter, we break this assumption.
> >
> > CPU1 CPU2
> > ____nf_conntrack_find()
> > nf_ct_put()
> > destroy_conntrack()
> > ...
> > init_conntrack
> > __nf_conntrack_alloc (set use = 1)
> > atomic_inc_not_zero(&ct->use) (use = 2)
> > if (!l4proto->new(ct, skb, dataoff, timeouts))
> > nf_conntrack_free(ct); (use = 2 !!!)
> > ...
>
> Yes, I think this sequence is possible; we must not use nf_conntrack_free here.
>
> > - /* We overload first tuple to link into unconfirmed or dying list.*/
> > - BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
> > - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> > + if (!hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
> > + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
>
> This is the only thing that I don't like about this patch. Currently
> all the conntracks in the system are always put on a list before they're
> supposed to be visible/handled via refcnt system (unconfirmed, hash, or
> dying list).
>
> I think it would be nice if we could keep it that way.
> If everything fails we could proably intoduce a 'larval' dummy list
> similar to the one used by template conntracks?
I'm not sure that this is required. Could you elaborate on when this would
be useful?
Right now I see only overhead, because we would need to take
nf_conntrack_lock to add the conntrack to a list.
Thanks,
Andrey
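The refcount rule under discussion can be sketched in user space with C11 atomics. This is a minimal illustration, not the conntrack code: struct fake_ct, ct_get_unless_zero() and ct_put() are hypothetical names, the first standing in for what atomic_inc_not_zero() does in nf_conntrack_find_get(). The point is that an object whose counter is above one must be released by dropping a reference, never by freeing it directly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a conntrack entry. */
struct fake_ct {
	atomic_int use;		/* reference counter */
	bool destroyed;		/* set where the real destructor would run */
};

/* Analogue of atomic_inc_not_zero(): take a reference only if the object is live. */
static bool ct_get_unless_zero(struct fake_ct *ct)
{
	int old = atomic_load(&ct->use);

	while (old != 0) {
		/* On failure the CAS reloads 'old', so the loop retries with a fresh value. */
		if (atomic_compare_exchange_weak(&ct->use, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; only the caller that drops the last one destroys the object. */
static void ct_put(struct fake_ct *ct)
{
	int old = atomic_fetch_sub(&ct->use, 1);

	assert(old > 0);	/* dropping a reference nobody holds is a bug */
	if (old == 1)
		ct->destroyed = true;
}
```

In the sequence from the quoted diagram, the allocator holds one reference and the concurrent lookup takes a second one. If the error path calls ct_put() instead of freeing, the counter drops to one and the object stays valid until the lookup side releases it as well.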
* Re: [Patch net-next] net_sched: act: remove headers in include/net/tc_act/
From: Cong Wang @ 2014-01-15 18:57 UTC (permalink / raw)
To: David Miller; +Cc: Cong Wang, netdev, Jamal Hadi Salim
In-Reply-To: <20140114.181213.695613248997119103.davem@davemloft.net>
On Tue, Jan 14, 2014 at 6:12 PM, David Miller <davem@davemloft.net> wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Tue, 14 Jan 2014 17:01:39 -0800
>
>> These headers are not necessary because those definitions in them
>> are action specific and are not shared for others. Just move them
>> into the C files.
>>
>> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>
> Like Eric, I think this is a dubious change.
>
> There is nothing wrong with using these headers to define the
> core data structures used by each of these actions modules.
>
Nothing is wrong here; it is just not necessary.
act_police defines similar structures in its C file, not in a header.
I don't see any reason why the others can't.
* [PATCH net-next] ipv6: send Change Status Report after DAD is completed
From: Flavio Leitner @ 2014-01-15 19:10 UTC (permalink / raw)
To: netdev; +Cc: Hideaki YOSHIFUJI, Hannes Frederic Sowa, Flavio Leitner
RFC 3810 defines two types of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.
On the other hand, any change should be reported immediately
using "State Change Report" messages. Since such a report is
triggered by a change and can be lost in transit, the RFC
states it should be retransmitted [RobVar] times
to make sure routers receive it in a timely manner.
Currently, we are sending "Current State Reports" after
DAD is completed. Before that, we send messages using the
unspecified address (::), which should be silently discarded
by routers.
This patch sends "State Change Report" messages after DAD
is completed instead, making the behavior RFC compliant
and allowing it to pass the TAHI IPv6 test suite.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
net/ipv6/mcast.c | 64 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 39 insertions(+), 25 deletions(-)
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 99cd65c..8ac17f5 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1493,7 +1493,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
skb_tailroom(skb)) : 0)
static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
- int type, int gdeleted, int sdeleted)
+ int type, int gdeleted, int sdeleted, int crsend)
{
struct inet6_dev *idev = pmc->idev;
struct net_device *dev = idev->dev;
@@ -1585,7 +1585,7 @@ empty_source:
if (type == MLD2_ALLOW_NEW_SOURCES ||
type == MLD2_BLOCK_OLD_SOURCES)
return skb;
- if (pmc->mca_crcount || isquery) {
+ if (pmc->mca_crcount || isquery || crsend) {
/* make sure we have room for group header */
if (skb && AVAILABLE(skb) < sizeof(struct mld2_grec)) {
mld_sendpack(skb);
@@ -1602,6 +1602,28 @@ empty_source:
return skb;
}
+static void mld_send_initial_cr(struct inet6_dev *idev)
+{
+ struct sk_buff *skb;
+ struct ifmcaddr6 *pmc;
+ int type;
+
+ skb = NULL;
+ read_lock_bh(&idev->lock);
+ for (pmc=idev->mc_list; pmc; pmc=pmc->next) {
+ spin_lock_bh(&pmc->mca_lock);
+ if (pmc->mca_sfcount[MCAST_EXCLUDE])
+ type = MLD2_CHANGE_TO_EXCLUDE;
+ else
+ type = MLD2_CHANGE_TO_INCLUDE;
+ skb = add_grec(skb, pmc, type, 0, 0, 1);
+ spin_unlock_bh(&pmc->mca_lock);
+ }
+ read_unlock_bh(&idev->lock);
+ if (skb)
+ mld_sendpack(skb);
+}
+
static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
{
struct sk_buff *skb = NULL;
@@ -1617,7 +1639,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
type = MLD2_MODE_IS_EXCLUDE;
else
type = MLD2_MODE_IS_INCLUDE;
- skb = add_grec(skb, pmc, type, 0, 0);
+ skb = add_grec(skb, pmc, type, 0, 0, 0);
spin_unlock_bh(&pmc->mca_lock);
}
} else {
@@ -1626,7 +1648,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
type = MLD2_MODE_IS_EXCLUDE;
else
type = MLD2_MODE_IS_INCLUDE;
- skb = add_grec(skb, pmc, type, 0, 0);
+ skb = add_grec(skb, pmc, type, 0, 0, 0);
spin_unlock_bh(&pmc->mca_lock);
}
read_unlock_bh(&idev->lock);
@@ -1671,13 +1693,13 @@ static void mld_send_cr(struct inet6_dev *idev)
if (pmc->mca_sfmode == MCAST_INCLUDE) {
type = MLD2_BLOCK_OLD_SOURCES;
dtype = MLD2_BLOCK_OLD_SOURCES;
- skb = add_grec(skb, pmc, type, 1, 0);
- skb = add_grec(skb, pmc, dtype, 1, 1);
+ skb = add_grec(skb, pmc, type, 1, 0, 0);
+ skb = add_grec(skb, pmc, dtype, 1, 1, 0);
}
if (pmc->mca_crcount) {
if (pmc->mca_sfmode == MCAST_EXCLUDE) {
type = MLD2_CHANGE_TO_INCLUDE;
- skb = add_grec(skb, pmc, type, 1, 0);
+ skb = add_grec(skb, pmc, type, 1, 0, 0);
}
pmc->mca_crcount--;
if (pmc->mca_crcount == 0) {
@@ -1708,8 +1730,8 @@ static void mld_send_cr(struct inet6_dev *idev)
type = MLD2_ALLOW_NEW_SOURCES;
dtype = MLD2_BLOCK_OLD_SOURCES;
}
- skb = add_grec(skb, pmc, type, 0, 0);
- skb = add_grec(skb, pmc, dtype, 0, 1); /* deleted sources */
+ skb = add_grec(skb, pmc, type, 0, 0, 0);
+ skb = add_grec(skb, pmc, dtype, 0, 1, 0); /* deleted sources */
/* filter mode changes */
if (pmc->mca_crcount) {
@@ -1717,7 +1739,7 @@ static void mld_send_cr(struct inet6_dev *idev)
type = MLD2_CHANGE_TO_EXCLUDE;
else
type = MLD2_CHANGE_TO_INCLUDE;
- skb = add_grec(skb, pmc, type, 0, 0);
+ skb = add_grec(skb, pmc, type, 0, 0, 0);
pmc->mca_crcount--;
}
spin_unlock_bh(&pmc->mca_lock);
@@ -1825,27 +1847,19 @@ err_out:
goto out;
}
-static void mld_resend_report(struct inet6_dev *idev)
+static void mld_resend_cr(struct inet6_dev *idev)
{
- if (MLD_V1_SEEN(idev)) {
- struct ifmcaddr6 *mcaddr;
- read_lock_bh(&idev->lock);
- for (mcaddr = idev->mc_list; mcaddr; mcaddr = mcaddr->next) {
- if (!(mcaddr->mca_flags & MAF_NOREPORT))
- igmp6_send(&mcaddr->mca_addr, idev->dev,
- ICMPV6_MGM_REPORT);
- }
- read_unlock_bh(&idev->lock);
- } else {
- mld_send_report(idev, NULL);
- }
+ if (MLD_V1_SEEN(idev))
+ return;
+
+ mld_send_initial_cr(idev);
}
void ipv6_mc_dad_complete(struct inet6_dev *idev)
{
idev->mc_dad_count = idev->mc_qrv;
if (idev->mc_dad_count) {
- mld_resend_report(idev);
+ mld_resend_cr(idev);
idev->mc_dad_count--;
if (idev->mc_dad_count)
mld_dad_start_timer(idev, idev->mc_maxdelay);
@@ -1856,7 +1870,7 @@ static void mld_dad_timer_expire(unsigned long data)
{
struct inet6_dev *idev = (struct inet6_dev *)data;
- mld_resend_report(idev);
+ mld_resend_cr(idev);
if (idev->mc_dad_count) {
idev->mc_dad_count--;
if (idev->mc_dad_count)
--
1.8.4.2
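The retransmission counting in ipv6_mc_dad_complete() and mld_dad_timer_expire() from the patch above can be followed with a small user-space model. It is a sketch only: struct fake_idev and the function names are hypothetical, the kernel timer is replaced by a plain loop, and the MLDv1 case is left out. It mirrors the counter handling in the diff to show that a Robustness Variable of N leads to N State Change Reports.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-interface state; not struct inet6_dev. */
struct fake_idev {
	int qrv;	/* Robustness Variable ([RobVar] in the commit message) */
	int dad_count;	/* reports still owed after the current one */
	int sent;	/* reports transmitted so far */
};

/* Stands in for building and sending one State Change Report. */
static void send_state_change_report(struct fake_idev *idev)
{
	idev->sent++;
}

/* Models the DAD-complete hook: send the first report, then ask for a timer. */
static int dad_complete(struct fake_idev *idev)
{
	assert(idev->qrv >= 0);
	idev->dad_count = idev->qrv;
	if (idev->dad_count) {
		send_state_change_report(idev);
		idev->dad_count--;
	}
	return idev->dad_count > 0;	/* nonzero: arm the timer */
}

/* Models one timer expiry: resend, then re-arm while reports are still owed. */
static int dad_timer_expire(struct fake_idev *idev)
{
	send_state_change_report(idev);
	if (idev->dad_count)
		idev->dad_count--;
	return idev->dad_count > 0;	/* nonzero: re-arm the timer */
}

/* Runs the whole sequence for one interface and returns the report count. */
static int reports_sent_for(int qrv)
{
	struct fake_idev idev = { .qrv = qrv };
	int rearm = dad_complete(&idev);

	while (rearm)
		rearm = dad_timer_expire(&idev);
	return idev.sent;
}
```

For example, reports_sent_for(2) returns 2: one report when DAD completes and one from the single timer expiry that follows.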
* Re: [patch net-next] neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions.
From: David Miller @ 2014-01-15 20:09 UTC (permalink / raw)
To: jiri; +Cc: netdev, jes
In-Reply-To: <1389273227-17532-1-git-send-email-jiri@resnulli.us>
From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 9 Jan 2014 14:13:47 +0100
> When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
> not initialized yet. This might cause confusion for the people who use
> NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
> usage in ndo_neigh_setup.
>
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Jiri, please respond to my feedback, this patch has been rotting in
patchwork for 6 days.
* Re: [PATCH v4 0/3] Send audit/procinfo/cgroup data in socket-level control message
From: David Miller @ 2014-01-15 20:17 UTC (permalink / raw)
To: jkaluza-H+wXaHxf7aLQT0dZR+AlfA
Cc: rgb-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, tj-DgEjT+Ai2ygdnm+yROfE0A,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1389600109-30739-1-git-send-email-jkaluza-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
From: Jan Kaluza <jkaluza-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Date: Mon, 13 Jan 2014 09:01:46 +0100
> Changes introduced in this patchset can also increase performance
> of such server-like processes, because current way of opening and
> parsing /proc/$PID/* files is much more expensive than receiving these
> metadata using SCM.
The problem with this line of reasoning is that these changes will
hurt everyone else, because these new control messages are sent
unconditionally, whether the application is interested in them or not.
I really don't like this cost tradeoff, it's terrible, and therefore
I'm really not inclined to apply these patches, sorry.
* Re: [RFC net] tcp: metrics: Avoid duplicate entries with the same destination-IP
From: David Miller @ 2014-01-15 20:18 UTC (permalink / raw)
To: eric.dumazet; +Cc: christoph.paasch, netdev
In-Reply-To: <1389631669.31367.221.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 13 Jan 2014 08:47:49 -0800
> On Mon, 2014-01-13 at 16:25 +0100, Christoph Paasch wrote:
>
>> Another solution might be to leave tcp_get_metrics() as it is, and in
>> tcpm_new do another call to __tcp_get_metrics() while holding the
>> spin-lock. We would then check __tcp_get_metrics twice for new entries
>> but we won't hold the spin-lock needlessly anymore.
>
> This is the only solution if you want to fix this.
> Cost of lookup are the cache line misses.
> Avoiding the spinlock is a must.
>
> The second 'lookup' is basically free, as the first one have populated
> cpu caches.
Indeed, taking the lock in tcp_get_metrics() is to be avoided at all
costs.
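The approach agreed on above, a lockless lookup first and a second lookup under the lock only when an entry has to be created, can be sketched in user space as follows. This is an illustration under assumptions, not the tcp_metrics code: the list, the atomic_flag spinlock and the names lookup() and get_or_create() are hypothetical, entries are never freed (the kernel relies on RCU for that), and C11 atomics replace the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical cache entry keyed by an IPv4 address. */
struct entry {
	unsigned int addr;
	struct entry *next;	/* never modified after the entry is published */
};

static _Atomic(struct entry *) head;
static atomic_flag insert_lock = ATOMIC_FLAG_INIT;	/* stands in for the spinlock */

/* Lockless lookup: readers only follow published pointers. */
static struct entry *lookup(unsigned int addr)
{
	struct entry *e = atomic_load_explicit(&head, memory_order_acquire);

	for (; e; e = e->next)
		if (e->addr == addr)
			return e;
	return NULL;
}

static struct entry *get_or_create(unsigned int addr)
{
	struct entry *e = lookup(addr);		/* fast path, no lock taken */

	if (e)
		return e;

	while (atomic_flag_test_and_set_explicit(&insert_lock, memory_order_acquire))
		;				/* spin until the lock is ours */

	e = lookup(addr);			/* second lookup: someone may have raced us */
	if (!e) {
		e = malloc(sizeof(*e));
		if (e) {
			e->addr = addr;
			e->next = atomic_load_explicit(&head, memory_order_relaxed);
			/* Publish only after the entry is fully initialised. */
			atomic_store_explicit(&head, e, memory_order_release);
		}
	}
	atomic_flag_clear_explicit(&insert_lock, memory_order_release);

	assert(!e || e->addr == addr);
	return e;
}
```

Callers that find an existing entry never touch the lock, and the second lookup walks data the first one has just pulled into the CPU caches, which is why it was described as basically free.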
* [PATCH-next v2] netfilter: don't use module_init/exit in core IPV4 code
From: Paul Gortmaker @ 2014-01-15 20:57 UTC (permalink / raw)
To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik
Cc: David S. Miller, netfilter-devel, netdev, Paul Gortmaker
In-Reply-To: <1389638147-30399-1-git-send-email-paul.gortmaker@windriver.com>
The file net/ipv4/netfilter.o is created based on whether
CONFIG_NETFILTER is set. However that is defined as a bool, and
hence this file with the core netfilter hooks will never be
modular. So using module_init as an alias for __initcall can be
somewhat misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing. Also add an inclusion of init.h, as
that was previously implicit here in the netfilter.c file.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of subsys_initcall (which
seems to make sense for netfilter code) will thus change this
registration from level 6-device to level 4-subsys (i.e. slightly
earlier). However, no impact of that small difference
has been observed during testing, nor is any expected (i.e. the
location of the netfilter messages in dmesg remains unchanged
with respect to all the other surrounding messages).
As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
[v2: Drop __exitcall stuff completely, as per Eric's suggestion
given for patch at http://patchwork.ozlabs.org/patch/311164/ ]
net/ipv4/netfilter.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index c3e0adea9c27..31abf9636ba7 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -197,11 +197,4 @@ static int __init ipv4_netfilter_init(void)
{
return nf_register_afinfo(&nf_ip_afinfo);
}
-
-static void __exit ipv4_netfilter_fini(void)
-{
- nf_unregister_afinfo(&nf_ip_afinfo);
-}
-
-module_init(ipv4_netfilter_init);
-module_exit(ipv4_netfilter_fini);
+device_initcall(ipv4_netfilter_init);
--
1.8.5.2
* [PATCH net-next 0/2] r6040: misc fixes
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
To: netdev; +Cc: davem, Florian Fainelli
Hi David,
Here are two small fixes. Patch 1 could potentially be backported to stable
trees since it affects MDIO operations.
Thanks!
Florian Fainelli (2):
r6040: add delays in MDIO read/write polling loops
r6040: use ETH_ZLEN instead of MISR for SKB length checking
drivers/net/ethernet/rdc/r6040.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--
1.8.3.2
* [PATCH net-next 2/2] r6040: use ETH_ZLEN instead of MISR for SKB length checking
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1389819866-32142-1-git-send-email-florian@openwrt.org>
Ever since this driver was merged, the following code has been included:
if (skb->len < MISR)
descptr->len = MISR;
MISR is defined as 0x3C, which happens to equal ETH_ZLEN, but ETH_ZLEN
is what we actually want to check against, so use it directly.
Reported-by: Marc Volovic <marcv@ezchip.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
drivers/net/ethernet/rdc/r6040.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index ff4683a..eb15ebf 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -836,8 +836,8 @@ static netdev_tx_t r6040_start_xmit(struct sk_buff *skb,
/* Set TX descriptor & Transmit it */
lp->tx_free_desc--;
descptr = lp->tx_insert_ptr;
- if (skb->len < MISR)
- descptr->len = MISR;
+ if (skb->len < ETH_ZLEN)
+ descptr->len = ETH_ZLEN;
else
descptr->len = skb->len;
--
1.8.3.2
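The length selection changed by the patch above amounts to clamping short frames to the minimum Ethernet frame size. A user-space sketch follows, with assumptions spelled out: tx_desc_len() is a hypothetical helper rather than a driver function, and the two constants are redefined locally (ETH_ZLEN is 60 in <linux/if_ether.h>; 0x3C is the MISR value quoted in the commit message).

```c
#include <assert.h>
#include <stddef.h>

#define ETH_ZLEN 60	/* minimum frame length without FCS, as in <linux/if_ether.h> */
#define MISR     0x3C	/* value quoted in the commit message */

/* The two constants are numerically equal, so the old code was correct by coincidence. */
static_assert(MISR == ETH_ZLEN, "0x3C and ETH_ZLEN are the same number");

/* Length to program into a TX descriptor: short frames are raised to ETH_ZLEN. */
static size_t tx_desc_len(size_t skb_len)
{
	return skb_len < ETH_ZLEN ? ETH_ZLEN : skb_len;
}
```

A 42-byte frame is therefore described as 60 bytes, while a 1514-byte frame keeps its own length.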
* [PATCH net-next 1/2] r6040: add delays in MDIO read/write polling loops
From: Florian Fainelli @ 2014-01-15 21:04 UTC (permalink / raw)
To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1389819866-32142-1-git-send-email-florian@openwrt.org>
On newer and faster machines (Vortex X86DX) using the r6040 driver, it
was noticed that the driver returned an error during probing. This was
traced down to the MDIO bus probing being unable to complete an MDIO
read operation in time. It turns out that the MDIO operations on these
faster machines usually complete after ~2140 iterations, which is more
than 2048 (MAC_DEF_TIMEOUT) and results in spurious timeouts depending
on the system load.
Update r6040_phy_read() and r6040_phy_write() to include a 1
microsecond delay in each iteration of the busy-wait loop, which is
much safer than increasing MAC_DEF_TIMEOUT.
Reported-by: Nils Koehler <nils.koehler@ibt-interfaces.de>
Reported-by: Daniel Goertzen <daniel.goertzen@gmail.com>
Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
drivers/net/ethernet/rdc/r6040.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 1e49ec5..ff4683a 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -222,6 +222,7 @@ static int r6040_phy_read(void __iomem *ioaddr, int phy_addr, int reg)
cmd = ioread16(ioaddr + MMDIO);
if (!(cmd & MDIO_READ))
break;
+ udelay(1);
}
if (limit < 0)
@@ -245,6 +246,7 @@ static int r6040_phy_write(void __iomem *ioaddr,
cmd = ioread16(ioaddr + MMDIO);
if (!(cmd & MDIO_WRITE))
break;
+ udelay(1);
}
return (limit < 0) ? -ETIMEDOUT : 0;
--
1.8.3.2
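Why a per-iteration delay helps where a fixed iteration budget does not can be shown with a simulated device. Everything here is an assumption made for illustration: struct fake_mdio, the tick-based clock and the cost of one tick per register read are hypothetical, and only MAC_DEF_TIMEOUT (2048), the figure of roughly 2140 iterations and the shape of the loop come from the commit message and diff above.

```c
#include <assert.h>
#include <errno.h>

#define MAC_DEF_TIMEOUT 2048	/* iteration budget named in the commit message */

/* Hypothetical simulated device: busy until 'ready_at' ticks have elapsed. */
struct fake_mdio {
	unsigned long now;	/* simulated time in ticks */
	unsigned long ready_at;	/* tick at which the busy bit clears */
};

/* One register read costs one tick; returns nonzero while the device is busy. */
static int fake_read_busy(struct fake_mdio *dev)
{
	dev->now++;
	return dev->now < dev->ready_at;
}

/* Stand-in for udelay(): time passes without using up loop iterations. */
static void fake_udelay(struct fake_mdio *dev, unsigned long ticks)
{
	dev->now += ticks;
}

/* Bounded poll; delay_ticks == 0 reproduces the old tight loop. */
static int poll_until_idle(struct fake_mdio *dev, unsigned long delay_ticks)
{
	int limit = MAC_DEF_TIMEOUT;

	assert(dev);
	while (limit--) {
		if (!fake_read_busy(dev))
			break;
		fake_udelay(dev, delay_ticks);
	}
	/* limit only goes negative when the loop ran out without a break. */
	return (limit < 0) ? -ETIMEDOUT : 0;
}
```

With a device that needs 2140 ticks, the tight loop runs out of iterations after 2048 reads and reports a timeout. With a delay of 10 ticks per iteration, the same budget covers far more elapsed time and the poll succeeds after fewer than 200 iterations.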
* Re: [PATCH v3 1/4] net_dma: simple removal
From: saeed bishara @ 2014-01-15 21:20 UTC (permalink / raw)
To: Dan Williams
Cc: dmaengine, Alexander Duyck, Dave Jiang, Vinod Koul,
netdev@vger.kernel.org, David Whipple, lkml, David S. Miller
In-Reply-To: <20140114004622.27138.54103.stgit@viggo.jf.intel.com>
Hi Dan,
I'm using net_dma on my system and I achieve a meaningful performance
boost when running Iperf receive.
As far as I know, net_dma is used by many embedded systems out
there, and its removal might affect their performance.
Can you please elaborate on the exact scenario that causes the memory corruption?
Is the scenario mentioned here caused by a "real life" application, or
is this more of a theoretical issue found through manual testing? I was
trying to find the thread describing the failing scenario and couldn't
find it; any pointer would be appreciated.
On Tue, Jan 14, 2014 at 2:46 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
> and there is no plan to fix it.
>
> This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
> Reverting the remainder of the net_dma induced changes is deferred to
> subsequent patches.
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Vinod Koul <vinod.koul@intel.com>
> Cc: David Whipple <whipple@securedatainnovations.ch>
> Cc: Alexander Duyck <alexander.h.duyck@intel.com>
> Acked-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>
> No changes since v2
>
>
> Documentation/ABI/removed/net_dma | 8 +
> Documentation/networking/ip-sysctl.txt | 6 -
> drivers/dma/Kconfig | 12 -
> drivers/dma/Makefile | 1
> drivers/dma/dmaengine.c | 104 ------------
> drivers/dma/ioat/dma.c | 1
> drivers/dma/ioat/dma.h | 7 -
> drivers/dma/ioat/dma_v2.c | 1
> drivers/dma/ioat/dma_v3.c | 1
> drivers/dma/iovlock.c | 280 --------------------------------
> include/linux/dmaengine.h | 22 ---
> include/linux/skbuff.h | 8 -
> include/linux/tcp.h | 8 -
> include/net/netdma.h | 32 ----
> include/net/sock.h | 19 --
> include/net/tcp.h | 8 -
> kernel/sysctl_binary.c | 1
> net/core/Makefile | 1
> net/core/dev.c | 10 -
> net/core/sock.c | 6 -
> net/core/user_dma.c | 131 ---------------
> net/dccp/proto.c | 4
> net/ipv4/sysctl_net_ipv4.c | 9 -
> net/ipv4/tcp.c | 147 ++---------------
> net/ipv4/tcp_input.c | 61 -------
> net/ipv4/tcp_ipv4.c | 18 --
> net/ipv6/tcp_ipv6.c | 13 -
> net/llc/af_llc.c | 10 +
> 28 files changed, 35 insertions(+), 894 deletions(-)
> create mode 100644 Documentation/ABI/removed/net_dma
> delete mode 100644 drivers/dma/iovlock.c
> delete mode 100644 include/net/netdma.h
> delete mode 100644 net/core/user_dma.c
>
> diff --git a/Documentation/ABI/removed/net_dma b/Documentation/ABI/removed/net_dma
> new file mode 100644
> index 000000000000..a173aecc2f18
> --- /dev/null
> +++ b/Documentation/ABI/removed/net_dma
> @@ -0,0 +1,8 @@
> +What: tcp_dma_copybreak sysctl
> +Date: Removed in kernel v3.13
> +Contact: Dan Williams <dan.j.williams@intel.com>
> +Description:
> + Formerly the lower limit, in bytes, of the size of socket reads
> + that will be offloaded to a DMA copy engine. Removed due to
> + coherency issues of the cpu potentially touching the buffers
> + while dma is in flight.
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 3c12d9a7ed00..bdd8a67f0be2 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -538,12 +538,6 @@ tcp_workaround_signed_windows - BOOLEAN
> not receive a window scaling option from them.
> Default: 0
>
> -tcp_dma_copybreak - INTEGER
> - Lower limit, in bytes, of the size of socket reads that will be
> - offloaded to a DMA copy engine, if one is present in the system
> - and CONFIG_NET_DMA is enabled.
> - Default: 4096
> -
> tcp_thin_linear_timeouts - BOOLEAN
> Enable dynamic triggering of linear timeouts for thin streams.
> If set, a check is performed upon retransmission by timeout to
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index c823daaf9043..b24f13195272 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -351,18 +351,6 @@ config DMA_OF
> comment "DMA Clients"
> depends on DMA_ENGINE
>
> -config NET_DMA
> - bool "Network: TCP receive copy offload"
> - depends on DMA_ENGINE && NET
> - default (INTEL_IOATDMA || FSL_DMA)
> - depends on BROKEN
> - help
> - This enables the use of DMA engines in the network stack to
> - offload receive copy-to-user operations, freeing CPU cycles.
> -
> - Say Y here if you enabled INTEL_IOATDMA or FSL_DMA, otherwise
> - say N.
> -
> config ASYNC_TX_DMA
> bool "Async_tx: Offload support for the async_tx api"
> depends on DMA_ENGINE
> diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
> index 0ce2da97e429..024b008a25de 100644
> --- a/drivers/dma/Makefile
> +++ b/drivers/dma/Makefile
> @@ -6,7 +6,6 @@ obj-$(CONFIG_DMA_VIRTUAL_CHANNELS) += virt-dma.o
> obj-$(CONFIG_DMA_ACPI) += acpi-dma.o
> obj-$(CONFIG_DMA_OF) += of-dma.o
>
> -obj-$(CONFIG_NET_DMA) += iovlock.o
> obj-$(CONFIG_INTEL_MID_DMAC) += intel_mid_dma.o
> obj-$(CONFIG_DMATEST) += dmatest.o
> obj-$(CONFIG_INTEL_IOATDMA) += ioat/
> diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
> index ef63b9058f3c..d7f4f4e0d71f 100644
> --- a/drivers/dma/dmaengine.c
> +++ b/drivers/dma/dmaengine.c
> @@ -1029,110 +1029,6 @@ dmaengine_get_unmap_data(struct device *dev, int nr, gfp_t flags)
> }
> EXPORT_SYMBOL(dmaengine_get_unmap_data);
>
> -/**
> - * dma_async_memcpy_pg_to_pg - offloaded copy from page to page
> - * @chan: DMA channel to offload copy to
> - * @dest_pg: destination page
> - * @dest_off: offset in page to copy to
> - * @src_pg: source page
> - * @src_off: offset in page to copy from
> - * @len: length
> - *
> - * Both @dest_page/@dest_off and @src_page/@src_off must be mappable to a bus
> - * address according to the DMA mapping API rules for streaming mappings.
> - * Both @dest_page/@dest_off and @src_page/@src_off must stay memory resident
> - * (kernel memory or locked user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_pg_to_pg(struct dma_chan *chan, struct page *dest_pg,
> - unsigned int dest_off, struct page *src_pg, unsigned int src_off,
> - size_t len)
> -{
> - struct dma_device *dev = chan->device;
> - struct dma_async_tx_descriptor *tx;
> - struct dmaengine_unmap_data *unmap;
> - dma_cookie_t cookie;
> - unsigned long flags;
> -
> - unmap = dmaengine_get_unmap_data(dev->dev, 2, GFP_NOWAIT);
> - if (!unmap)
> - return -ENOMEM;
> -
> - unmap->to_cnt = 1;
> - unmap->from_cnt = 1;
> - unmap->addr[0] = dma_map_page(dev->dev, src_pg, src_off, len,
> - DMA_TO_DEVICE);
> - unmap->addr[1] = dma_map_page(dev->dev, dest_pg, dest_off, len,
> - DMA_FROM_DEVICE);
> - unmap->len = len;
> - flags = DMA_CTRL_ACK;
> - tx = dev->device_prep_dma_memcpy(chan, unmap->addr[1], unmap->addr[0],
> - len, flags);
> -
> - if (!tx) {
> - dmaengine_unmap_put(unmap);
> - return -ENOMEM;
> - }
> -
> - dma_set_unmap(tx, unmap);
> - cookie = tx->tx_submit(tx);
> - dmaengine_unmap_put(unmap);
> -
> - preempt_disable();
> - __this_cpu_add(chan->local->bytes_transferred, len);
> - __this_cpu_inc(chan->local->memcpy_count);
> - preempt_enable();
> -
> - return cookie;
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
> -
> -/**
> - * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses
> - * @chan: DMA channel to offload copy to
> - * @dest: destination address (virtual)
> - * @src: source address (virtual)
> - * @len: length
> - *
> - * Both @dest and @src must be mappable to a bus address according to the
> - * DMA mapping API rules for streaming mappings.
> - * Both @dest and @src must stay memory resident (kernel memory or locked
> - * user space pages).
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_buf(struct dma_chan *chan, void *dest,
> - void *src, size_t len)
> -{
> - return dma_async_memcpy_pg_to_pg(chan, virt_to_page(dest),
> - (unsigned long) dest & ~PAGE_MASK,
> - virt_to_page(src),
> - (unsigned long) src & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
> -
> -/**
> - * dma_async_memcpy_buf_to_pg - offloaded copy from address to page
> - * @chan: DMA channel to offload copy to
> - * @page: destination page
> - * @offset: offset in page to copy to
> - * @kdata: source address (virtual)
> - * @len: length
> - *
> - * Both @page/@offset and @kdata must be mappable to a bus address according
> - * to the DMA mapping API rules for streaming mappings.
> - * Both @page/@offset and @kdata must stay memory resident (kernel memory or
> - * locked user space pages)
> - */
> -dma_cookie_t
> -dma_async_memcpy_buf_to_pg(struct dma_chan *chan, struct page *page,
> - unsigned int offset, void *kdata, size_t len)
> -{
> - return dma_async_memcpy_pg_to_pg(chan, page, offset,
> - virt_to_page(kdata),
> - (unsigned long) kdata & ~PAGE_MASK, len);
> -}
> -EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
> -
> void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
> struct dma_chan *chan)
> {
> diff --git a/drivers/dma/ioat/dma.c b/drivers/dma/ioat/dma.c
> index 1a49c777607c..97fa394ca855 100644
> --- a/drivers/dma/ioat/dma.c
> +++ b/drivers/dma/ioat/dma.c
> @@ -1175,7 +1175,6 @@ int ioat1_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(4096);
> err = ioat_register(device);
> if (err)
> return err;
> diff --git a/drivers/dma/ioat/dma.h b/drivers/dma/ioat/dma.h
> index 11fb877ddca9..664ec9cbd651 100644
> --- a/drivers/dma/ioat/dma.h
> +++ b/drivers/dma/ioat/dma.h
> @@ -214,13 +214,6 @@ __dump_desc_dbg(struct ioat_chan_common *chan, struct ioat_dma_descriptor *hw,
> #define dump_desc_dbg(c, d) \
> ({ if (d) __dump_desc_dbg(&c->base, d->hw, &d->txd, desc_id(d)); 0; })
>
> -static inline void ioat_set_tcp_copy_break(unsigned long copybreak)
> -{
> - #ifdef CONFIG_NET_DMA
> - sysctl_tcp_dma_copybreak = copybreak;
> - #endif
> -}
> -
> static inline struct ioat_chan_common *
> ioat_chan_by_index(struct ioatdma_device *device, int index)
> {
> diff --git a/drivers/dma/ioat/dma_v2.c b/drivers/dma/ioat/dma_v2.c
> index 5d3affe7e976..31e8098e444f 100644
> --- a/drivers/dma/ioat/dma_v2.c
> +++ b/drivers/dma/ioat/dma_v2.c
> @@ -900,7 +900,6 @@ int ioat2_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(2048);
>
> list_for_each_entry(c, &dma->channels, device_node) {
> chan = to_chan_common(c);
> diff --git a/drivers/dma/ioat/dma_v3.c b/drivers/dma/ioat/dma_v3.c
> index 820817e97e62..4bb81346bee2 100644
> --- a/drivers/dma/ioat/dma_v3.c
> +++ b/drivers/dma/ioat/dma_v3.c
> @@ -1652,7 +1652,6 @@ int ioat3_dma_probe(struct ioatdma_device *device, int dca)
> err = ioat_probe(device);
> if (err)
> return err;
> - ioat_set_tcp_copy_break(262144);
>
> list_for_each_entry(c, &dma->channels, device_node) {
> chan = to_chan_common(c);
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> deleted file mode 100644
> index bb48a57c2fc1..000000000000
> --- a/drivers/dma/iovlock.c
> +++ /dev/null
> @@ -1,280 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/pagemap.h>
> -#include <linux/slab.h>
> -#include <net/tcp.h> /* for memcpy_toiovec */
> -#include <asm/io.h>
> -#include <asm/uaccess.h>
> -
> -static int num_pages_spanned(struct iovec *iov)
> -{
> - return
> - ((PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
> - ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT);
> -}
> -
> -/*
> - * Pin down all the iovec pages needed for len bytes.
> - * Return a struct dma_pinned_list to keep track of pages pinned down.
> - *
> - * We are allocating a single chunk of memory, and then carving it up into
> - * 3 sections, the latter 2 whose size depends on the number of iovecs and the
> - * total number of pages, respectively.
> - */
> -struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
> -{
> - struct dma_pinned_list *local_list;
> - struct page **pages;
> - int i;
> - int ret;
> - int nr_iovecs = 0;
> - int iovec_len_used = 0;
> - int iovec_pages_used = 0;
> -
> - /* don't pin down non-user-based iovecs */
> - if (segment_eq(get_fs(), KERNEL_DS))
> - return NULL;
> -
> - /* determine how many iovecs/pages there are, up front */
> - do {
> - iovec_len_used += iov[nr_iovecs].iov_len;
> - iovec_pages_used += num_pages_spanned(&iov[nr_iovecs]);
> - nr_iovecs++;
> - } while (iovec_len_used < len);
> -
> - /* single kmalloc for pinned list, page_list[], and the page arrays */
> - local_list = kmalloc(sizeof(*local_list)
> - + (nr_iovecs * sizeof (struct dma_page_list))
> - + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
> - if (!local_list)
> - goto out;
> -
> - /* list of pages starts right after the page list array */
> - pages = (struct page **) &local_list->page_list[nr_iovecs];
> -
> - local_list->nr_iovecs = 0;
> -
> - for (i = 0; i < nr_iovecs; i++) {
> - struct dma_page_list *page_list = &local_list->page_list[i];
> -
> - len -= iov[i].iov_len;
> -
> - if (!access_ok(VERIFY_WRITE, iov[i].iov_base, iov[i].iov_len))
> - goto unpin;
> -
> - page_list->nr_pages = num_pages_spanned(&iov[i]);
> - page_list->base_address = iov[i].iov_base;
> -
> - page_list->pages = pages;
> - pages += page_list->nr_pages;
> -
> - /* pin pages down */
> - down_read(&current->mm->mmap_sem);
> - ret = get_user_pages(
> - current,
> - current->mm,
> - (unsigned long) iov[i].iov_base,
> - page_list->nr_pages,
> - 1, /* write */
> - 0, /* force */
> - page_list->pages,
> - NULL);
> - up_read(&current->mm->mmap_sem);
> -
> - if (ret != page_list->nr_pages)
> - goto unpin;
> -
> - local_list->nr_iovecs = i + 1;
> - }
> -
> - return local_list;
> -
> -unpin:
> - dma_unpin_iovec_pages(local_list);
> -out:
> - return NULL;
> -}
> -
> -void dma_unpin_iovec_pages(struct dma_pinned_list *pinned_list)
> -{
> - int i, j;
> -
> - if (!pinned_list)
> - return;
> -
> - for (i = 0; i < pinned_list->nr_iovecs; i++) {
> - struct dma_page_list *page_list = &pinned_list->page_list[i];
> - for (j = 0; j < page_list->nr_pages; j++) {
> - set_page_dirty_lock(page_list->pages[j]);
> - page_cache_release(page_list->pages[j]);
> - }
> - }
> -
> - kfree(pinned_list);
> -}
> -
> -
> -/*
> - * We have already pinned down the pages we will be using in the iovecs.
> - * Each entry in iov array has corresponding entry in pinned_list->page_list.
> - * Using array indexing to keep iov[] and page_list[] in sync.
> - * Initial elements in iov array's iov->iov_len will be 0 if already copied into
> - * by another call.
> - * iov array length remaining guaranteed to be bigger than len.
> - */
> -dma_cookie_t dma_memcpy_to_iovec(struct dma_chan *chan, struct iovec *iov,
> - struct dma_pinned_list *pinned_list, unsigned char *kdata, size_t len)
> -{
> - int iov_byte_offset;
> - int copy;
> - dma_cookie_t dma_cookie = 0;
> - int iovec_idx;
> - int page_idx;
> -
> - if (!chan)
> - return memcpy_toiovec(iov, kdata, len);
> -
> - iovec_idx = 0;
> - while (iovec_idx < pinned_list->nr_iovecs) {
> - struct dma_page_list *page_list;
> -
> - /* skip already used-up iovecs */
> - while (!iov[iovec_idx].iov_len)
> - iovec_idx++;
> -
> - page_list = &pinned_list->page_list[iovec_idx];
> -
> - iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> - page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> - - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> - /* break up copies to not cross page boundary */
> - while (iov[iovec_idx].iov_len) {
> - copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> - copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> - dma_cookie = dma_async_memcpy_buf_to_pg(chan,
> - page_list->pages[page_idx],
> - iov_byte_offset,
> - kdata,
> - copy);
> - /* poll for a descriptor slot */
> - if (unlikely(dma_cookie < 0)) {
> - dma_async_issue_pending(chan);
> - continue;
> - }
> -
> - len -= copy;
> - iov[iovec_idx].iov_len -= copy;
> - iov[iovec_idx].iov_base += copy;
> -
> - if (!len)
> - return dma_cookie;
> -
> - kdata += copy;
> - iov_byte_offset = 0;
> - page_idx++;
> - }
> - iovec_idx++;
> - }
> -
> - /* really bad if we ever run out of iovecs */
> - BUG();
> - return -EFAULT;
> -}
> -
> -dma_cookie_t dma_memcpy_pg_to_iovec(struct dma_chan *chan, struct iovec *iov,
> - struct dma_pinned_list *pinned_list, struct page *page,
> - unsigned int offset, size_t len)
> -{
> - int iov_byte_offset;
> - int copy;
> - dma_cookie_t dma_cookie = 0;
> - int iovec_idx;
> - int page_idx;
> - int err;
> -
> - /* this needs as-yet-unimplemented buf-to-buff, so punt. */
> - /* TODO: use dma for this */
> - if (!chan || !pinned_list) {
> - u8 *vaddr = kmap(page);
> - err = memcpy_toiovec(iov, vaddr + offset, len);
> - kunmap(page);
> - return err;
> - }
> -
> - iovec_idx = 0;
> - while (iovec_idx < pinned_list->nr_iovecs) {
> - struct dma_page_list *page_list;
> -
> - /* skip already used-up iovecs */
> - while (!iov[iovec_idx].iov_len)
> - iovec_idx++;
> -
> - page_list = &pinned_list->page_list[iovec_idx];
> -
> - iov_byte_offset = ((unsigned long)iov[iovec_idx].iov_base & ~PAGE_MASK);
> - page_idx = (((unsigned long)iov[iovec_idx].iov_base & PAGE_MASK)
> - - ((unsigned long)page_list->base_address & PAGE_MASK)) >> PAGE_SHIFT;
> -
> - /* break up copies to not cross page boundary */
> - while (iov[iovec_idx].iov_len) {
> - copy = min_t(int, PAGE_SIZE - iov_byte_offset, len);
> - copy = min_t(int, copy, iov[iovec_idx].iov_len);
> -
> - dma_cookie = dma_async_memcpy_pg_to_pg(chan,
> - page_list->pages[page_idx],
> - iov_byte_offset,
> - page,
> - offset,
> - copy);
> - /* poll for a descriptor slot */
> - if (unlikely(dma_cookie < 0)) {
> - dma_async_issue_pending(chan);
> - continue;
> - }
> -
> - len -= copy;
> - iov[iovec_idx].iov_len -= copy;
> - iov[iovec_idx].iov_base += copy;
> -
> - if (!len)
> - return dma_cookie;
> -
> - offset += copy;
> - iov_byte_offset = 0;
> - page_idx++;
> - }
> - iovec_idx++;
> - }
> -
> - /* really bad if we ever run out of iovecs */
> - BUG();
> - return -EFAULT;
> -}
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index 41cf0c399288..890545871af0 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -875,18 +875,6 @@ static inline void dmaengine_put(void)
> }
> #endif
>
> -#ifdef CONFIG_NET_DMA
> -#define net_dmaengine_get() dmaengine_get()
> -#define net_dmaengine_put() dmaengine_put()
> -#else
> -static inline void net_dmaengine_get(void)
> -{
> -}
> -static inline void net_dmaengine_put(void)
> -{
> -}
> -#endif
> -
> #ifdef CONFIG_ASYNC_TX_DMA
> #define async_dmaengine_get() dmaengine_get()
> #define async_dmaengine_put() dmaengine_put()
> @@ -908,16 +896,8 @@ async_dma_find_channel(enum dma_transaction_type type)
> return NULL;
> }
> #endif /* CONFIG_ASYNC_TX_DMA */
> -
> -dma_cookie_t dma_async_memcpy_buf_to_buf(struct dma_chan *chan,
> - void *dest, void *src, size_t len);
> -dma_cookie_t dma_async_memcpy_buf_to_pg(struct dma_chan *chan,
> - struct page *page, unsigned int offset, void *kdata, size_t len);
> -dma_cookie_t dma_async_memcpy_pg_to_pg(struct dma_chan *chan,
> - struct page *dest_pg, unsigned int dest_off, struct page *src_pg,
> - unsigned int src_off, size_t len);
> void dma_async_tx_descriptor_init(struct dma_async_tx_descriptor *tx,
> - struct dma_chan *chan);
> + struct dma_chan *chan);
>
> static inline void async_tx_ack(struct dma_async_tx_descriptor *tx)
> {
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bec1cc7d5e3c..ac4f84dfa84b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -28,7 +28,6 @@
> #include <linux/textsearch.h>
> #include <net/checksum.h>
> #include <linux/rcupdate.h>
> -#include <linux/dmaengine.h>
> #include <linux/hrtimer.h>
> #include <linux/dma-mapping.h>
> #include <linux/netdev_features.h>
> @@ -496,11 +495,8 @@ struct sk_buff {
> /* 6/8 bit hole (depending on ndisc_nodetype presence) */
> kmemcheck_bitfield_end(flags2);
>
> -#if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
> - union {
> - unsigned int napi_id;
> - dma_cookie_t dma_cookie;
> - };
> +#ifdef CONFIG_NET_RX_BUSY_POLL
> + unsigned int napi_id;
> #endif
> #ifdef CONFIG_NETWORK_SECMARK
> __u32 secmark;
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d68633452d9b..26f16021ce1d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -19,7 +19,6 @@
>
>
> #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
> #include <net/sock.h>
> #include <net/inet_connection_sock.h>
> #include <net/inet_timewait_sock.h>
> @@ -169,13 +168,6 @@ struct tcp_sock {
> struct iovec *iov;
> int memory;
> int len;
> -#ifdef CONFIG_NET_DMA
> - /* members for async copy */
> - struct dma_chan *dma_chan;
> - int wakeup;
> - struct dma_pinned_list *pinned_list;
> - dma_cookie_t dma_cookie;
> -#endif
> } ucopy;
>
> u32 snd_wl1; /* Sequence for window update */
> diff --git a/include/net/netdma.h b/include/net/netdma.h
> deleted file mode 100644
> index 8ba8ce284eeb..000000000000
> --- a/include/net/netdma.h
> +++ /dev/null
> @@ -1,32 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -#ifndef NETDMA_H
> -#define NETDMA_H
> -#ifdef CONFIG_NET_DMA
> -#include <linux/dmaengine.h>
> -#include <linux/skbuff.h>
> -
> -int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
> - struct sk_buff *skb, int offset, struct iovec *to,
> - size_t len, struct dma_pinned_list *pinned_list);
> -
> -#endif /* CONFIG_NET_DMA */
> -#endif /* NETDMA_H */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e3a18ff0c38b..9d5f716e921e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -231,7 +231,6 @@ struct cg_proto;
> * @sk_receive_queue: incoming packets
> * @sk_wmem_alloc: transmit queue bytes committed
> * @sk_write_queue: Packet sending queue
> - * @sk_async_wait_queue: DMA copied packets
> * @sk_omem_alloc: "o" is "option" or "other"
> * @sk_wmem_queued: persistent queue size
> * @sk_forward_alloc: space allocated forward
> @@ -354,10 +353,6 @@ struct sock {
> struct sk_filter __rcu *sk_filter;
> struct socket_wq __rcu *sk_wq;
>
> -#ifdef CONFIG_NET_DMA
> - struct sk_buff_head sk_async_wait_queue;
> -#endif
> -
> #ifdef CONFIG_XFRM
> struct xfrm_policy *sk_policy[2];
> #endif
> @@ -2200,27 +2195,15 @@ void sock_tx_timestamp(struct sock *sk, __u8 *tx_flags);
> * sk_eat_skb - Release a skb if it is no longer needed
> * @sk: socket to eat this skb from
> * @skb: socket buffer to eat
> - * @copied_early: flag indicating whether DMA operations copied this data early
> *
> * This routine must be called with interrupts disabled or with the socket
> * locked so that the sk_buff queue operation is ok.
> */
> -#ifdef CONFIG_NET_DMA
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> -{
> - __skb_unlink(skb, &sk->sk_receive_queue);
> - if (!copied_early)
> - __kfree_skb(skb);
> - else
> - __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> -}
> -#else
> -static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early)
> +static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
> {
> __skb_unlink(skb, &sk->sk_receive_queue);
> __kfree_skb(skb);
> }
> -#endif
>
> static inline
> struct net *sock_net(const struct sock *sk)
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 70e55d200610..084c163e9d40 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -27,7 +27,6 @@
> #include <linux/cache.h>
> #include <linux/percpu.h>
> #include <linux/skbuff.h>
> -#include <linux/dmaengine.h>
> #include <linux/crypto.h>
> #include <linux/cryptohash.h>
> #include <linux/kref.h>
> @@ -267,7 +266,6 @@ extern int sysctl_tcp_adv_win_scale;
> extern int sysctl_tcp_tw_reuse;
> extern int sysctl_tcp_frto;
> extern int sysctl_tcp_low_latency;
> -extern int sysctl_tcp_dma_copybreak;
> extern int sysctl_tcp_nometrics_save;
> extern int sysctl_tcp_moderate_rcvbuf;
> extern int sysctl_tcp_tso_win_divisor;
> @@ -1032,12 +1030,6 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
> tp->ucopy.len = 0;
> tp->ucopy.memory = 0;
> skb_queue_head_init(&tp->ucopy.prequeue);
> -#ifdef CONFIG_NET_DMA
> - tp->ucopy.dma_chan = NULL;
> - tp->ucopy.wakeup = 0;
> - tp->ucopy.pinned_list = NULL;
> - tp->ucopy.dma_cookie = 0;
> -#endif
> }
>
> bool tcp_prequeue(struct sock *sk, struct sk_buff *skb);
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 653cbbd9e7ad..d457005acedf 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -390,7 +390,6 @@ static const struct bin_table bin_net_ipv4_table[] = {
> { CTL_INT, NET_TCP_MTU_PROBING, "tcp_mtu_probing" },
> { CTL_INT, NET_TCP_BASE_MSS, "tcp_base_mss" },
> { CTL_INT, NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS, "tcp_workaround_signed_windows" },
> - { CTL_INT, NET_TCP_DMA_COPYBREAK, "tcp_dma_copybreak" },
> { CTL_INT, NET_TCP_SLOW_START_AFTER_IDLE, "tcp_slow_start_after_idle" },
> { CTL_INT, NET_CIPSOV4_CACHE_ENABLE, "cipso_cache_enable" },
> { CTL_INT, NET_CIPSOV4_CACHE_BUCKET_SIZE, "cipso_cache_bucket_size" },
> diff --git a/net/core/Makefile b/net/core/Makefile
> index b33b996f5dd6..5f98e5983bd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -16,7 +16,6 @@ obj-y += net-sysfs.o
> obj-$(CONFIG_PROC_FS) += net-procfs.o
> obj-$(CONFIG_NET_PKTGEN) += pktgen.o
> obj-$(CONFIG_NETPOLL) += netpoll.o
> -obj-$(CONFIG_NET_DMA) += user_dma.o
> obj-$(CONFIG_FIB_RULES) += fib_rules.o
> obj-$(CONFIG_TRACEPOINTS) += net-traces.o
> obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ba3b7ea5ebb3..677a5a4dcca7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1262,7 +1262,6 @@ static int __dev_open(struct net_device *dev)
> clear_bit(__LINK_STATE_START, &dev->state);
> else {
> dev->flags |= IFF_UP;
> - net_dmaengine_get();
> dev_set_rx_mode(dev);
> dev_activate(dev);
> add_device_randomness(dev->dev_addr, dev->addr_len);
> @@ -1338,7 +1337,6 @@ static int __dev_close_many(struct list_head *head)
> ops->ndo_stop(dev);
>
> dev->flags &= ~IFF_UP;
> - net_dmaengine_put();
> }
>
> return 0;
> @@ -4362,14 +4360,6 @@ static void net_rx_action(struct softirq_action *h)
> out:
> net_rps_action_and_irq_enable(sd);
>
> -#ifdef CONFIG_NET_DMA
> - /*
> - * There may not be any more sk_buffs coming right now, so push
> - * any pending DMA copies to hardware
> - */
> - dma_issue_pending_all();
> -#endif
> -
> return;
>
> softnet_break:
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ab20ed9b0f31..411dab3a5726 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1461,9 +1461,6 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
> atomic_set(&newsk->sk_omem_alloc, 0);
> skb_queue_head_init(&newsk->sk_receive_queue);
> skb_queue_head_init(&newsk->sk_write_queue);
> -#ifdef CONFIG_NET_DMA
> - skb_queue_head_init(&newsk->sk_async_wait_queue);
> -#endif
>
> spin_lock_init(&newsk->sk_dst_lock);
> rwlock_init(&newsk->sk_callback_lock);
> @@ -2290,9 +2287,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
> skb_queue_head_init(&sk->sk_receive_queue);
> skb_queue_head_init(&sk->sk_write_queue);
> skb_queue_head_init(&sk->sk_error_queue);
> -#ifdef CONFIG_NET_DMA
> - skb_queue_head_init(&sk->sk_async_wait_queue);
> -#endif
>
> sk->sk_send_head = NULL;
>
> diff --git a/net/core/user_dma.c b/net/core/user_dma.c
> deleted file mode 100644
> index 1b5fefdb8198..000000000000
> --- a/net/core/user_dma.c
> +++ /dev/null
> @@ -1,131 +0,0 @@
> -/*
> - * Copyright(c) 2004 - 2006 Intel Corporation. All rights reserved.
> - * Portions based on net/core/datagram.c and copyrighted by their authors.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms of the GNU General Public License as published by the Free
> - * Software Foundation; either version 2 of the License, or (at your option)
> - * any later version.
> - *
> - * This program is distributed in the hope that it will be useful, but WITHOUT
> - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> - * more details.
> - *
> - * You should have received a copy of the GNU General Public License along with
> - * this program; if not, write to the Free Software Foundation, Inc., 59
> - * Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> - *
> - * The full GNU General Public License is included in this distribution in the
> - * file called COPYING.
> - */
> -
> -/*
> - * This code allows the net stack to make use of a DMA engine for
> - * skb to iovec copies.
> - */
> -
> -#include <linux/dmaengine.h>
> -#include <linux/socket.h>
> -#include <linux/export.h>
> -#include <net/tcp.h>
> -#include <net/netdma.h>
> -
> -#define NET_DMA_DEFAULT_COPYBREAK 4096
> -
> -int sysctl_tcp_dma_copybreak = NET_DMA_DEFAULT_COPYBREAK;
> -EXPORT_SYMBOL(sysctl_tcp_dma_copybreak);
> -
> -/**
> - * dma_skb_copy_datagram_iovec - Copy a datagram to an iovec.
> - * @skb - buffer to copy
> - * @offset - offset in the buffer to start copying from
> - * @iovec - io vector to copy to
> - * @len - amount of data to copy from buffer to iovec
> - * @pinned_list - locked iovec buffer data
> - *
> - * Note: the iovec is modified during the copy.
> - */
> -int dma_skb_copy_datagram_iovec(struct dma_chan *chan,
> - struct sk_buff *skb, int offset, struct iovec *to,
> - size_t len, struct dma_pinned_list *pinned_list)
> -{
> - int start = skb_headlen(skb);
> - int i, copy = start - offset;
> - struct sk_buff *frag_iter;
> - dma_cookie_t cookie = 0;
> -
> - /* Copy header. */
> - if (copy > 0) {
> - if (copy > len)
> - copy = len;
> - cookie = dma_memcpy_to_iovec(chan, to, pinned_list,
> - skb->data + offset, copy);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> -
> - /* Copy paged appendix. Hmm... why does this look so complicated? */
> - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> - int end;
> - const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> -
> - WARN_ON(start > offset + len);
> -
> - end = start + skb_frag_size(frag);
> - copy = end - offset;
> - if (copy > 0) {
> - struct page *page = skb_frag_page(frag);
> -
> - if (copy > len)
> - copy = len;
> -
> - cookie = dma_memcpy_pg_to_iovec(chan, to, pinned_list, page,
> - frag->page_offset + offset - start, copy);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> - start = end;
> - }
> -
> - skb_walk_frags(skb, frag_iter) {
> - int end;
> -
> - WARN_ON(start > offset + len);
> -
> - end = start + frag_iter->len;
> - copy = end - offset;
> - if (copy > 0) {
> - if (copy > len)
> - copy = len;
> - cookie = dma_skb_copy_datagram_iovec(chan, frag_iter,
> - offset - start,
> - to, copy,
> - pinned_list);
> - if (cookie < 0)
> - goto fault;
> - len -= copy;
> - if (len == 0)
> - goto end;
> - offset += copy;
> - }
> - start = end;
> - }
> -
> -end:
> - if (!len) {
> - skb->dma_cookie = cookie;
> - return cookie;
> - }
> -
> -fault:
> - return -EFAULT;
> -}
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index eb892b4f4814..f9076f295b13 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -848,7 +848,7 @@ int dccp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> default:
> dccp_pr_debug("packet_type=%s\n",
> dccp_packet_name(dh->dccph_type));
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> }
> verify_sock_status:
> if (sock_flag(sk, SOCK_DONE)) {
> @@ -905,7 +905,7 @@ verify_sock_status:
> len = skb->len;
> found_fin_ok:
> if (!(flags & MSG_PEEK))
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> break;
> } while (1);
> out:
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 3d69ec8dac57..79a90b92e12d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -642,15 +642,6 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> -#ifdef CONFIG_NET_DMA
> - {
> - .procname = "tcp_dma_copybreak",
> - .data = &sysctl_tcp_dma_copybreak,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec
> - },
> -#endif
> {
> .procname = "tcp_slow_start_after_idle",
> .data = &sysctl_tcp_slow_start_after_idle,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c4638e6f0238..8dc913dfbaef 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -274,7 +274,6 @@
> #include <net/tcp.h>
> #include <net/xfrm.h>
> #include <net/ip.h>
> -#include <net/netdma.h>
> #include <net/sock.h>
>
> #include <asm/uaccess.h>
> @@ -1409,39 +1408,6 @@ static void tcp_prequeue_process(struct sock *sk)
> tp->ucopy.memory = 0;
> }
>
> -#ifdef CONFIG_NET_DMA
> -static void tcp_service_net_dma(struct sock *sk, bool wait)
> -{
> - dma_cookie_t done, used;
> - dma_cookie_t last_issued;
> - struct tcp_sock *tp = tcp_sk(sk);
> -
> - if (!tp->ucopy.dma_chan)
> - return;
> -
> - last_issued = tp->ucopy.dma_cookie;
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> - do {
> - if (dma_async_is_tx_complete(tp->ucopy.dma_chan,
> - last_issued, &done,
> - &used) == DMA_COMPLETE) {
> - /* Safe to free early-copied skbs now */
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> - break;
> - } else {
> - struct sk_buff *skb;
> - while ((skb = skb_peek(&sk->sk_async_wait_queue)) &&
> - (dma_async_is_complete(skb->dma_cookie, done,
> - used) == DMA_COMPLETE)) {
> - __skb_dequeue(&sk->sk_async_wait_queue);
> - kfree_skb(skb);
> - }
> - }
> - } while (wait);
> -}
> -#endif
> -
> static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
> {
> struct sk_buff *skb;
> @@ -1459,7 +1425,7 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
> * splitted a fat GRO packet, while we released socket lock
> * in skb_splice_bits()
> */
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> }
> return NULL;
> }
> @@ -1525,11 +1491,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> continue;
> }
> if (tcp_hdr(skb)->fin) {
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> ++seq;
> break;
> }
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> if (!desc->count)
> break;
> tp->copied_seq = seq;
> @@ -1567,7 +1533,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> int target; /* Read at least this many bytes */
> long timeo;
> struct task_struct *user_recv = NULL;
> - bool copied_early = false;
> struct sk_buff *skb;
> u32 urg_hole = 0;
>
> @@ -1610,28 +1575,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
>
> target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
>
> -#ifdef CONFIG_NET_DMA
> - tp->ucopy.dma_chan = NULL;
> - preempt_disable();
> - skb = skb_peek_tail(&sk->sk_receive_queue);
> - {
> - int available = 0;
> -
> - if (skb)
> - available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
> - if ((available < target) &&
> - (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
> - !sysctl_tcp_low_latency &&
> - net_dma_find_channel()) {
> - preempt_enable_no_resched();
> - tp->ucopy.pinned_list =
> - dma_pin_iovec_pages(msg->msg_iov, len);
> - } else {
> - preempt_enable_no_resched();
> - }
> - }
> -#endif
> -
> do {
> u32 offset;
>
> @@ -1762,16 +1705,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> /* __ Set realtime policy in scheduler __ */
> }
>
> -#ifdef CONFIG_NET_DMA
> - if (tp->ucopy.dma_chan) {
> - if (tp->rcv_wnd == 0 &&
> - !skb_queue_empty(&sk->sk_async_wait_queue)) {
> - tcp_service_net_dma(sk, true);
> - tcp_cleanup_rbuf(sk, copied);
> - } else
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> - }
> -#endif
> if (copied >= target) {
> /* Do not sleep, just process backlog. */
> release_sock(sk);
> @@ -1779,11 +1712,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
> } else
> sk_wait_data(sk, &timeo);
>
> -#ifdef CONFIG_NET_DMA
> - tcp_service_net_dma(sk, false); /* Don't block */
> - tp->ucopy.wakeup = 0;
> -#endif
> -
> if (user_recv) {
> int chunk;
>
> @@ -1841,43 +1769,13 @@ do_prequeue:
> }
>
> if (!(flags & MSG_TRUNC)) {
> -#ifdef CONFIG_NET_DMA
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> -
> - if (tp->ucopy.dma_chan) {
> - tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
> - tp->ucopy.dma_chan, skb, offset,
> - msg->msg_iov, used,
> - tp->ucopy.pinned_list);
> -
> - if (tp->ucopy.dma_cookie < 0) {
> -
> - pr_alert("%s: dma_cookie < 0\n",
> - __func__);
> -
> - /* Exception. Bailout! */
> - if (!copied)
> - copied = -EFAULT;
> - break;
> - }
> -
> - dma_async_issue_pending(tp->ucopy.dma_chan);
> -
> - if ((offset + used) == skb->len)
> - copied_early = true;
> -
> - } else
> -#endif
> - {
> - err = skb_copy_datagram_iovec(skb, offset,
> - msg->msg_iov, used);
> - if (err) {
> - /* Exception. Bailout! */
> - if (!copied)
> - copied = -EFAULT;
> - break;
> - }
> + err = skb_copy_datagram_iovec(skb, offset,
> + msg->msg_iov, used);
> + if (err) {
> + /* Exception. Bailout! */
> + if (!copied)
> + copied = -EFAULT;
> + break;
> }
> }
>
> @@ -1897,19 +1795,15 @@ skip_copy:
>
> if (tcp_hdr(skb)->fin)
> goto found_fin_ok;
> - if (!(flags & MSG_PEEK)) {
> - sk_eat_skb(sk, skb, copied_early);
> - copied_early = false;
> - }
> + if (!(flags & MSG_PEEK))
> + sk_eat_skb(sk, skb);
> continue;
>
> found_fin_ok:
> /* Process the FIN. */
> ++*seq;
> - if (!(flags & MSG_PEEK)) {
> - sk_eat_skb(sk, skb, copied_early);
> - copied_early = false;
> - }
> + if (!(flags & MSG_PEEK))
> + sk_eat_skb(sk, skb);
> break;
> } while (len > 0);
>
> @@ -1932,16 +1826,6 @@ skip_copy:
> tp->ucopy.len = 0;
> }
>
> -#ifdef CONFIG_NET_DMA
> - tcp_service_net_dma(sk, true); /* Wait for queue to drain */
> - tp->ucopy.dma_chan = NULL;
> -
> - if (tp->ucopy.pinned_list) {
> - dma_unpin_iovec_pages(tp->ucopy.pinned_list);
> - tp->ucopy.pinned_list = NULL;
> - }
> -#endif
> -
> /* According to UNIX98, msg_name/msg_namelen are ignored
> * on connected socket. I was just happy when found this 8) --ANK
> */
> @@ -2285,9 +2169,6 @@ int tcp_disconnect(struct sock *sk, int flags)
> __skb_queue_purge(&sk->sk_receive_queue);
> tcp_write_queue_purge(sk);
> __skb_queue_purge(&tp->out_of_order_queue);
> -#ifdef CONFIG_NET_DMA
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
>
> inet->inet_dport = 0;
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c53b7f35c51d..33ef18e550c5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -73,7 +73,6 @@
> #include <net/inet_common.h>
> #include <linux/ipsec.h>
> #include <asm/unaligned.h>
> -#include <net/netdma.h>
>
> int sysctl_tcp_timestamps __read_mostly = 1;
> int sysctl_tcp_window_scaling __read_mostly = 1;
> @@ -4967,53 +4966,6 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
> __tcp_checksum_complete_user(sk, skb);
> }
>
> -#ifdef CONFIG_NET_DMA
> -static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
> - int hlen)
> -{
> - struct tcp_sock *tp = tcp_sk(sk);
> - int chunk = skb->len - hlen;
> - int dma_cookie;
> - bool copied_early = false;
> -
> - if (tp->ucopy.wakeup)
> - return false;
> -
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> -
> - if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
> -
> - dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
> - skb, hlen,
> - tp->ucopy.iov, chunk,
> - tp->ucopy.pinned_list);
> -
> - if (dma_cookie < 0)
> - goto out;
> -
> - tp->ucopy.dma_cookie = dma_cookie;
> - copied_early = true;
> -
> - tp->ucopy.len -= chunk;
> - tp->copied_seq += chunk;
> - tcp_rcv_space_adjust(sk);
> -
> - if ((tp->ucopy.len == 0) ||
> - (tcp_flag_word(tcp_hdr(skb)) & TCP_FLAG_PSH) ||
> - (atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1))) {
> - tp->ucopy.wakeup = 1;
> - sk->sk_data_ready(sk, 0);
> - }
> - } else if (chunk > 0) {
> - tp->ucopy.wakeup = 1;
> - sk->sk_data_ready(sk, 0);
> - }
> -out:
> - return copied_early;
> -}
> -#endif /* CONFIG_NET_DMA */
> -
> /* Does PAWS and seqno based validation of an incoming segment, flags will
> * play significant role here.
> */
> @@ -5198,14 +5150,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>
> if (tp->copied_seq == tp->rcv_nxt &&
> len - tcp_header_len <= tp->ucopy.len) {
> -#ifdef CONFIG_NET_DMA
> - if (tp->ucopy.task == current &&
> - sock_owned_by_user(sk) &&
> - tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
> - copied_early = 1;
> - eaten = 1;
> - }
> -#endif
> if (tp->ucopy.task == current &&
> sock_owned_by_user(sk) && !copied_early) {
> __set_current_state(TASK_RUNNING);
> @@ -5271,11 +5215,6 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
> if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
> __tcp_ack_snd_check(sk, 0);
> no_ack:
> -#ifdef CONFIG_NET_DMA
> - if (copied_early)
> - __skb_queue_tail(&sk->sk_async_wait_queue, skb);
> - else
> -#endif
> if (eaten)
> kfree_skb_partial(skb, fragstolen);
> sk->sk_data_ready(sk, 0);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 59a6f8b90cd9..dc92ba9d0350 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -72,7 +72,6 @@
> #include <net/inet_common.h>
> #include <net/timewait_sock.h>
> #include <net/xfrm.h>
> -#include <net/netdma.h>
> #include <net/secure_seq.h>
> #include <net/tcp_memcontrol.h>
> #include <net/busy_poll.h>
> @@ -2000,18 +1999,8 @@ process:
> bh_lock_sock_nested(sk);
> ret = 0;
> if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> - struct tcp_sock *tp = tcp_sk(sk);
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> - if (tp->ucopy.dma_chan)
> + if (!tcp_prequeue(sk, skb))
> ret = tcp_v4_do_rcv(sk, skb);
> - else
> -#endif
> - {
> - if (!tcp_prequeue(sk, skb))
> - ret = tcp_v4_do_rcv(sk, skb);
> - }
> } else if (unlikely(sk_add_backlog(sk, skb,
> sk->sk_rcvbuf + sk->sk_sndbuf))) {
> bh_unlock_sock(sk);
> @@ -2170,11 +2159,6 @@ void tcp_v4_destroy_sock(struct sock *sk)
> }
> #endif
>
> -#ifdef CONFIG_NET_DMA
> - /* Cleans up our sk_async_wait_queue */
> - __skb_queue_purge(&sk->sk_async_wait_queue);
> -#endif
> -
> /* Clean prequeue, it must be empty really */
> __skb_queue_purge(&tp->ucopy.prequeue);
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 0740f93a114a..e27972590379 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -59,7 +59,6 @@
> #include <net/snmp.h>
> #include <net/dsfield.h>
> #include <net/timewait_sock.h>
> -#include <net/netdma.h>
> #include <net/inet_common.h>
> #include <net/secure_seq.h>
> #include <net/tcp_memcontrol.h>
> @@ -1504,18 +1503,8 @@ process:
> bh_lock_sock_nested(sk);
> ret = 0;
> if (!sock_owned_by_user(sk)) {
> -#ifdef CONFIG_NET_DMA
> - struct tcp_sock *tp = tcp_sk(sk);
> - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
> - tp->ucopy.dma_chan = net_dma_find_channel();
> - if (tp->ucopy.dma_chan)
> + if (!tcp_prequeue(sk, skb))
> ret = tcp_v6_do_rcv(sk, skb);
> - else
> -#endif
> - {
> - if (!tcp_prequeue(sk, skb))
> - ret = tcp_v6_do_rcv(sk, skb);
> - }
> } else if (unlikely(sk_add_backlog(sk, skb,
> sk->sk_rcvbuf + sk->sk_sndbuf))) {
> bh_unlock_sock(sk);
> diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
> index 7b01b9f5846c..e1b46709f8d6 100644
> --- a/net/llc/af_llc.c
> +++ b/net/llc/af_llc.c
> @@ -838,7 +838,7 @@ static int llc_ui_recvmsg(struct kiocb *iocb, struct socket *sock,
>
> if (!(flags & MSG_PEEK)) {
> spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> - sk_eat_skb(sk, skb, false);
> + sk_eat_skb(sk, skb);
> spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> *seq = 0;
> }
> @@ -860,10 +860,10 @@ copy_uaddr:
> llc_cmsg_rcv(msg, skb);
>
> if (!(flags & MSG_PEEK)) {
> - spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> - sk_eat_skb(sk, skb, false);
> - spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> - *seq = 0;
> + spin_lock_irqsave(&sk->sk_receive_queue.lock, cpu_flags);
> + sk_eat_skb(sk, skb);
> + spin_unlock_irqrestore(&sk->sk_receive_queue.lock, cpu_flags);
> + *seq = 0;
> }
>
> goto out;
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: TI CPSW Ethernet Tx performance regression
From: Florian Fainelli @ 2014-01-15 21:21 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Mugunthan V N, netdev
In-Reply-To: <1389808467.11912.9.camel@bwh-desktop.uk.level5networks.com>
2014/1/15 Ben Hutchings <bhutchings@solarflare.com>:
> On Wed, 2014-01-15 at 18:18 +0530, Mugunthan V N wrote:
>> Hi
>>
>> I am seeing a performance regression with CPSW driver on AM335x EVM. AM335x EVM
>> CPSW has 3.2 kernel support [1] and Mainline support from 3.7. When I am
>> comparing the performance between 3.2 and 3.13-rc4. TCP receive performance of
>> CPSW between 3.2 and 3.13-rc4 is same (~180Mbps) but TCP Transmit performance
>> is poor comparing to 3.2 kernel. In 3.2 kernel is it *256Mbps* and in 3.13-rc4
>> it is *70Mbps*
>>
>> Iperf version is *iperf version 2.0.5 (08 Jul 2010) pthreads* on both PC and EVM
>>
>> On UDP transmit also performance is down comparing to 3.2 kernel. In 3.2 it is
>> 196Mbps for 200Mbps band width and in 3.13-rc4 it is 92Mbps
>>
>> Can someone point me out where can I look for improving Tx performance. I also
>> checked whether there is Tx descriptor over flow and there is none. I have
>> tries 3.11 and some older kernel, all are giving ~75Mbps Transmit performance
>> only.
>>
>> [1] - http://arago-project.org/git/projects/?p=linux-am33x.git;a=summary
>
> If you don't get any specific suggestions, you could try bisecting to
> find out which specific commit(s) changed the performance.
Not necessarily related to that issue, but there are a few
weird/unusual things done in the CPSW interrupt handler:
static irqreturn_t cpsw_interrupt(int irq, void *dev_id)
{
	struct cpsw_priv *priv = dev_id;
	cpsw_intr_disable(priv);
	if (priv->irq_enabled == true) {
		cpsw_disable_irq(priv);
		priv->irq_enabled = false;
	}
	if (netif_running(priv->ndev)) {
		napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}
Checking for netif_running() should not be required, you should not
get any TX/RX interrupts if your interface is not running.
	priv = cpsw_get_slave_priv(priv, 1);
	if (!priv)
		return IRQ_NONE;
Shouldn't this be moved up to be the very first check? Isn't there a
risk of leaving the interrupts disabled, and never re-enabled, because
of the first 5 lines at the top?
	if (netif_running(priv->ndev)) {
		napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}
This was already done above, why do it again?
drivers/net/ethernet/ti/davinci_cpdma.c::cpdma_chan_process() treats an
error while processing a packet (it stops there) the same as
successfully processing num_tx packets, is that also intentional?
Should you attempt to keep processing up to "quota" packets?
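One way to read the last question is sketched below as a user-space model. It is not the cpdma code: poll_up_to_quota(), struct poll_result and fail_on_third() are hypothetical names made up for this example, and counting a failed packet against the quota is an assumption. The loop records an error and moves on instead of stopping at the first failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-packet handler: returns 0 on success, negative on error. */
typedef int (*process_one_fn)(int index);

struct poll_result {
	int processed;	/* packets handled successfully */
	int errors;	/* packets that failed but did not stop the loop */
};

/*
 * Walk at most 'quota' of the 'pending' packets. A failing packet is
 * counted and skipped rather than ending the poll early.
 */
static struct poll_result poll_up_to_quota(process_one_fn process_one,
					   int pending, int quota)
{
	struct poll_result r = { 0, 0 };
	int i;

	assert(process_one != NULL);

	for (i = 0; i < pending && i < quota; i++) {
		if (process_one(i) < 0)
			r.errors++;
		else
			r.processed++;
	}
	return r;
}

/* Example handler for the model: only the third packet fails. */
static int fail_on_third(int index)
{
	return index == 2 ? -1 : 0;
}
```

With ten packets pending and a quota of five, poll_up_to_quota(fail_on_third, 10, 5) visits five packets, four of which succeed. Whether a real driver should continue after an error depends on what the error means for the descriptor ring, which this model does not capture.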
As Ben suggests, bisecting what is causing the regression is your best bet here.
--
Florian
^ permalink raw reply
* Re: [PATCH v3 1/4] net_dma: simple removal
From: Dan Williams @ 2014-01-15 21:31 UTC (permalink / raw)
To: saeed bishara
Cc: dmaengine@vger.kernel.org, Alexander Duyck, Dave Jiang,
Vinod Koul, netdev@vger.kernel.org, David Whipple, lkml,
David S. Miller
In-Reply-To: <CAMAG_eduH4M2OPVh-R4Q6KG1DDcinEDzC-fQyXj1mLdZG=49hw@mail.gmail.com>
On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
> Hi Dan,
>
> I'm using net_dma on my system and I achieve meaningful performance
> boost when running Iperf receive.
>
> As far as I know the net_dma is used by many embedded systems out
> there and might effect their performance.
> Can you please elaborate on the exact scenario that cause the memory corruption?
>
> Is the scenario mentioned here caused by "real life" application or
> this is more of theoretical issue found through manual testing, I was
> trying to find the thread describing the failing scenario and couldn't
> find it, any pointer will be appreciated.
Did you see the referenced commit?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c
This is a real issue in that any app that calls fork() while receiving
data can cause the dma data to be lost. The problem is that the copy
operation falls back to cpu at many locations. Any one of those
instances could touch a mapped page and trigger a copy-on-write event.
The dma then completes to the wrong location.
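The copy-on-write step itself can be observed from user space. The sketch below assumes a POSIX system and only illustrates that a write after fork() lands in a private copy of the page; it involves no DMA and is not the kernel code path. The function name is made up for this example.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the parent's buffer is unchanged after the child wrote to
 * its own view of the same buffer, 0 if it changed, -1 on error.
 */
static int parent_buffer_survives_child_write(void)
{
	static char buf[16] = "parent";
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: this write triggers copy-on-write for the page. */
		strcpy(buf, "child");
		_exit(0);
	}

	/* Parent: wait for the child, then look at our own copy. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return strcmp(buf, "parent") == 0;
}
```

After the fault, the same virtual address is backed by different physical pages in the two processes. A transfer that was set up against the page pinned before the fork has no way to follow that change, which is the failure described above.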
--
Dan
^ permalink raw reply
* Re: [PATCH v3 1/4] net_dma: simple removal
From: Dan Williams @ 2014-01-15 21:33 UTC (permalink / raw)
To: saeed bishara
Cc: dmaengine@vger.kernel.org, Alexander Duyck, Dave Jiang,
Vinod Koul, netdev@vger.kernel.org, David Whipple, lkml,
David S. Miller
In-Reply-To: <CAPcyv4hzNT15R41zSOM98f-0aQ60HBZoD_DAvf6VED7iXoCZ8w@mail.gmail.com>
On Wed, Jan 15, 2014 at 1:31 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jan 15, 2014 at 1:20 PM, saeed bishara <saeed.bishara@gmail.com> wrote:
>> Hi Dan,
>>
>> I'm using net_dma on my system and I achieve meaningful performance
>> boost when running Iperf receive.
>>
>> As far as I know the net_dma is used by many embedded systems out
>> there and might effect their performance.
>> Can you please elaborate on the exact scenario that cause the memory corruption?
>>
>> Is the scenario mentioned here caused by "real life" application or
>> this is more of theoretical issue found through manual testing, I was
>> trying to find the thread describing the failing scenario and couldn't
>> find it, any pointer will be appreciated.
>
> Did you see the referenced commit?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=77873803363c
>
> This is a real issue in that any app that forks() while receiving data
> can cause the dma data to be lost. The problem is that the copy
> operation falls back to cpu at many locations. Any one of those
> instance could touch a mapped page and trigger a copy-on-write event.
> The dma completes to the wrong location.
>
Btw, do you have benchmark data showing that NET_DMA is beneficial on
these platforms? I would have expected worse performance on platforms
without i/o coherent caches.
^ permalink raw reply
* [PATCH] net/dt: Add support for overriding phy configuration from device tree
From: Matthew Garrett @ 2014-01-15 21:38 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: devicetree-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, kishon-l0cyMroinI0,
Matthew Garrett
Some hardware may be broken in interesting and board-specific ways, such
that various bits of functionality don't work. This patch provides a
mechanism for overriding mii registers during init based on the contents of
the device tree data, allowing board-specific fixups without having to
pollute generic code.
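The override is a masked read-modify-write of a PHY register. The sketch below models that in user space. struct fake_phy and apply_override() are hypothetical names for illustration, not part of this patch, and the extra "val & mask" guard is an assumption added here (the patch ORs val in directly).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PHY's first 16 MII registers. */
struct fake_phy {
	uint16_t regs[16];
};

/*
 * Clear the bits selected by 'mask', then set the bits from 'val'.
 * Bits outside 'mask' keep whatever the hardware default was.
 */
static int apply_override(struct fake_phy *phy, int reg,
			  uint16_t val, uint16_t mask)
{
	uint16_t regval;

	assert(phy != NULL);

	if (reg < 0 || reg >= 16)
		return -1;
	if (!mask)
		return 0;	/* no override requested for this register */

	regval = phy->regs[reg];
	regval &= (uint16_t)~mask;
	regval |= (uint16_t)(val & mask);
	phy->regs[reg] = regval;
	return 0;
}
```

For example, with register 4 holding 0x01e1, a mask of 0x0180 and a value of 0x0100 leave 0x0161: the two masked bits are rewritten and every other bit is preserved.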
Signed-off-by: Matthew Garrett <matthew.garrett-05XSO3Yj/JvQT0dZR+AlfA@public.gmane.org>
---
Documentation/devicetree/bindings/net/phy.txt | 13 +++
drivers/net/phy/phy_device.c | 29 +++++-
drivers/of/of_net.c | 124 ++++++++++++++++++++++++++
include/linux/of_net.h | 12 +++
4 files changed, 177 insertions(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index 7cd18fb..552a5e0 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -23,6 +23,19 @@ Optional Properties:
assume clause 22. The compatible list may also contain other
elements.
+The following properties may be added to either the phy node or the parent
+ethernet device:
+
+- phy-mii-advertise-10half: Whether to advertise half-duplex 10MBit
+- phy-mii-advertise-10full: Whether to advertise full-duplex 10MBit
+- phy-mii-advertise-100half: Whether to advertise half-duplex 100MBit
+- phy-mii-advertise-100full: Whether to advertise full-duplex 100MBit
+- phy-mii-advertise-100base4: Whether to advertise 100base4
+- phy-mii-advertise-1000half: Whether to advertise half-duplex 1000MBit
+- phy-mii-advertise-1000full: Whether to advertise full-duplex 1000MBit
+- phy-mii-as-master: Configure phy to act as master/slave
+- phy-mii-manual-master: Enable/disable manual master/slave configuration
+
Example:
ethernet-phy@0 {
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index d6447b3..91793bc 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -33,6 +33,7 @@
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/phy.h>
+#include <linux/of_net.h>
#include <asm/io.h>
#include <asm/irq.h>
@@ -497,6 +498,28 @@ void phy_disconnect(struct phy_device *phydev)
}
EXPORT_SYMBOL(phy_disconnect);
+int phy_override_from_of(struct phy_device *phydev)
+{
+ int reg, regval;
+ u16 val, mask;
+
+ /* Check for phy register overrides from OF */
+ for (reg = 0; reg < 16; reg++) {
+ if (!of_get_mii_register(phydev, reg, &val, &mask)) {
+ if (!mask)
+ continue;
+ regval = phy_read(phydev, reg);
+ if (regval < 0)
+ continue;
+ regval &= ~mask;
+ regval |= val;
+ phy_write(phydev, reg, regval);
+ }
+ }
+
+ return 0;
+}
+
int phy_init_hw(struct phy_device *phydev)
{
int ret;
@@ -508,7 +531,11 @@ int phy_init_hw(struct phy_device *phydev)
if (ret < 0)
return ret;
- return phydev->drv->config_init(phydev);
+ ret = phydev->drv->config_init(phydev);
+ if (ret < 0)
+ return ret;
+
+ return phy_override_from_of(phydev);
}
/**
diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c
index 8f9be2e..4545608 100644
--- a/drivers/of/of_net.c
+++ b/drivers/of/of_net.c
@@ -93,3 +93,127 @@ const void *of_get_mac_address(struct device_node *np)
return NULL;
}
EXPORT_SYMBOL(of_get_mac_address);
+
+/**
+ * Provide phy register overrides from the device tree. Some hardware may
+ * be broken in interesting and board-specific ways, so we want a mechanism
+ * for the board data to provide overrides for default values. This should be
+ * called during phy init.
+ */
+int of_get_mii_register(struct phy_device *phydev, int reg, u16 *val,
+ u16 *mask)
+{
+ u32 tmp;
+ struct device *dev = &phydev->dev;
+ struct device_node *np = dev->of_node;
+
+ *val = 0;
+ *mask = 0;
+
+ if (!np && dev->parent->of_node)
+ np = dev->parent->of_node;
+
+ if (!np)
+ return 0;
+
+ switch (reg) {
+ case MII_ADVERTISE:
+ if (!of_property_read_u32(np, "phy-mii-advertise-10half",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_10HALF;
+ phydev->advertising |= SUPPORTED_10baseT_Half;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_10baseT_Half);
+ }
+
+ *mask |= ADVERTISE_10HALF;
+ }
+ if (!of_property_read_u32(np, "phy-mii-advertise-10full",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_10FULL;
+ phydev->advertising |= SUPPORTED_10baseT_Full;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_10baseT_Full);
+ }
+
+ *mask |= ADVERTISE_10FULL;
+ }
+ if (!of_property_read_u32(np, "phy-mii-advertise-100half",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_100HALF;
+ phydev->advertising |= SUPPORTED_100baseT_Half;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_100baseT_Half);
+ }
+
+ *mask |= ADVERTISE_100HALF;
+ }
+ if (!of_property_read_u32(np, "phy-mii-advertise-100full",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_100FULL;
+ phydev->advertising |= SUPPORTED_100baseT_Full;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_100baseT_Full);
+ }
+
+ *mask |= ADVERTISE_100FULL;
+ }
+ if (!of_property_read_u32(np, "phy-mii-advertise-100base4",
+ &tmp)) {
+ if (tmp)
+ *val |= ADVERTISE_100BASE4;
+ *mask |= ADVERTISE_100BASE4;
+ }
+ break;
+ case MII_CTRL1000:
+ if (!of_property_read_u32(np, "phy-mii-advertise-1000full",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_1000FULL;
+ phydev->advertising |= SUPPORTED_1000baseT_Full;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_1000baseT_Full);
+ }
+
+ *mask |= ADVERTISE_1000FULL;
+ }
+ if (!of_property_read_u32(np, "phy-mii-advertise-1000half",
+ &tmp)) {
+ if (tmp) {
+ *val |= ADVERTISE_1000HALF;
+ phydev->advertising |= SUPPORTED_1000baseT_Half;
+ } else {
+ phydev->advertising &=
+ ~(SUPPORTED_1000baseT_Half);
+ }
+
+ *mask |= ADVERTISE_1000HALF;
+ }
+ if (!of_property_read_u32(np, "phy-mii-as-master",
+ &tmp)) {
+ if (tmp)
+ *val |= CTL1000_AS_MASTER;
+ *mask |= CTL1000_AS_MASTER;
+ }
+ if (!of_property_read_u32(np, "phy-mii-manual-master",
+ &tmp)) {
+ if (tmp)
+ *val |= CTL1000_ENABLE_MASTER;
+ *mask |= CTL1000_ENABLE_MASTER;
+ }
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(of_get_mii_register);
diff --git a/include/linux/of_net.h b/include/linux/of_net.h
index 34597c8..2e478bc 100644
--- a/include/linux/of_net.h
+++ b/include/linux/of_net.h
@@ -7,10 +7,14 @@
#ifndef __LINUX_OF_NET_H
#define __LINUX_OF_NET_H
+#include <linux/phy.h>
+
#ifdef CONFIG_OF_NET
#include <linux/of.h>
extern int of_get_phy_mode(struct device_node *np);
extern const void *of_get_mac_address(struct device_node *np);
+extern int of_get_mii_register(struct phy_device *np, int reg, u16 *val,
+ u16 *mask);
#else
static inline int of_get_phy_mode(struct device_node *np)
{
@@ -21,6 +25,14 @@ static inline const void *of_get_mac_address(struct device_node *np)
{
return NULL;
}
+static inline int of_get_mii_register(struct phy_device *np, int reg, u16 *val,
+ u16 *mask)
+{
+ *val = 0;
+ *mask = 0;
+
+ return -EINVAL;
+}
#endif
#endif /* __LINUX_OF_NET_H */
--
1.8.4.2
^ permalink raw reply related
* Re: [PATCH 1/4] ksz884x: delete useless variable
From: David Miller @ 2014-01-15 21:43 UTC (permalink / raw)
To: Julia.Lawall; +Cc: netdev, kernel-janitors, linux-kernel
In-Reply-To: <1389629847-5330-2-git-send-email-Julia.Lawall@lip6.fr>
From: Julia Lawall <Julia.Lawall@lip6.fr>
Date: Mon, 13 Jan 2014 17:17:24 +0100
> From: Julia Lawall <Julia.Lawall@lip6.fr>
>
> Delete a variable that is at most only assigned to a constant, but never
> used otherwise. In this code, it is the variable result that is used for
> the return code, not rc.
>
> A simplified version of the semantic patch that fixes this problem is as
> follows: (http://coccinelle.lip6.fr/)
...
> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Applied, thank you.
^ permalink raw reply
* Re: [PATCH net-next] net: make dev_set_mtu() honor notification return code
From: David Miller @ 2014-01-15 21:48 UTC (permalink / raw)
To: vfalico; +Cc: netdev, jiri, edumazet, alexander.h.duyck, nicolas.dichtel
In-Reply-To: <20140114121354.GI4132@redhat.com>
From: Veaceslav Falico <vfalico@redhat.com>
Date: Tue, 14 Jan 2014 13:13:54 +0100
> As, currently, only team can signal NOTIFY_BAD on mtu change, it's
> really easy to implement. What do you think?
Looks great, and I agree that RTNL should solve all the other issues.
^ permalink raw reply
* Re: [PATCH RFC 0/9]net: stmmac PM related fixes.
From: David Miller @ 2014-01-15 21:49 UTC (permalink / raw)
To: srinivas.kandagatla; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <52D50C02.3040708@st.com>
From: srinivas kandagatla <srinivas.kandagatla@st.com>
Date: Tue, 14 Jan 2014 10:05:54 +0000
> Do you have any plans to take this series?
>
> Peppe already Acked these series.
>
> Please let me know if you want me to rebase these patches to a
> particular branch.
Please respin them against net-next and be sure to include Peppe's
ACKs.
^ permalink raw reply
* Re: [PATCH v4 0/2] ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
From: David Miller @ 2014-01-15 21:53 UTC (permalink / raw)
To: jiri; +Cc: thaller, hannes, netdev, stephen, dcbw
In-Reply-To: <20140113153110.GA2499@minipsycho.orion>
From: Jiri Pirko <jiri@resnulli.us>
Date: Mon, 13 Jan 2014 16:31:10 +0100
> Sat, Jan 11, 2014 at 12:10:30AM CET, davem@davemloft.net wrote:
>>From: Thomas Haller <thaller@redhat.com>
>>Date: Thu, 9 Jan 2014 01:30:02 +0100
>>
>>> v1 -> v2: add a second commit, handling NOPREFIXROUTE in ip6_del_addr.
>>> v2 -> v3: reword commit messages, code comments and some refactoring.
>>> v3 -> v4: refactor, rename variables, add enum
>>>
>>> Thomas Haller (2):
>>> ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of
>>> IP6 routes
>>> ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE
>>
>>Series applied, thanks Thomas.
>
> Hi Dave. Have you pushed this already? I can't see these patches in
> net-next.
Sorry, I must have forgotten to push these changes out before I travelled
on Saturday.
Please respin and resubmit and I'll make sure they get integrated properly.
Thanks.
^ permalink raw reply
* Re: [PATCH 1/2 v3] ixgbe: define IXGBE_MAX_VFS_DRV_LIMIT macro and cleanup const 63
From: Brown, Aaron F @ 2014-01-15 22:00 UTC (permalink / raw)
To: ethan.kernel@gmail.com
Cc: Kirsher, Jeffrey T, Brandeburg, Jesse, Allan, Bruce W,
Wyborny, Carolyn, davem@davemloft.net,
e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <1389795148-341-1-git-send-email-ethan.kernel@gmail.com>
On Wed, 2014-01-15 at 22:12 +0800, Ethan Zhao wrote:
> Because ixgbe driver limit the max number of VF functions could be enabled
> to 63, so define one macro IXGBE_MAX_VFS_DRV_LIMIT and cleanup the const 63
> in code.
>
> v2: fix a typo.
> v3: fix a encoding issue.
>
> Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
> ---
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
> drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 5 +++--
> drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h | 5 +++++
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 0ade0cd..47e9b44 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -4818,7 +4818,7 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter)
> #ifdef CONFIG_PCI_IOV
> /* assign number of SR-IOV VFs */
> if (hw->mac.type != ixgbe_mac_82598EB)
> - adapter->num_vfs = (max_vfs > 63) ? 0 : max_vfs;
Unfortunately the if statement got changed considerably with a recent
commit:
commit 170e85430bcbe4d18e81b5a70bb163c741381092
ixgbe: add warning when max_vfs is out of range.
And the pattern no longer exists to make a match. In other words, this
patch no longer applies to net-next and I have to ask you for yet
another spin if you still want to squash the magic number.
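For reference, the cleanup being asked for amounts to something like the following sketch. The macro name comes from the patch subject; its value of 63 is assumed from the constant in the quoted line, and sanitize_num_vfs() is a hypothetical helper, not a function in the driver.

```c
/* Assumed value, taken from the literal 63 in the quoted line. */
#define IXGBE_MAX_VFS_DRV_LIMIT 63

/*
 * Same logic as the quoted statement, with the magic number named:
 * an out-of-range request disables SR-IOV VFs instead of clamping.
 */
static unsigned int sanitize_num_vfs(unsigned int max_vfs)
{
	return (max_vfs > IXGBE_MAX_VFS_DRV_LIMIT) ? 0 : max_vfs;
}
```

The respin would need to apply the same substitution to whatever form the check takes after the commit mentioned above.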
Thanks,
Aaron
^ permalink raw reply
* Re: [PATCH v5 net-next 2/4] sh_eth: Add support for r7s72100
From: Sergei Shtylyov @ 2014-01-15 22:26 UTC (permalink / raw)
To: Simon Horman, David S. Miller, netdev, linux-sh
Cc: linux-arm-kernel, Magnus Damm
In-Reply-To: <1389766341-14001-3-git-send-email-horms+renesas@verge.net.au>
Hello.
On 01/15/2014 09:12 AM, Simon Horman wrote:
> This is a fast ethernet controller.
I have to say this patch description is not exact enough: R7S72100 is not
an Ethernet controller itself, it's a SoC containing the Ethernet controller.
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
> ---
> v5
> * As suggested by Sergei Shtylyov
> - Do not use sh_eth_chip_reset_r8a7740 as it accesses non-existent
> RMII registers. Instead use sh_eth_chip_reset.
> - Do not use sh_eth_set_rate_gether as it accesses non-existent registers.
> - Do not use reserved LCHNG bit of ECSR
> - Do not use reserved LCHNGIP bit of ECSIPR
> - Document that R8A779x also needs a 16 bit shift of the RFS bits
> - Do not document that the R7S72100 has GECMR, it does not
The above change list was moved from the v2 section and doesn't match the
real changes done in v5. ;-)
> v4
> * As requested by David Miller
> - Use a boolean for the return value of sh_eth_is_rz_fast_ether()
> - Correct coding style in sh_eth_get_stats()
> v3
> * No change
> v2
> * As suggested by Magnus Damm and Sergei Shtylyov
> - r7s72100 ethernet is not gigabit so do not refer to it as such
> * As suggested by Magnus Damm
> - As RZ specific register layout rather than using the gigabit layout
> which includes registers that do not exist on this chip.
> ---
> drivers/net/ethernet/renesas/sh_eth.c | 126 ++++++++++++++++++++++++++++++++--
> drivers/net/ethernet/renesas/sh_eth.h | 3 +-
> 2 files changed, 121 insertions(+), 8 deletions(-)
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index 4f5cfad..a7a0555 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -190,6 +190,64 @@ static const u16 sh_eth_offset_fast_rcar[SH_ETH_MAX_REGISTER_OFFSET] = {
> [TRIMD] = 0x027c,
> };
>
> +static const u16 sh_eth_offset_fast_rz[SH_ETH_MAX_REGISTER_OFFSET] = {
Shouldn't this map precede the R-Car one, since this SoC is newer, the same
way you've reordered the *enum* values, etc.? Sorry for not noticing in the
previous review...
[...]
> + [ARSTR] = 0x0000,
> + [TSU_CTRST] = 0x0004,
> + [TSU_VTAG0] = 0x0058,
> + [TSU_ADSBSY] = 0x0060,
> + [TSU_TEN] = 0x0064,
> + [TXNLCR0] = 0x0080,
> + [TXALCR0] = 0x0084,
> + [RXNLCR0] = 0x0088,
> + [RXALCR0] = 0x008C,
Well, the above counter register subgroup stands out from the TSU_*
registers in the Gigabit mapping, not sure if we should follow that. These
registers are not currently used anyway...
> + [TSU_ADRH0] = 0x0100,
> + [TSU_ADRL0] = 0x0104,
> + [TSU_ADRH31] = 0x01f8,
> + [TSU_ADRL31] = 0x01fc,
> +};
> +
> static const u16 sh_eth_offset_fast_sh4[SH_ETH_MAX_REGISTER_OFFSET] = {
> [ECMR] = 0x0100,
> [RFLR] = 0x0108,
> @@ -318,6 +376,14 @@ static bool sh_eth_is_gether(struct sh_eth_private *mdp)
> return false;
> }
>
> +static bool sh_eth_is_rz_fast_ether(struct sh_eth_private *mdp)
> +{
> + if (mdp->reg_offset == sh_eth_offset_fast_rz)
> + return true;
> + else
> + return false;
Perhaps you should compress the above functions to one-liners as Joe has
suggested. Or I/you could do it in a separate patch...
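The one-liner form would look roughly like the sketch below. The types and tables are cut-down, hypothetical stand-ins so that it compiles on its own; only the shape of the return statement is the point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, shortened stand-ins for the driver's offset tables. */
static const uint16_t offset_fast_rz[2] = { 0x0000, 0x0004 };
static const uint16_t offset_gigabit[2] = { 0x0200, 0x0204 };

/* Hypothetical stand-in for the driver's private struct. */
struct eth_private {
	const uint16_t *reg_offset;
};

/* The pointer comparison already yields a boolean, so return it directly. */
static bool is_rz_fast_ether(const struct eth_private *mdp)
{
	return mdp->reg_offset == offset_fast_rz;
}
```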
> +}
> +
> static void sh_eth_select_mii(struct net_device *ndev)
> {
> u32 value = 0x0;
[...]
> @@ -1309,9 +1409,9 @@ static int sh_eth_rx(struct net_device *ndev, u32 intr_status, int *quota)
>
> /* In case of almost all GETHER/ETHERs, the Receive Frame State
> * (RFS) bits in the Receive Descriptor 0 are from bit 9 to
> - * bit 0. However, in case of the R8A7740's GETHER, the RFS
> - * bits are from bit 25 to bit 16. So, the driver needs right
> - * shifting by 16.
> + * bit 0. However, in case of the R8A7740, R8A779x and
Small nit: comma needed before "and" as far as I know English grammar.
> + * R7S72100 the RFS bits are from bit 25 to bit 16. So, the
> + * driver needs right shifting by 16.
> */
> if (mdp->cd->shift_rd0)
> desc_status >>= 16;
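The effect of the shift in the quoted hunk can be sketched in isolation. rfs_bits() is a hypothetical helper, and the 10-bit mask is an assumption derived from "bit 9 to bit 0" in the comment; the quoted driver code only performs the shift.

```c
#include <stdint.h>

/*
 * Return the Receive Frame State bits of descriptor word 0.
 * When shift_rd0 is set the bits live in 25..16 and are moved
 * down to 9..0 first.
 */
static uint32_t rfs_bits(uint32_t desc_status, int shift_rd0)
{
	if (shift_rd0)
		desc_status >>= 16;
	return desc_status & 0x3ff;	/* keep bits 9..0 */
}
```

Both layouts then yield the same value for the same frame state, so the code after the shift does not need to know which chip it runs on.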
Other than that, this looks fine now, you can add my:
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
WBR, Sergei
^ permalink raw reply
* Re: [PATCH][net-next] gianfar: Fix portabilty issues for ethtool and ptp
From: David Miller @ 2014-01-15 22:39 UTC (permalink / raw)
To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1389706500-1990-1-git-send-email-claudiu.manoil@freescale.com>
From: Claudiu Manoil <claudiu.manoil@freescale.com>
Date: Tue, 14 Jan 2014 15:35:00 +0200
> Fixes unhandled register write in gianfar_ethtool.c.
> Fixes following endianess related functional issues,
> reported by sparse as well, i.e.:
>
> gianfar_ethtool.c:1058:33: warning:
> incorrect type in argument 1 (different base types)
> expected unsigned int [unsigned] [usertype] value
> got restricted __be32 [usertype] ip4src
>
> gianfar_ethtool.c:1164:33: warning:
> restricted __be16 degrades to integer
>
> gianfar_ethtool.c:1669:32: warning:
> invalid assignment: ^=
> left side has type restricted __be16
> right side has type int
>
> Solves all the sparse warnings for mixig normal pointers
> with __iomem pointers for gianfar_ptp.c, i.e.:
> gianfar_ptp.c:163:32: warning:
> incorrect type in argument 1 (different address spaces)
> expected unsigned int [noderef] <asn:2>*addr
> got unsigned int *<noident>
>
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Applied, thank you.
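The first sparse warning quoted above is about passing a big-endian field where a host-order integer is expected. A user-space analogue is sketched below, assuming a POSIX system; in the kernel the conversion would be done with helpers such as be32_to_cpu(), and host_from_be32_bytes() is a hypothetical name for this example.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 32-bit big-endian field (for example an IPv4 address as it
 * appears on the wire) and return it in host byte order.
 */
static uint32_t host_from_be32_bytes(const unsigned char wire[4])
{
	uint32_t be_value;

	memcpy(&be_value, wire, sizeof(be_value));
	return ntohl(be_value);	/* explicit big-endian to host conversion */
}
```

Making the conversion explicit gives the same result on little-endian and big-endian hosts, which is the portability issue the patch title refers to.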
^ permalink raw reply