Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next PATCH V2] net: frag queue per hash bucket locking
From: Jesper Dangaard Brouer @ 2013-04-04  7:52 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller
  Cc: Jesper Dangaard Brouer, netdev, Florian Westphal, Daniel Borkmann,
	Hannes Frederic Sowa
In-Reply-To: <1365027060.12728.30.camel@localhost>

This patch implements per hash bucket locking for the frag queue
hash.  This removes two write locks, and the only remaining write
lock is for protecting hash rebuild.  This essentially reduce the
readers-writer lock to a rebuild lock.

V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
  need the spinlock _bh versions, as Netfilter currently does a
  local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail

This patch is part of "net: frag performance followup"
 http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:

Same test setup as previous:
 (http://thread.gmane.org/gmane.linux.network/257155)
 Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
 Ethernet flow-control.  A third interface is used for generating the
 DoS attack (with trafgen).

Notice, I have changed the frag DoS generator script to be more
efficient/deadly.  Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.

Test types summary (netperf UDP_STREAM):
 Test-20G64K     == 2x10G with 65K fragments
 Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
 Test-20G64K+DoS == Same as 20G64K with frag DoS
 Test-20G3F+DoS  == Same as 20G3F  with frag DoS
 Test-20G64K+MQ  == Same as 20G64K with Multi-Queue frag DoS
 Test-20G3F+MQ   == Same as 20G3F  with Multi-Queue frag DoS

When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression.  BUT this
was caused by some unrelated change in-between.  See tests below.

Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch

Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.

Performance table summary (in Mbit/s):

(#) Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
    ----------  -------   -------  ----------  ---------  --------  -------
(A) Patch-02  : 18848.7   13230.1   4103.04     5310.36     130.0    440.2
(B) 1b5ab0de  : 18841.5   13156.8   4101.08     5314.57     129.0    424.2
(C) Patch-03v1: 18838.0   13490.5   4405.11     6814.72     196.6    461.6

(D) a210576c  : 18321.5   11250.4   3635.34     5160.13     119.1    405.2
(E) with _bh  : 17247.3   11492.6   3994.74     6405.29     166.7    413.6
(F) without bh: 17471.3   11298.7   3818.05     6102.11     165.7    406.3

Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.

I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried).  But the other results does show
improvements.  And test (E) "with _bh" version is slightly better.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h  |    9 ++++++-
 net/ipv4/inet_fragment.c |   60 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7cac9c5..c4f5183 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -50,10 +50,17 @@ struct inet_frag_queue {
  */
 #define INETFRAGS_MAXDEPTH		128
 
+struct inet_frag_bucket {
+	struct hlist_head	chain;
+	spinlock_t		chain_lock;
+	u16			chain_len;
+};
+
 struct inet_frags {
-	struct hlist_head	hash[INETFRAGS_HASHSZ];
+	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
 	/* This rwlock is a global lock (seperate per IPv4, IPv6 and
 	 * netfilter). Important to keep this on a seperate cacheline.
+	 * Its primarily a rebuild protection rwlock.
 	 */
 	rwlock_t		lock ____cacheline_aligned_in_smp;
 	int			secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1206ca6..2bf15e8 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -52,20 +52,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 	unsigned long now = jiffies;
 	int i;
 
+	/* Per bucket lock NOT needed here, due to write lock protection */
 	write_lock(&f->lock);
+
 	get_random_bytes(&f->rnd, sizeof(u32));
 	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb;
 		struct inet_frag_queue *q;
 		struct hlist_node *n;
 
-		hlist_for_each_entry_safe(q, n, &f->hash[i], list) {
+		hb = &f->hash[i];
+		hlist_for_each_entry_safe(q, n, &hb->chain, list) {
 			unsigned int hval = f->hashfn(q);
 
 			if (hval != i) {
+				struct inet_frag_bucket *hb_dest;
+
 				hlist_del(&q->list);
 
 				/* Relink to new hash chain. */
-				hlist_add_head(&q->list, &f->hash[hval]);
+				hb_dest = &f->hash[hval];
+				hlist_add_head(&q->list, &hb_dest->chain);
 			}
 		}
 	}
@@ -78,9 +85,13 @@ void inet_frags_init(struct inet_frags *f)
 {
 	int i;
 
-	for (i = 0; i < INETFRAGS_HASHSZ; i++)
-		INIT_HLIST_HEAD(&f->hash[i]);
+	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb = &f->hash[i];
 
+		spin_lock_init(&hb->chain_lock);
+		INIT_HLIST_HEAD(&hb->chain);
+		hb->chain_len = 0;
+	}
 	rwlock_init(&f->lock);
 
 	f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -122,9 +133,19 @@ EXPORT_SYMBOL(inet_frags_exit_net);
 
 static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
-	write_lock(&f->lock);
+	struct inet_frag_bucket *hb;
+	unsigned int hash;
+
+	read_lock(&f->lock);
+	hash = f->hashfn(fq);
+	hb = &f->hash[hash];
+
+	spin_lock(&hb->chain_lock);
 	hlist_del(&fq->list);
-	write_unlock(&f->lock);
+	hb->chain_len--;
+	spin_unlock(&hb->chain_lock);
+
+	read_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
 
@@ -226,27 +247,32 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		struct inet_frag_queue *qp_in, struct inet_frags *f,
 		void *arg)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *qp;
 #ifdef CONFIG_SMP
 #endif
 	unsigned int hash;
 
-	write_lock(&f->lock);
+	read_lock(&f->lock); /* Protects against hash rebuild */
 	/*
 	 * While we stayed w/o the lock other CPU could update
 	 * the rnd seed, so we need to re-calculate the hash
 	 * chain. Fortunatelly the qp_in can be used to get one.
 	 */
 	hash = f->hashfn(qp_in);
+	hb = &f->hash[hash];
+	spin_lock(&hb->chain_lock);
+
 #ifdef CONFIG_SMP
 	/* With SMP race we have to recheck hash table, because
 	 * such entry could be created on other cpu, while we
-	 * promoted read lock to write lock.
+	 * released the hash bucket lock.
 	 */
-	hlist_for_each_entry(qp, &f->hash[hash], list) {
+	hlist_for_each_entry(qp, &hb->chain, list) {
 		if (qp->net == nf && f->match(qp, arg)) {
 			atomic_inc(&qp->refcnt);
-			write_unlock(&f->lock);
+			spin_unlock(&hb->chain_lock);
+			read_unlock(&f->lock);
 			qp_in->last_in |= INET_FRAG_COMPLETE;
 			inet_frag_put(qp_in, f);
 			return qp;
@@ -258,8 +284,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		atomic_inc(&qp->refcnt);
 
 	atomic_inc(&qp->refcnt);
-	hlist_add_head(&qp->list, &f->hash[hash]);
-	write_unlock(&f->lock);
+	hlist_add_head(&qp->list, &hb->chain);
+	hb->chain_len++;
+	spin_unlock(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
 }
@@ -300,17 +328,23 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
 	int depth = 0;
 
-	hlist_for_each_entry(q, &f->hash[hash], list) {
+	hb = &f->hash[hash];
+
+	spin_lock(&hb->chain_lock);
+	hlist_for_each_entry(q, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
+			spin_unlock(&hb->chain_lock);
 			read_unlock(&f->lock);
 			return q;
 		}
 		depth++;
 	}
+	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
 
 	if (depth <= INETFRAGS_MAXDEPTH)

^ permalink raw reply related

* Re: [PATCH 2/2] af_unix: If we don't care about credentials coallesce all messages
From: dingtianhong @ 2013-04-04  7:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David S. Miller, Sven Joachim, Greg Kroah-Hartman, linux-kernel,
	stable, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
	Eric Dumazet
In-Reply-To: <87d2ubhwiw.fsf_-_@xmission.com>

On 2013/4/4 10:14, Eric W. Biederman wrote:
> 
> It was reported that the following LSB test case failed
> https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we
> were not coallescing unix stream messages when the application was
> expecting us to.
> 
> The problem was that the first send was before the socket was accepted
> and thus sock->sk_socket was NULL in maybe_add_creds, and the second
> send after the socket was accepted had a non-NULL value for sk->socket
> and thus we could tell the credentials were not needed so we did not
> bother.
> 
> The unnecessary credentials on the first message cause
> unix_stream_recvmsg to start verifying that all messages had the same
> credentials before coallescing and then the coallescing failed because
> the second message had no credentials.
> 
> Ignoring credentials when we don't care in unix_stream_recvmsg fixes a
> long standing pessimization which would fail to coallesce messages when
> reading from a unix stream socket if the senders were different even if
> we did not care about their credentials.
> 
> I have tested this and verified that the in the LSB test case mentioned
> above that the messages do coallesce now, while the were failing to
> coallesce without this change.
> 
> Reported-by: Karel Srot <ksrot@redhat.com>
> Reported-by: Ding Tianhong <dingtianhong@huawei.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  net/unix/af_unix.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index f153a8d..2db702d 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1993,7 +1993,7 @@ again:
>  			if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
>  			    (UNIXCB(skb).cred != siocb->scm->cred))
>  				break;
> -		} else {
> +		} else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
>  			/* Copy credentials */
>  			scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
>  			check_creds = 1;
> 

As your opinion, I think the way is better:

		if (test_bit(SOCK_PASSCRED, &sock->flags)) {
                        if (check_creds) {
                                /* Never glue messages from different writers */
                                if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
                                (UNIXCB(skb).cred != siocb->scm->cred))
                                        break;
                        } else {
                                /* Copy credentials */
                                scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
                                check_creds = 1;
                        }
                }

Ding

^ permalink raw reply

* Re: [Suggestion] ISDN: isdnloop: C grammar issue, '}' miss match 'if' and 'switch' statement.
From: Chen Gang @ 2013-04-04  8:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Michal Kubecek, Wu Fengguang, Karsten Keil,
	David Miller, netdev, Joe Perches
In-Reply-To: <CA+55aFw+Efs+um15_V6t-9rXd-U1VJS9p+_6E-m5jr87h-BPVg@mail.gmail.com>


  firstly, thank you very much for your details reply and your patch.

On 2013年04月03日 23:30, Linus Torvalds wrote:
> On Wed, Apr 3, 2013 at 7:31 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> And what is the problem exactly ?
> 
> The indentation does look completely broken.
> 
> It should still *work*, because case-statements don't actually care
> about nesting (you can use a case statement to jump into other control
> statements, the traditional example is the so-called "duff's device"),
> but I agree with Chen Gang that it looks wrong.
> 

  ok, thanks.


> I'm attaching a patch that would appear to fix the nesting, but I
> haven't actually tested it. Also, regardless of that patch, the code
> looks like complete and utter crap, because it sets the "i" variable
> in many of the case statements, and then doesn't actually *use* it.
> Finally, almost all of the case statements test for something like
> 
>     if (c->arg < ISDNLOOP_BCH) {
> 
> but if "c->arg" is out of range, it will then just break out of the
> switch statement and return 0, even though it looks like it should be
> an error.
> 

  really, it is.

> Of course, nobody sane actually cares about ISDN any more, so I think
> this is all pretty academic. I think even Germany (where ISDN *used*
> to be very common due to telephone monopolies and odd rules) no longer
> uses it. I can't imagine that anybody else does either.
> 

  can we delete it ?

  it will not provide contributes any more, but can waste other members'
time resources.

> But if somebody does care, and can validate my patch (if not by
> actually using it, then by at least looking at it more), feel free to
> take it and take my sign-off.
> 
>      Linus
> 

  for me, I suggest:
    if we can not delete it, we'd better to apply Linus' patch.

-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH v5 1/6] drivers: phy: add generic PHY framework
From: Kishon Vijay Abraham I @ 2013-04-04  8:56 UTC (permalink / raw)
  To: balbi-l0cyMroinI0
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, mchehab-H+wXaHxf7aLQT0dZR+AlfA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-lFZ/pmaqli7XmaaqVzeoHQ,
	javier-0uQlZySMnqxg9hUCZPvPmw, cesarb-PWySMVKUnqmsTnJN9+BGXg,
	eballetbo-Re5JQEeQqe8AvxtiuMwx3w,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	rob.herring-bsGFqQB8/DxBDgjK7y7TUQ,
	swarren-DDmLM1+adcrQT0dZR+AlfA,
	sylvester.nawrocki-Re5JQEeQqe8AvxtiuMwx3w,
	linux-omap-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	broonie-yzvPICuk2AATkU/dhu1WVueM+bqZidxxQQ4Iyu8u01E,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	santosh.shilimkar-l0cyMroinI0, netdev-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20130403154704.GD19093-S8G//mZuvNWo5Im9Ml3/Zg@public.gmane.org>

Hi,

On Wednesday 03 April 2013 09:17 PM, Felipe Balbi wrote:
> Hi,
>
> On Wed, Apr 03, 2013 at 08:02:52PM +0530, Kishon Vijay Abraham I wrote:
>>>>>> +		ret = -EINVAL;
>>>>>> +		goto err0;
>>>>>> +	}
>>>>>> +
>>>>>> +	if (!phy_class)
>>>>>> +		phy_core_init();
>>>>>
>>>>> why don't you setup the class on module_init ? Then this would be a
>>>>> terrible error condition here :-)
>>>>
>>>> This is for the case where the PHY driver gets loaded before the PHY
>>>> framework. I could have returned EPROBE_DEFER here instead I thought
>>>> will have it this way.
>>>
>>> looks a bit weird IMO. Is it really possible for PHY to load before ?
>>
>> yeah. it actually happened when I tried with beagle and had all the
>> modules as built-in. Because twl4030 has subsys_initcall(), it loads
>> before PHY framework.
>
> that's a bug in twl4030.

right. Will fix it.

Thanks
Kishon

^ permalink raw reply

* Re: [net-next PATCH V2] net: frag queue per hash bucket locking
From: Hannes Frederic Sowa @ 2013-04-04  9:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, netdev, Florian Westphal,
	Daniel Borkmann
In-Reply-To: <20130404075226.18493.75426.stgit@dragon>

On Thu, Apr 04, 2013 at 09:52:26AM +0200, Jesper Dangaard Brouer wrote:
> +struct inet_frag_bucket {
> +	struct hlist_head	chain;
> +	spinlock_t		chain_lock;
> +	u16			chain_len;
> +};
> +

I just noticed and wanted to ask for what chain_len is needed?  Could it
be dropped?

If the elements are swapped between the hash buckets in
inet_frag_secret_rebuild it seems you forgot to update chain_len
correctly.

Thanks,

  Hannes

^ permalink raw reply

* Re: [Suggestion] ISDN: isdnloop:  C grammar issue,  '}' miss match 'if' and 'switch' statement.
From: Chen Gang @ 2013-04-04  9:05 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: fengguang.wu, isdn, Linus Torvalds, David Miller, netdev,
	Joe Perches
In-Reply-To: <20130403140823.GB3401@unicorn.suse.cz>

On 2013年04月03日 22:08, Michal Kubecek wrote:
> On Wed, Apr 03, 2013 at 09:35:55PM +0800, Chen Gang wrote:
>> > 
>> > in drivers/isdn/isdnloop/isdnloop.c
>> > 
>> >   issue description:
>> >     it is in function 'isdnloop_command'.
>> >     it seems a C grammar issue for '}' miss match 'if' and 'switch' statement
>> >     please check the line 1243, 1265, 1341.
>> > 
>> >   building:
>> >     make allyesconfig, can not let it built.
>> >     in menuconfig, we (at least for me) can not let ISDN_DRV_LOOP = 'y' or 'm'.
>> >     is this module a waste module which should be deleted ?
>> > 
>> >   the related commit:
>> >     commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
>> >     Author: Linus Torvalds <torvalds@ppc970.osdl.org>
>> >     Date:   Sat Apr 16 15:20:36 2005 -0700
> As far as I can see, this rather comes from
> 
> commit 475be4d85a274d0961593db41cf85689db1d583c
> Author: Joe Perches <joe@perches.com>
> Date:   Sun Feb 19 19:52:38 2012 -0800

  Joe Perches only beautified the code, not change the contents.

  please check thanks.


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [PATCH v5 1/6] drivers: phy: add generic PHY framework
From: Kishon Vijay Abraham I @ 2013-04-04  9:21 UTC (permalink / raw)
  To: Sylwester Nawrocki
  Cc: balbi, gregkh, arnd, akpm, swarren, rob, netdev, davem, cesarb,
	linux-usb, linux-omap, linux-kernel, tony, grant.likely,
	rob.herring, b-cousson, linux, eballetbo, javier, mchehab,
	santosh.shilimkar, broonie, swarren, linux-doc,
	devicetree-discuss, linux-arm-kernel
In-Reply-To: <515CA337.1010504@gmail.com>

Hi,

On Thursday 04 April 2013 03:16 AM, Sylwester Nawrocki wrote:
> On 04/03/2013 02:53 PM, Kishon Vijay Abraham I wrote:
.
.
<snip>
.
.
>> diff --git a/Documentation/phy.txt b/Documentation/phy.txt
>> new file mode 100644
>> index 0000000..7785ec0
>> --- /dev/null
>> +++ b/Documentation/phy.txt
>> @@ -0,0 +1,113 @@
>> +                PHY SUBSYSTEM
>> +          Kishon Vijay Abraham I<kishon@ti.com>
>> +
>> +This document explains the Generic PHY Framework along with the APIs
>> provided,
>> +and how-to-use.
>> +
>> +1. Introduction
>> +
>> +*PHY* is the abbreviation for physical layer. It is used to connect a
>> device
>> +to the physical medium e.g., the USB controller has a PHY to provide
>> functions
>
> Shouldn't it be "...medium, e.g. the USB..." ?
>
>> +such as serialization, de-serialization, encoding, decoding and is
>> responsible
>> +for obtaining the required data transmission rate. Note that some USB
>> +controller has PHY functionality embedded into it and others use an
>> external
>
> "controllers have PHY functionality embedded into them" ?
>
>> +PHY. Other peripherals that uses a PHY include Wireless LAN, Ethernet,
>
> s/uses/use ?
>
>> +SATA etc.
>> +
>> +The intention of creating this framework is to bring the phy drivers
>> spread
>> +all over the Linux kernel to drivers/phy to increase code re-use and to
>> +increase code maintainability.
>> +
>> +This framework will be of use only to devices that uses external PHY
>> (PHY
>
> s/uses/use ?
>
>> +functionality is not embedded within the controller).
>> +
>> +2. Creating the PHY
>> +
>> +The PHY driver should create the PHY in order for other peripheral
>> controllers
>> +to make use of it. The PHY framework provides 2 APIs to create the PHY.
>> +
>> +struct phy *phy_create(struct device *dev, const char *label,
>> +    struct device_node *of_node, int type, struct phy_ops *ops,
>> +    void *priv);
>> +struct phy *devm_phy_create(struct device *dev, const char *label,
>> +    struct device_node *of_node, int type, struct phy_ops *ops,
>> +    void *priv);
>> +
>> +The PHY drivers can use one of the above 2 APIs to create the PHY by
>> passing
>> +the device pointer, label, device node, type, phy ops and a driver data.
>> +phy_ops is a set of function pointers for performing PHY operations
>> such as
>> +init, exit, suspend, resume, power_on and power_off.
>> +
>> +3. Binding the PHY to the controller
>> +
>> +The framework provides an API for binding the controller to the PHY
>> in the
>> +case of non dt boot.
>> +
>> +struct phy_bind *phy_bind(const char *dev_name, int index,
>> +                const char *phy_dev_name);
>> +
>> +The API fills the phy_bind structure with the dev_name (device name
>> of the
>> +controller), index and phy_dev_name (device name of the PHY). This will
>> +be used when the controller requests this phy. This API should be
>> used by
>> +platform specific initialization code (board file).
>> +
>> +In the case of dt boot, the binding information should be added in
>> the dt
>> +data of the controller.
>
> s/in the dt data of/in the device tree node of ?
>
>> +4. Getting a reference to the PHY
>> +
>> +Before the controller can make use of the PHY, it has to get a
>> reference to
>> +it. This framework provides 6 APIs to get a reference to the PHY.
>> +
>> +struct phy *phy_get(struct device *dev, int index);
>> +struct phy *devm_phy_get(struct device *dev, int index);
>> +struct phy *of_phy_get(struct device *dev, const char *phandle, int
>> index);
>> +struct phy *devm_of_phy_get(struct device *dev, const char *phandle,
>> int index);
>
> Why do we need separate functions for dt and non-dt ? Couldn't it be
> handled
> internally by the framework ? So the PHY users can use same calls for dt
> and non-dt, like in case of e.g. the regulators API ?

yeah. Indeed it makes sense to do it that way.
>
> Also signatures of some functions are different now:
>
> extern struct phy *phy_get(struct device *dev, int index);
> extern struct phy *devm_phy_get(struct device *dev, int index);
> extern struct phy *of_phy_get(struct device *dev, int index);
> extern struct phy *devm_of_phy_get(struct device *dev, int index);

My bad :-(

>
> BTW, I think "extern" could be dropped, does it have any significance in
> function declarations in header files ?

it shouldn't have any effect I guess. It's harmless nevertheless.

>
>> +struct phy *of_phy_get_byname(struct device *dev, const char *string);
>> +struct phy *devm_of_phy_get_byname(struct device *dev, const char
>> *string);
>> +
>> +phy_get and devm_phy_get can be used to get the PHY in non-dt boot.
>> This API
>> +uses the binding information added using the phy_bind API to find and
>> return
>> +the appropriate PHY. The only difference between the two APIs is that
>> +devm_phy_get associates the device with the PHY using devres on
>> successful PHY
>> +get. On driver detach, release function is invoked on the the devres
>> data and
>
> s/the the/the
>
>> +devres data is freed.
>> +
>> +of_phy_get and devm_of_phy_get can be used to get the PHY in dt boot.
>> These
>> +APIs take the phandle and index to get a reference to the PHY. The only
>> +difference between the two APIs is that devm_of_phy_get associates
>> the device
>> +with the PHY using devres on successful phy get. On driver detach,
>> release
>> +function is invoked on the devres data and it is freed.
>> +
>> +of_phy_get_byname and devm_of_phy_get_byname can also be used to get
>> the PHY
>> +in dt boot. It is same as the above API except that the user has to
>> pass the
>> +phy name as filled in "phy-names" phandle in dt data and the
>> framework will
>
> s/phandle/property ?
>
>> +find the index and get the PHY.
>> +
>> +5. Releasing a reference to the PHY
>> +
>> +When the controller no longer needs the PHY, it has to release the
>> reference
>> +to the PHY it has obtained using the APIs mentioned in the above
>> section. The
>> +PHY framework provides 2 APIS to release a reference to the PHY.
>
> s/APIS/APIs ?
>
>> +
>> +void phy_put(struct phy *phy);
>> +void devm_phy_put(struct device *dev, struct phy *phy);
>> +
>> +Both these APIs are used to release a reference to the PHY and
>> devm_phy_put
>> +destroys the devres associated with this PHY.
>> +
>> +6. Destroying the PHY
>> +
>> +When the driver that created the PHY is unloaded, it should destroy
>> the PHY it
>> +created using one of the following 2 APIs.
>> +
>> +void phy_destroy(struct phy *phy);
>> +void devm_phy_destroy(struct device *dev, struct phy *phy);
>> +
>> +Both these APIs destroys the PHY and devm_phy_destroy destroys the
>> devres
>
> s/APIs destroys/APIs destroy ?
>
>> +associated with this PHY.
>> +
>> +7. DeviceTree Binding
>> +
>> +The documentation for PHY dt binding can be found @
>> +Documentation/devicetree/bindings/phy/phy-bindings.txt
>
>> --- /dev/null
>> +++ b/drivers/phy/phy-core.c
>> @@ -0,0 +1,616 @@
>
>> +static struct phy *of_phy_lookup(struct device_node *node)
>> +{
>> +    struct phy *phy;
>> +    struct device *dev;
>> +    struct class_dev_iter iter;
>> +
>> +    class_dev_iter_init(&iter, phy_class, NULL, NULL);
>
> There is currently nothing preventing a call to this function before
> phy_class is initialized. It happened during my tests, and the nice
> stack dump showed that it was the PHY user attempting to get the PHY
> before the framework got initialized. So either phy_core_init should
> be a subsys_initcall or we need a better protection against phy_class
> being NULL or ERR_PTR in more places.

Whats the initcall used in your PHY user? Looks like more people prefer 
having module_init and any issue because of it should be fixed in PHY 
providers and PHY users.
>
>> +    while ((dev = class_dev_iter_next(&iter))) {
>> +        phy = container_of(dev, struct phy, dev);
>> +        if (node != phy->of_node)
>> +            continue;
>> +
>> +        class_dev_iter_exit(&iter);
>> +        return phy;
>> +    }
>> +
>> +    class_dev_iter_exit(&iter);
>> +    return ERR_PTR(-EPROBE_DEFER);
>> +}
>
>> +/**
>> + * of_phy_get() - lookup and obtain a reference to a phy by phandle
>> + * @dev: device that requests this phy
>> + * @index: the index of the phy
>> + *
>> + * Returns the phy associated with the given phandle value,
>> + * after getting a refcount to it or -ENODEV if there is no such phy or
>> + * -EPROBE_DEFER if there is a phandle to the phy, but the device is
>> + * not yet loaded.
>> + */
>> +struct phy *of_phy_get(struct device *dev, int index)
>> +{
>> +    int ret;
>> +    struct phy *phy = NULL;
>> +    struct phy_bind *phy_map = NULL;
>> +    struct of_phandle_args args;
>> +    struct device_node *node;
>> +
>> +    if (!dev->of_node) {
>> +        dev_dbg(dev, "device does not have a device node entry\n");
>> +        return ERR_PTR(-EINVAL);
>> +    }
>> +
>> +    ret = of_parse_phandle_with_args(dev->of_node, "phys", "#phy-cells",
>> +        index,&args);
>> +    if (ret) {
>> +        dev_dbg(dev, "failed to get phy in %s node\n",
>> +            dev->of_node->full_name);
>> +        return ERR_PTR(-ENODEV);
>> +    }
>> +
>> +    phy = of_phy_lookup(args.np);
>> +    if (IS_ERR(phy) || !try_module_get(phy->ops->owner)) {
>> +        phy = ERR_PTR(-EPROBE_DEFER);
>> +        goto err0;
>> +    }
>> +
>> +    phy = phy->ops->of_xlate(phy,&args);
>> +    if (IS_ERR(phy))
>> +        goto err1;
>> +
>> +    phy_map = phy_bind(dev_name(dev), index, dev_name(&phy->dev));
>> +    if (!IS_ERR(phy_map)) {
>> +        phy_map->phy = phy;
>> +        phy_map->auto_bind = true;
>
> Shouldn't it be done with the phy_bind_mutex held ? I guess an unlocked
> version of the phy_bind functions is needed, so it can be used internally.

The locking is done inside phy_bind function. I'm not sure but I vaguely 
remember getting a dump stack (incorrect mutex ordering or something) 
when trying to have phy_bind_mutex here. I can check it again.
>
>> +    }
>> +
>> +    get_device(&phy->dev);
>> +
>> +err1:
>> +    module_put(phy->ops->owner);
>> +
>> +err0:
>> +    of_node_put(node);
>> +
>> +    return phy;
>> +}
>> +EXPORT_SYMBOL_GPL(of_phy_get);
>
>> +/**
>> + * phy_create() - create a new phy
>> + * @dev: device that is creating the new phy
>> + * @label: label given to phy
>> + * @of_node: device node of the phy
>> + * @type: specifies the phy type
>> + * @ops: function pointers for performing phy operations
>> + * @priv: private data from phy driver
>> + *
>> + * Called to create a phy using phy framework.
>> + */
>> +struct phy *phy_create(struct device *dev, const char *label,
>> +    struct device_node *of_node, int type, struct phy_ops *ops,
>> +    void *priv)
>> +{
>> +    int ret;
>> +    struct phy *phy;
>> +    struct phy_bind *phy_bind;
>> +    const char *devname = NULL;
>> +
>> +    if (!dev) {
>> +        dev_err(dev, "no device provided for PHY\n");
>> +        ret = -EINVAL;
>> +        goto err0;
>> +    }
>> +
>> +    if (!ops || !ops->of_xlate || !priv) {
>> +        dev_err(dev, "no PHY ops/PHY data provided\n");
>> +        ret = -EINVAL;
>> +        goto err0;
>> +    }
>> +
>> +    if (!phy_class)
>> +        phy_core_init();
>> +
>> +    phy = kzalloc(sizeof(*phy), GFP_KERNEL);
>> +    if (!phy) {
>> +        ret = -ENOMEM;
>> +        goto err0;
>> +    }
>> +
>> +    devname = dev_name(dev);
>
> OK, a sort of serious issue here is that you can't call phy_create()
> multiple times for same PHY provider device.

Ah.. right.
>
> device_add() further will fail as sysfs won't let you create multiple
> objects with same name. So I would assume we need to add a new argument
> to phy_create() or rename @type to e.g. @phy_id, which could be
> concatenated with dev_name(dev) to create a unique name of the phy
> device.

hmm.. ok
>
> And how is the PHY provider supposed to identify a PHY in its common PHY
> ops, now when the struct phy port field is removed ? I have temporarily
> (ab)used the type field in order to select proper registers within the
> phy ops.

Can't it be handled in the PHY provider driver without using struct phy 
*? Moreover the PHY ops passes on the *phy to phy provider right? Not 
sure I understand you here.
>
>> +    device_initialize(&phy->dev);
>> +
>> +    phy->dev.class = phy_class;
>> +    phy->dev.parent = dev;
>> +    phy->label = label;
>
> What about duplicating the string here ? That would make the API a bit
> more convenient and the callers wouldn't be required to keep all the
> labels around.

you mean use dev_name? The device names are sometime more cryptic so 
preferred to have it like this.
>
>> +    phy->of_node = of_node;
>> +    phy->type = type;
>> +    phy->ops = ops;
>> +
>> +    dev_set_drvdata(&phy->dev, priv);
>> +
>> +    ret = dev_set_name(&phy->dev, "%s", devname);
>> +    if (ret)
>> +        goto err1;
>> +
>> +    mutex_lock(&phy_bind_mutex);
>> +    list_for_each_entry(phy_bind,&phy_bind_list, list)
>> +        if (!(strcmp(phy_bind->phy_dev_name, devname)))
>> +            phy_bind->phy = phy;
>> +    mutex_unlock(&phy_bind_mutex);
>> +
>> +    ret = device_add(&phy->dev);
>> +    if (ret)
>> +        goto err2;
>> +
>> +    return phy;
>> +
>> +err2:
>> +    phy_bind->phy = NULL;
>> +
>> +err1:
>> +    put_device(&phy->dev);
>> +    kfree(phy);
>> +
>> +err0:
>> +    return ERR_PTR(ret);
>> +}
>> +EXPORT_SYMBOL_GPL(phy_create);
>
>> +/**
>> + * devm_phy_create() - create a new phy
>> + * @dev: device that is creating the new phy
>> + * @dev: device that is creating the new phy
>
> Duplicated line.
>
>> + * @label: label given to phy
>> + * @of_node: device node of the phy
>> + * @type: specifies the phy type
>> + * @ops: function pointers for performing phy operations
>> + * @priv: private data from phy driver
>> + *
>> + * Creates a new PHY device adding it to the PHY class.
>> + * While at that, it also associates the device with the phy using
>> devres.
>> + * On driver detach, release function is invoked on the devres data,
>> + * then, devres data is freed.
>> + */
>
>> +/**
>> + * phy_bind() - bind the phy and the controller that uses the phy
>> + * @dev_name: the device name of the device that will bind to the phy
>> + * @index: index to specify the port number
>> + * @phy_dev_name: the device name of the phy
>> + *
>> + * Fills the phy_bind structure with the dev_name and phy_dev_name.
>> This will
>> + * be used when the phy driver registers the phy and when the controller
>> + * requests this phy.
>> + *
>> + * To be used by platform specific initialization code.
>> + */
>> +struct phy_bind *phy_bind(const char *dev_name, int index,
>> +                const char *phy_dev_name)
>> +{
>> +    struct phy_bind *phy_bind;
>> +
>> +    mutex_lock(&phy_bind_mutex);
>> +    list_for_each_entry(phy_bind,&phy_bind_list, list) {
>> +        if (!strcmp(phy_bind->dev_name, dev_name)&&  phy_bind->index ==
>> +            index) {
>> +            phy_bind->phy_dev_name = phy_dev_name;
>> +            goto ret0;
>> +        }
>> +    }
>> +
>> +    phy_bind = kzalloc(sizeof(*phy_bind), GFP_KERNEL);
>> +    if (!phy_bind) {
>> +        phy_bind = ERR_PTR(-ENOMEM);
>> +        goto ret0;
>> +    }
>> +
>> +    phy_bind->dev_name = dev_name;
>> +    phy_bind->phy_dev_name = phy_dev_name;
>> +    phy_bind->index = index;
>> +    phy_bind->auto_bind = false;
>> +
>> +    list_add_tail(&phy_bind->list,&phy_bind_list);
>> +
>> +ret0:
>> +    mutex_unlock(&phy_bind_mutex);
>> +    return phy_bind;
>> +}
>> +EXPORT_SYMBOL_GPL(phy_bind);
>
>> +static ssize_t phy_bind_show(struct device *dev,
>> +                 struct device_attribute *attr, char *buf)
>> +{
>> +    struct phy_bind *phy_bind;
>> +    struct phy *phy;
>> +    char *p = buf;
>> +    int len;
>> +
>> +    phy = container_of(dev, struct phy, dev);
>> +
>
> Shouldn't this iteration also be protected with the phy_bind_mutex ?

Indeed.
>
>> +    list_for_each_entry(phy_bind,&phy_bind_list, list)
>> +        if (phy_bind->phy == phy)
>> +            p += sprintf(p, "%s %d %s\n", phy_bind->dev_name,
>> +                phy_bind->index, phy_bind->phy_dev_name);
>> +    len = (p - buf);
>> +
>> +    return len;
>> +}
>
>> +static int phy_core_init(void)
>> +{
>> +    if (phy_class)
>> +        return 0;
>> +
>> +    phy_class = class_create(THIS_MODULE, "phy");
>> +    if (IS_ERR(phy_class)) {
>> +        pr_err("failed to create phy class -->  %ld\n",
>> +            PTR_ERR(phy_class));
>> +        return PTR_ERR(phy_class);
>> +    }
>> +
>> +    phy_class->dev_release = phy_release;
>> +    phy_class->dev_attrs = phy_dev_attrs;
>> +
>> +    return 0;
>> +}
>> +module_init(phy_core_init);
>
> Having this framework initialized before the PHY provider and consumer
> drivers could save some unnecessary probe deferrals. I was wondering,
> what is really an advantage of having it as module_init(), rather than
> subsys_initcall() ?

I'm not sure of the exact reason but after the advent of EPROBE_DEFER, 
everyone recommends to use module_init only.

Thanks for the detailed review.

Regards
Kishon

^ permalink raw reply

* Re: [net-next PATCH V2] net: frag queue per hash bucket locking
From: Jesper Dangaard Brouer @ 2013-04-04  9:27 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, David S. Miller, netdev, Florian Westphal,
	Daniel Borkmann
In-Reply-To: <20130404090349.GE20292@order.stressinduktion.org>

On Thu, 2013-04-04 at 11:03 +0200, Hannes Frederic Sowa wrote:
> On Thu, Apr 04, 2013 at 09:52:26AM +0200, Jesper Dangaard Brouer wrote:
> > +struct inet_frag_bucket {
> > +	struct hlist_head	chain;
> > +	spinlock_t		chain_lock;
> > +	u16			chain_len;
> > +};
> > +
> 
> I just noticed and wanted to ask for what chain_len is needed?  Could it
> be dropped?

It could be dropped from this patch.  Its part of my future hash cleanup
strategy.
I also wanted to use it to replace the nqueues counter, but its would
not be correct, because nqueues counter is maintained per netns (network
namespace).

Its currently the netns separation, which is causing "headaches" for my
removal of LRU and direct-hash-cleaning solution...

> If the elements are swapped between the hash buckets in
> inet_frag_secret_rebuild it seems you forgot to update chain_len
> correctly.

Ah, good catch.
Given its not even correct, I'll remove the chain_len and repost a V3.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [net-next PATCH V3] net: frag queue per hash bucket locking
From: Jesper Dangaard Brouer @ 2013-04-04  9:38 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller
  Cc: Jesper Dangaard Brouer, netdev, Florian Westphal, Daniel Borkmann,
	Hannes Frederic Sowa
In-Reply-To: <1365067672.12728.46.camel@localhost>

This patch implements per hash bucket locking for the frag queue
hash.  This removes two write locks, and the only remaining write
lock is for protecting hash rebuild.  This essentially reduce the
readers-writer lock to a rebuild lock.

This patch is part of "net: frag performance followup"
 http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:

Same test setup as previous:
 (http://thread.gmane.org/gmane.linux.network/257155)
 Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
 Ethernet flow-control.  A third interface is used for generating the
 DoS attack (with trafgen).

Notice, I have changed the frag DoS generator script to be more
efficient/deadly.  Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.

Test types summary (netperf UDP_STREAM):
 Test-20G64K     == 2x10G with 65K fragments
 Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
 Test-20G64K+DoS == Same as 20G64K with frag DoS
 Test-20G3F+DoS  == Same as 20G3F  with frag DoS
 Test-20G64K+MQ  == Same as 20G64K with Multi-Queue frag DoS
 Test-20G3F+MQ   == Same as 20G3F  with Multi-Queue frag DoS

When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression.  BUT this
was caused by some unrelated change in-between.  See tests below.

Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch

Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.

Performance table summary (in Mbit/s):

(#) Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
    ----------  -------   -------  ----------  ---------  --------  -------
(A) Patch-02  : 18848.7   13230.1   4103.04     5310.36     130.0    440.2
(B) 1b5ab0de  : 18841.5   13156.8   4101.08     5314.57     129.0    424.2
(C) Patch-03v1: 18838.0   13490.5   4405.11     6814.72     196.6    461.6

(D) a210576c  : 18321.5   11250.4   3635.34     5160.13     119.1    405.2
(E) with _bh  : 17247.3   11492.6   3994.74     6405.29     166.7    413.6
(F) without bh: 17471.3   11298.7   3818.05     6102.11     165.7    406.3

Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.

I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried).  But the other results does show
improvements.  And test (E) "with _bh" version is slightly better.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

----
V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
  need the spinlock _bh versions, as Netfilter currently does a
  local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail
V3:
- Drop the chain_len counter per hash bucket.
---

 include/net/inet_frag.h  |    8 ++++++
 net/ipv4/inet_fragment.c |   57 ++++++++++++++++++++++++++++++++++++----------
 2 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7cac9c5..6f41b45 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -50,10 +50,16 @@ struct inet_frag_queue {
  */
 #define INETFRAGS_MAXDEPTH		128
 
+struct inet_frag_bucket {
+	struct hlist_head	chain;
+	spinlock_t		chain_lock;
+};
+
 struct inet_frags {
-	struct hlist_head	hash[INETFRAGS_HASHSZ];
+	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
 	/* This rwlock is a global lock (seperate per IPv4, IPv6 and
 	 * netfilter). Important to keep this on a seperate cacheline.
+	 * Its primarily a rebuild protection rwlock.
 	 */
 	rwlock_t		lock ____cacheline_aligned_in_smp;
 	int			secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1206ca6..e97d66a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -52,20 +52,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 	unsigned long now = jiffies;
 	int i;
 
+	/* Per bucket lock NOT needed here, due to write lock protection */
 	write_lock(&f->lock);
+
 	get_random_bytes(&f->rnd, sizeof(u32));
 	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb;
 		struct inet_frag_queue *q;
 		struct hlist_node *n;
 
-		hlist_for_each_entry_safe(q, n, &f->hash[i], list) {
+		hb = &f->hash[i];
+		hlist_for_each_entry_safe(q, n, &hb->chain, list) {
 			unsigned int hval = f->hashfn(q);
 
 			if (hval != i) {
+				struct inet_frag_bucket *hb_dest;
+
 				hlist_del(&q->list);
 
 				/* Relink to new hash chain. */
-				hlist_add_head(&q->list, &f->hash[hval]);
+				hb_dest = &f->hash[hval];
+				hlist_add_head(&q->list, &hb_dest->chain);
 			}
 		}
 	}
@@ -78,9 +85,12 @@ void inet_frags_init(struct inet_frags *f)
 {
 	int i;
 
-	for (i = 0; i < INETFRAGS_HASHSZ; i++)
-		INIT_HLIST_HEAD(&f->hash[i]);
+	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb = &f->hash[i];
 
+		spin_lock_init(&hb->chain_lock);
+		INIT_HLIST_HEAD(&hb->chain);
+	}
 	rwlock_init(&f->lock);
 
 	f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -122,9 +132,18 @@ EXPORT_SYMBOL(inet_frags_exit_net);
 
 static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
-	write_lock(&f->lock);
+	struct inet_frag_bucket *hb;
+	unsigned int hash;
+
+	read_lock(&f->lock);
+	hash = f->hashfn(fq);
+	hb = &f->hash[hash];
+
+	spin_lock(&hb->chain_lock);
 	hlist_del(&fq->list);
-	write_unlock(&f->lock);
+	spin_unlock(&hb->chain_lock);
+
+	read_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
 
@@ -226,27 +245,32 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		struct inet_frag_queue *qp_in, struct inet_frags *f,
 		void *arg)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *qp;
 #ifdef CONFIG_SMP
 #endif
 	unsigned int hash;
 
-	write_lock(&f->lock);
+	read_lock(&f->lock); /* Protects against hash rebuild */
 	/*
 	 * While we stayed w/o the lock other CPU could update
 	 * the rnd seed, so we need to re-calculate the hash
 	 * chain. Fortunatelly the qp_in can be used to get one.
 	 */
 	hash = f->hashfn(qp_in);
+	hb = &f->hash[hash];
+	spin_lock(&hb->chain_lock);
+
 #ifdef CONFIG_SMP
 	/* With SMP race we have to recheck hash table, because
 	 * such entry could be created on other cpu, while we
-	 * promoted read lock to write lock.
+	 * released the hash bucket lock.
 	 */
-	hlist_for_each_entry(qp, &f->hash[hash], list) {
+	hlist_for_each_entry(qp, &hb->chain, list) {
 		if (qp->net == nf && f->match(qp, arg)) {
 			atomic_inc(&qp->refcnt);
-			write_unlock(&f->lock);
+			spin_unlock(&hb->chain_lock);
+			read_unlock(&f->lock);
 			qp_in->last_in |= INET_FRAG_COMPLETE;
 			inet_frag_put(qp_in, f);
 			return qp;
@@ -258,8 +282,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		atomic_inc(&qp->refcnt);
 
 	atomic_inc(&qp->refcnt);
-	hlist_add_head(&qp->list, &f->hash[hash]);
-	write_unlock(&f->lock);
+	hlist_add_head(&qp->list, &hb->chain);
+	spin_unlock(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
 }
@@ -300,17 +325,23 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
 	int depth = 0;
 
-	hlist_for_each_entry(q, &f->hash[hash], list) {
+	hb = &f->hash[hash];
+
+	spin_lock(&hb->chain_lock);
+	hlist_for_each_entry(q, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
+			spin_unlock(&hb->chain_lock);
 			read_unlock(&f->lock);
 			return q;
 		}
 		depth++;
 	}
+	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
 
 	if (depth <= INETFRAGS_MAXDEPTH)

^ permalink raw reply related

* Re: [net-next PATCH V3] net: frag queue per hash bucket locking
From: Hannes Frederic Sowa @ 2013-04-04  9:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, netdev, Florian Westphal,
	Daniel Borkmann
In-Reply-To: <20130404093816.3613.42475.stgit@dragon>

On Thu, Apr 04, 2013 at 11:38:16AM +0200, Jesper Dangaard Brouer wrote:
 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> ----
> V2:
> - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
>   need the spinlock _bh versions, as Netfilter currently does a
>   local_bh_disable() before entering inet_fragment.
> - Fold-in desc from cover-mail
> V3:
> - Drop the chain_len counter per hash bucket.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

^ permalink raw reply

* [Patch net-next] tcp: add a global sysctl to control TCP delayed ack
From: Cong Wang @ 2013-04-04 10:16 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Rick Jones, Stephen Hemminger, David S. Miller,
	Thomas Graf, David Laight, Cong Wang

From: Cong Wang <amwang@redhat.com>

Change from RFC:
* make the sysctl per netns

According to previous discussion, it seems there is no
reasonable heuristics.

Similar to TCP_QUICK_ACK option, but for people who can't
modify the source code and still wants to control
TCP delayed ACK behavior.

David, do you still have any objection?

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>

---
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f98ca63..9b39681 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -572,6 +572,11 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 100
 
+tcp_quick_ack - BOOLEAN
+	Globally enables or disables TCP delayed ACK. The applications
+	can still change the quick ACK mode by TCP_QUICK_ACK option.
+	Default: off
+
 UDP variables:
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 2ba9de8..f03298a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -63,6 +63,7 @@ struct netns_ipv4 {
 	int sysctl_icmp_errors_use_inbound_ifaddr;
 
 	int sysctl_tcp_ecn;
+	int sysctl_tcp_quick_ack;
 
 	kgid_t sysctl_ping_group_range[2];
 	long sysctl_tcp_mem[3];
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fa2f63f..1db1780 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -837,6 +837,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_tcp_mem,
 	},
+	{
+		.procname	= "tcp_quick_ack",
+		.data		= &init_net.ipv4.sysctl_tcp_quick_ack,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
@@ -866,6 +873,8 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 			&net->ipv4.sysctl_ping_group_range;
 		table[7].data =
 			&net->ipv4.sysctl_tcp_ecn;
+		table[9].data =
+			&net->ipv4.sysctl_tcp_quick_ack;
 
 		/* Don't export sysctls to unprivileged users */
 		if (net->user_ns != &init_user_ns)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6d9ca35..a1b44f3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3781,7 +3781,8 @@ static void tcp_fin(struct sock *sk)
 	case TCP_ESTABLISHED:
 		/* Move to CLOSE_WAIT */
 		tcp_set_state(sk, TCP_CLOSE_WAIT);
-		inet_csk(sk)->icsk_ack.pingpong = 1;
+		if (!sock_net(sk)->ipv4.sysctl_tcp_quick_ack)
+			inet_csk(sk)->icsk_ack.pingpong = 1;
 		break;
 
 	case TCP_CLOSE_WAIT:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index af354c98..130fc99 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -169,8 +169,9 @@ static void tcp_event_data_sent(struct tcp_sock *tp,
 	/* If it is a reply for ato after last received
 	 * packet, enter pingpong mode.
 	 */
-	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
-		icsk->icsk_ack.pingpong = 1;
+	if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato &&
+	    !sock_net(sk)->ipv4.sysctl_tcp_quick_ack)
+			icsk->icsk_ack.pingpong = 1;
 }
 
 /* Account for an ACK we sent. */

^ permalink raw reply related

* Re: [PATCH 1/2] Revert "af_unix: dont send SCM_CREDENTIAL when dest socket is NULL"
From: Eric W. Biederman @ 2013-04-04 10:22 UTC (permalink / raw)
  To: dingtianhong
  Cc: David S. Miller, Sven Joachim, Greg Kroah-Hartman, linux-kernel,
	stable, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
	Eric Dumazet
In-Reply-To: <515D30F6.9010004@huawei.com>

dingtianhong <dingtianhong@huawei.com> writes:

> On 2013/4/4 10:13, Eric W. Biederman wrote:
>> 
>> This reverts commit 14134f6584212d585b310ce95428014b653dfaf6.
>> 
>> The problem that the above patch was meant to address is that af_unix
>> messages are not being coallesced because we are sending unnecesarry
>> credentials.  Not sending credentials in maybe_add_creds totally
>> breaks unconnected unix domain sockets that wish to send credentails
>> to other sockets.
>> 
>
> thanks for check the question and make a fix solution, but I still doubt that if unconnected unix
> domain socket wish to send credentails to oher sockets, why dont set
> SOCK_PASSCRED on sock->flags, I think the user need to decide the param
> and shouldnt send creds by default way.

The big issue is the semantics are the receiver sets SOCK_PASSCRED when
they want to receive credentials.  When transmitting packets from
unconnected or unaccepted sockets we don't know if the receiver has set
SOCK_PASSCRED so when in doubt transmit. Historically we always
tranmitted credentials.

Furthermore we have a real regression in udev that breaks systems, so
this patch must be reverted.

Eric

^ permalink raw reply

* [PATCH 4/5 v2] ARM: orion5x: add gigabit ethernet device tree node
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli
In-Reply-To: <1365071235-11611-1-git-send-email-florian@openwrt.org>

This patch adds the gigabit ethernet device tree node to orion5x.dtsi.
This node is disabled by default and must be enabled on a per-board
basis. For completeness and easier testing the MDIO node is also added
and disabled by default.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
Changes since v1:
- fixed off-by 0x2000 address of the ethernet-group node

 arch/arm/boot/dts/orion5x.dtsi |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm/boot/dts/orion5x.dtsi b/arch/arm/boot/dts/orion5x.dtsi
index f7bec3b..c49503e 100644
--- a/arch/arm/boot/dts/orion5x.dtsi
+++ b/arch/arm/boot/dts/orion5x.dtsi
@@ -99,5 +99,28 @@
 			interrupts = <28>;
 			status = "okay";
 		};
+
+		mdio@72004 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,orion-mdio";
+			reg = <0x72004 0x84>;
+			status = "disabled";
+		};
+
+		ethernet-group@72000 {
+			#address-cells = <1>.
+			#size-cells = <0>;
+			compatible = "marvell,mv643xx-eth-block";
+			reg = <0x72000 0x4000>;
+			status = "disabled";
+
+			egiga0: ethernet@0 {
+				device_type = "network";
+				compatible = "marvell,mv643xx-eth";
+				reg = <0>;
+				interrupts = <21>;
+			};
+		};
 	};
 };
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/5 v2] ARM: kirkwood: add device node entries for the gigabit interfaces
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli
In-Reply-To: <1365071235-11611-1-git-send-email-florian@openwrt.org>

This patch modifies kirkwood.dtsi to specify the various gigabit
interfaces nodes available on kirkwood devices. They are disabled by
default and should be enabled on a per-board basis. egiga0 and egiga1
aliases are defined for convenience. The mdio node is also present and
should be enabled on a per-board basis as well.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
Changes since v1:
- dropped change to arch/arm/mach-kirkwood/common.c to avoid merge conflicts
- fixed off-by 0x2000 ethernet-group nodes address

 arch/arm/boot/dts/kirkwood.dtsi |   46 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/arch/arm/boot/dts/kirkwood.dtsi b/arch/arm/boot/dts/kirkwood.dtsi
index fada7e6..254f5a8 100644
--- a/arch/arm/boot/dts/kirkwood.dtsi
+++ b/arch/arm/boot/dts/kirkwood.dtsi
@@ -7,6 +7,8 @@
 	aliases {
 	       gpio0 = &gpio0;
 	       gpio1 = &gpio1;
+	       egiga0 = &egiga0;
+	       egiga1 = &egiga1;
 	};
 	intc: interrupt-controller {
 		compatible = "marvell,orion-intc", "marvell,intc";
@@ -202,5 +204,49 @@
 			clocks = <&gate_clk 4>;
 			status = "disabled";
 		};
+
+		mdio@72004 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,orion-mdio";
+			reg = <0x72004 0x84>;
+			status = "disabled";
+		};
+
+		ethernet-group@72000 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,mv643xx-eth-block";
+			reg = <0x72000 0x4000>;
+			tx-csum-limit = <1600>;
+			status = "disabled";
+
+			egiga0: egiga0@0 {
+				device_type = "network";
+				compatible = "marvell,mv643xx-eth";
+				reg = <0>;
+				interrupts = <11>;
+				clocks = <&gate_clk 0>;
+				clock-names = "0";
+			};
+		};
+
+		ethernet-group@76000 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,mv643xx-eth-block";
+			reg = <0x76000 0x4000>;
+			tx-csum-limit = <1600>;
+			status = "disabled";
+
+			egiga1: egiga1@0 {
+				device_type = "network";
+				compatible = "marvell,mv643xx-eth";
+				reg = <0>;
+				interrupts = <15>;
+				clocks = <&gate_clk 19>;
+				clock-names = "1";
+			};
+		};
 	};
 };
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/5 v2] mv643xx_eth: device tree bindings
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli

Hi all,

This patch serie implements mv643xx_eth device tree bindings. I opted for
the reuse of the bindings already defined in
Documentation/devicetree/bindings/marvell.txt so that we do not create
any confusion.

For reference, I have included the mv643xx-eth related nodes in the
corresponding Kirkwood, Dove and Orion5x .dtsi files so you can more
easily test on your own platforms.

I tested these on a custom 88F6181-based boards.

Florian Fainelli (5):
  mv643xx_eth: add Device Tree bindings
  mv643xx_eth: update Device Tree bindings documentation
  ARM: kirkwood: add device node entries for the gigabit interfaces
  ARM: orion5x: add gigabit ethernet device tree node
  ARM: dove: add gigabit device tree nodes to dove.dtsi

 Documentation/devicetree/bindings/marvell.txt |   25 +++++-
 arch/arm/boot/dts/dove.dtsi                   |   25 ++++++
 arch/arm/boot/dts/kirkwood.dtsi               |   46 ++++++++++
 arch/arm/boot/dts/orion5x.dtsi                |   23 +++++
 drivers/net/ethernet/marvell/mv643xx_eth.c    |  120 ++++++++++++++++++++++++-
 5 files changed, 234 insertions(+), 5 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* [PATCH 2/5] mv643xx_eth: update Device Tree bindings documentation
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli
In-Reply-To: <1365071235-11611-1-git-send-email-florian@openwrt.org>

This patch updates the Device Tree bindings documentation for the
Marvell MV643xx ethernet controller node and makes the
"local-mac-address" and "phy" properties optional instead of required.
The absence of a "phy" node is legal and describes a MAC to switch
configuration, while not specifying the "local-mac-address" allows for
reuse of the MAC address already programmed into the Ethernet
controller.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 Documentation/devicetree/bindings/marvell.txt |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt
index e70a013..994b933 100644
--- a/Documentation/devicetree/bindings/marvell.txt
+++ b/Documentation/devicetree/bindings/marvell.txt
@@ -141,11 +141,11 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
      - interrupts : <a> where a is the interrupt number for the port.
      - interrupt-parent : the phandle for the interrupt controller
        that services interrupts for this device.
+
+   Optional properties:
      - phy : the phandle for the PHY connected to this ethernet
        controller.
      - local-mac-address : 6 bytes, MAC address
-
-   Optional properties:
      - clocks : Phandle to the clock control device and gate bit
      - clock-names : String describing the clock gate bit
      - speed : Speed to force the link (10, 100, 1000), used when no
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/5 v2] mv643xx_eth: add Device Tree bindings
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli
In-Reply-To: <1365071235-11611-1-git-send-email-florian@openwrt.org>

This patch adds Device Tree bindings following the already defined
bindings at Documentation/devicetree/bindings/marvell.txt. The binding
documentation is also enhanced with new optionnal properties required
for supporting certain devices (RX/TX queue and SRAM). Since we now have
proper support for the orion MDIO bus driver, there is no need to fiddle
around with device tree phandles. PHY-less (MAC connected to switch)
configurations are supported by not specifying any phy phandle for an
ethernet node.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
- properly ifdef of_platform_bus_probe with CONFIG_OF
- handle of_platform_bus_probe errors and cleanup accordingly
- use of_property_read_u32 where applicable
- parse "duplex" and "speed" property in PHY-less configuration

 Documentation/devicetree/bindings/marvell.txt |   25 +++++-
 drivers/net/ethernet/marvell/mv643xx_eth.c    |  120 ++++++++++++++++++++++++-
 2 files changed, 140 insertions(+), 5 deletions(-)

diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt
index f1533d9..e70a013 100644
--- a/Documentation/devicetree/bindings/marvell.txt
+++ b/Documentation/devicetree/bindings/marvell.txt
@@ -112,9 +112,14 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
    Required properties:
      - #address-cells : <1>
      - #size-cells : <0>
-     - compatible : "marvell,mv64360-eth-block"
+     - compatible : "marvell,mv64360-eth-block", "marvell,mv64360-eth-group",
+		    "marvell,mv643xx-eth-block"
      - reg : Offset and length of the register set for this block
 
+   Optional properties:
+     - tx-csum-limit : Hardware limit above which transmit checksumming
+                       is disabled.
+
    Example Discovery Ethernet block node:
      ethernet-block@2000 {
 	     #address-cells = <1>;
@@ -130,7 +135,7 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
 
    Required properties:
      - device_type : Should be "network".
-     - compatible : Should be "marvell,mv64360-eth".
+     - compatible : Should be "marvell,mv64360-eth", "marvell,mv643xx-eth".
      - reg : Should be <0>, <1>, or <2>, according to which registers
        within the silicon block the device uses.
      - interrupts : <a> where a is the interrupt number for the port.
@@ -140,6 +145,22 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
        controller.
      - local-mac-address : 6 bytes, MAC address
 
+   Optional properties:
+     - clocks : Phandle to the clock control device and gate bit
+     - clock-names : String describing the clock gate bit
+     - speed : Speed to force the link (10, 100, 1000), used when no
+               phy property is defined
+     - duplex : Duplex to force the link (0: half, 1: full), used when no
+               phy property is defined
+     - rx-queue-count : number of RX queues to use
+     - tx-queue-count : number of TX queues to use
+     - rx-queue-size : size of the RX queue (in bytes)
+     - tx-queue-size : size of the TX queue (in bytes)
+     - rx-sram-addr : address of the SRAM for RX path (non 0 means used)
+     - rx-sram-size : size of the SRAM for RX path (non 0 means used)
+     - tx-sram-addr : address of the SRAM for TX path (non 0 means used)
+     - tx-sram-size : size of the SRAM for TX path (non 0 means used)
+
    Example Discovery Ethernet port node:
      ethernet@0 {
 	     device_type = "network";
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index aedbd82..75599a8 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -60,6 +60,10 @@
 #include <linux/inet_lro.h>
 #include <linux/slab.h>
 #include <linux/clk.h>
+#include <linux/stringify.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
+#include <linux/of_net.h>
 
 static char mv643xx_eth_driver_name[] = "mv643xx_eth";
 static char mv643xx_eth_driver_version[] = "1.4";
@@ -2542,14 +2546,23 @@ static void infer_hw_params(struct mv643xx_eth_shared_private *msp)
 	}
 }
 
+static const struct of_device_id mv643xx_eth_match[] = {
+	{ .compatible = "marvell,mv64360-eth" },
+	{ .compatible = "marvell,mv643xx-eth" },
+	{ /* sentinel */ },
+};
+MODULE_DEVICE_TABLE(of, mv643xx_eth_match);
+
 static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 {
 	static int mv643xx_eth_version_printed;
+	struct device_node *np = pdev->dev.of_node;
 	struct mv643xx_eth_shared_platform_data *pd = pdev->dev.platform_data;
 	struct mv643xx_eth_shared_private *msp;
 	const struct mbus_dram_target_info *dram;
 	struct resource *res;
 	int ret;
+	int tx_csum_limit = 0;
 
 	if (!mv643xx_eth_version_printed++)
 		pr_notice("MV-643xx 10/100/1000 ethernet driver version %s\n",
@@ -2576,13 +2589,23 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 	if (dram)
 		mv643xx_eth_conf_mbus_windows(msp, dram);
 
-	msp->tx_csum_limit = (pd != NULL && pd->tx_csum_limit) ?
-					pd->tx_csum_limit : 9 * 1024;
+	if (np)
+		of_property_read_u32(np, "tx-csum-limit", &tx_csum_limit);
+	else
+		tx_csum_limit = pd->tx_csum_limit;
+
+	msp->tx_csum_limit = tx_csum_limit ? tx_csum_limit : 9 * 1024;
 	infer_hw_params(msp);
 
 	platform_set_drvdata(pdev, msp);
+	ret = 0;
 
-	return 0;
+#ifdef CONFIG_OF
+	ret = of_platform_bus_probe(np, mv643xx_eth_match, &pdev->dev);
+	if (ret)
+		goto out_free;
+#endif
+	return ret;
 
 out_free:
 	kfree(msp);
@@ -2600,12 +2623,22 @@ static int mv643xx_eth_shared_remove(struct platform_device *pdev)
 	return 0;
 }
 
+static const struct of_device_id mv643xx_eth_shared_match[] = {
+	{ .compatible = "marvell,mv64360-eth-group" },
+	{ .compatible = "marvell,mv64360-eth-block" },
+	{ .compatible = "marvell,mv643xx-eth-group" },
+	{ .compatible = "marvell,mv643xx-eth-block" },
+	{ /* sentinel */ },
+};
+MODULE_DEVICE_TABLE(of, mv643xx_eth_shared_match);
+
 static struct platform_driver mv643xx_eth_shared_driver = {
 	.probe		= mv643xx_eth_shared_probe,
 	.remove		= mv643xx_eth_shared_remove,
 	.driver = {
 		.name	= MV643XX_ETH_SHARED_NAME,
 		.owner	= THIS_MODULE,
+		.of_match_table = of_match_ptr(mv643xx_eth_shared_match),
 	},
 };
 
@@ -2764,6 +2797,74 @@ static const struct net_device_ops mv643xx_eth_netdev_ops = {
 #endif
 };
 
+#ifdef CONFIG_OF
+static int mv643xx_eth_of_probe(struct platform_device *pdev)
+{
+	struct mv643xx_eth_platform_data *pd;
+	struct device_node *np = pdev->dev.of_node;
+	struct device_node *shared = of_get_parent(np);
+	struct device_node *phy_node;
+	const int *prop;
+	const char *mac_addr;
+
+	if (!pdev->dev.of_node)
+		return 0;
+
+	pd = kzalloc(sizeof(*pd), GFP_KERNEL);
+	if (!pd)
+		return -ENOMEM;
+
+	pdev->dev.platform_data = pd;
+
+	pd->shared = of_find_device_by_node(shared);
+	if (!pd->shared)
+		return -ENODEV;
+
+	prop = of_get_property(np, "reg", NULL);
+	if (!prop)
+		return -EINVAL;
+
+	pd->port_number = be32_to_cpup(prop);
+
+	phy_node = of_parse_phandle(np, "phy", 0);
+	if (!phy_node) {
+		pd->phy_addr = MV643XX_ETH_PHY_NONE;
+
+		of_property_read_u32(np, "speed", &pd->speed);
+		of_property_read_u32(np, "duplex", &pd->duplex);
+	} else {
+		prop = of_get_property(phy_node, "reg", NULL);
+		if (prop)
+			pd->phy_addr = be32_to_cpup(prop);
+	}
+
+	mac_addr = of_get_mac_address(np);
+	if (mac_addr)
+		memcpy(pd->mac_addr, mac_addr, ETH_ALEN);
+
+#define rx_tx_queue_sram_property(_name)			\
+	prop = of_get_property(np, __stringify(_name), NULL);	\
+	if (prop)						\
+		pd->_name = be32_to_cpup(prop);
+
+	rx_tx_queue_sram_property(rx_queue_count);
+	rx_tx_queue_sram_property(tx_queue_count);
+	rx_tx_queue_sram_property(rx_queue_size);
+	rx_tx_queue_sram_property(tx_queue_size);
+	rx_tx_queue_sram_property(rx_sram_addr);
+	rx_tx_queue_sram_property(rx_sram_size);
+	rx_tx_queue_sram_property(tx_sram_addr);
+	rx_tx_queue_sram_property(rx_sram_size);
+
+	return 0;
+}
+#else
+static inline int mv643xx_eth_of_probe(struct platform_device *dev)
+{
+	return 0;
+}
+#endif
+
 static int mv643xx_eth_probe(struct platform_device *pdev)
 {
 	struct mv643xx_eth_platform_data *pd;
@@ -2772,7 +2873,12 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 	struct resource *res;
 	int err;
 
+	err = mv643xx_eth_of_probe(pdev);
+	if (err)
+		return err;
+
 	pd = pdev->dev.platform_data;
+
 	if (pd == NULL) {
 		dev_err(&pdev->dev, "no mv643xx_eth_platform_data\n");
 		return -ENODEV;
@@ -2896,6 +3002,8 @@ out:
 	}
 #endif
 	free_netdev(dev);
+	if (pdev->dev.of_node)
+		kfree(pd);
 
 	return err;
 }
@@ -2903,6 +3011,7 @@ out:
 static int mv643xx_eth_remove(struct platform_device *pdev)
 {
 	struct mv643xx_eth_private *mp = platform_get_drvdata(pdev);
+	struct mv643xx_eth_platform_data *pd = pdev->dev.platform_data;
 
 	unregister_netdev(mp->dev);
 	if (mp->phy != NULL)
@@ -2918,6 +3027,9 @@ static int mv643xx_eth_remove(struct platform_device *pdev)
 
 	free_netdev(mp->dev);
 
+	if (pdev->dev.of_node)
+		kfree(pd);
+
 	platform_set_drvdata(pdev, NULL);
 
 	return 0;
@@ -2935,6 +3047,7 @@ static void mv643xx_eth_shutdown(struct platform_device *pdev)
 		port_reset(mp);
 }
 
+
 static struct platform_driver mv643xx_eth_driver = {
 	.probe		= mv643xx_eth_probe,
 	.remove		= mv643xx_eth_remove,
@@ -2942,6 +3055,7 @@ static struct platform_driver mv643xx_eth_driver = {
 	.driver = {
 		.name	= MV643XX_ETH_NAME,
 		.owner	= THIS_MODULE,
+		.of_match_table = of_match_ptr(mv643xx_eth_match),
 	},
 };
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 5/5 v2] ARM: dove: add gigabit device tree nodes to dove.dtsi
From: Florian Fainelli @ 2013-04-04 10:27 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-arm-kernel, devicetree-discuss, thomas.petazzoni,
	jm, moinejf, sebastian.hesselbarth, buytenh, andrew, jason,
	grant.likely, rob.herring, jogo, Florian Fainelli
In-Reply-To: <1365071235-11611-1-git-send-email-florian@openwrt.org>

This patch adds the gigabit ethernet device tree nodes to dove.dtsi in a
disabled state. The gigabit ethernet device tree node must be enabled on
a per-board basis. For completeness and easier testing the MDIO node is
also added in a disabled state.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
Changes since v1:
- fixed off-by 0x2000 address of the ethernet-group node

 arch/arm/boot/dts/dove.dtsi |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/arm/boot/dts/dove.dtsi b/arch/arm/boot/dts/dove.dtsi
index f7509ca..51de5f7 100644
--- a/arch/arm/boot/dts/dove.dtsi
+++ b/arch/arm/boot/dts/dove.dtsi
@@ -253,5 +253,30 @@
 				dmacap,xor;
 			};
 		};
+
+		mdio@72004 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,orion-mdio";
+			reg = <0x72004 0x84>;
+			status = "disabled";
+		};
+
+		ethernet-group@72000 {
+			#address-cells = <1>;
+			#size-cells = <0>;
+			compatible = "marvell,mv643xx-eth-block";
+			reg = <0x72000 0x4000>;
+			tx-csum-limit = <1600>;
+			status = "disabled";
+
+			egiga0: ethernet@0 {
+				device_type = "network";
+				compatible = "marvell,mv643xx-eth";
+				reg = <0>;
+				interrupts = <29>;
+				clocks = <&gate_clk 2>;
+			};
+		};
 	};
 };
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 2/2] af_unix: If we don't care about credentials coallesce all messages
From: Eric W. Biederman @ 2013-04-04 10:36 UTC (permalink / raw)
  To: dingtianhong
  Cc: David S. Miller, Sven Joachim, Greg Kroah-Hartman, linux-kernel,
	stable, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
	Eric Dumazet
In-Reply-To: <515D3235.7080608@huawei.com>

dingtianhong <dingtianhong@huawei.com> writes:

> On 2013/4/4 10:14, Eric W. Biederman wrote:
>> 
>> It was reported that the following LSB test case failed
>> https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we
>> were not coallescing unix stream messages when the application was
>> expecting us to.
>> 
>> The problem was that the first send was before the socket was accepted
>> and thus sock->sk_socket was NULL in maybe_add_creds, and the second
>> send after the socket was accepted had a non-NULL value for sk->socket
>> and thus we could tell the credentials were not needed so we did not
>> bother.
>> 
>> The unnecessary credentials on the first message cause
>> unix_stream_recvmsg to start verifying that all messages had the same
>> credentials before coallescing and then the coallescing failed because
>> the second message had no credentials.
>> 
>> Ignoring credentials when we don't care in unix_stream_recvmsg fixes a
>> long standing pessimization which would fail to coallesce messages when
>> reading from a unix stream socket if the senders were different even if
>> we did not care about their credentials.
>> 
>> I have tested this and verified that the in the LSB test case mentioned
>> above that the messages do coallesce now, while the were failing to
>> coallesce without this change.
>> 
>> Reported-by: Karel Srot <ksrot@redhat.com>
>> Reported-by: Ding Tianhong <dingtianhong@huawei.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  net/unix/af_unix.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index f153a8d..2db702d 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -1993,7 +1993,7 @@ again:
>>  			if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
>>  			    (UNIXCB(skb).cred != siocb->scm->cred))
>>  				break;
>> -		} else {
>> +		} else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
>>  			/* Copy credentials */
>>  			scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
>>  			check_creds = 1;
>> 
>
> As your opinion, I think the way is better:
>
> 		if (test_bit(SOCK_PASSCRED, &sock->flags)) {
>                         if (check_creds) {
>                                 /* Never glue messages from different writers */
>                                 if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
>                                 (UNIXCB(skb).cred != siocb->scm->cred))
>                                         break;
>                         } else {
>                                 /* Copy credentials */
>                                 scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
>                                 check_creds = 1;
>                         }
>                 }

It is a smidge clearer in intent, but there is no functional
difference.  The lines get really long.

Shrug.

Patches are always welcome.

Beyond getting something correct for the right reasons I don't care.

Eric

^ permalink raw reply

* Re: [PATCH v5 1/6] drivers: phy: add generic PHY framework
From: Sylwester Nawrocki @ 2013-04-04 10:41 UTC (permalink / raw)
  To: Kishon Vijay Abraham I
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, mchehab-H+wXaHxf7aLQT0dZR+AlfA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-lFZ/pmaqli7XmaaqVzeoHQ,
	javier-0uQlZySMnqxg9hUCZPvPmw, cesarb-PWySMVKUnqmsTnJN9+BGXg,
	eballetbo-Re5JQEeQqe8AvxtiuMwx3w,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	rob.herring-bsGFqQB8/DxBDgjK7y7TUQ,
	swarren-DDmLM1+adcrQT0dZR+AlfA, linux-omap-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	broonie-yzvPICuk2AATkU/dhu1WVueM+bqZidxxQQ4Iyu8u01E,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, balbi-l0cyMroinI0,
	santosh.shilimkar-l0cyMroinI0, netdev-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <515D462F.9050109-l0cyMroinI0@public.gmane.org>

Hi,

On 04/04/2013 11:21 AM, Kishon Vijay Abraham I wrote:
> On Thursday 04 April 2013 03:16 AM, Sylwester Nawrocki wrote:
>> On 04/03/2013 02:53 PM, Kishon Vijay Abraham I wrote:

>>> +4. Getting a reference to the PHY
>>> +
>>> +Before the controller can make use of the PHY, it has to get a
>>> reference to
>>> +it. This framework provides 6 APIs to get a reference to the PHY.
>>> +
>>> +struct phy *phy_get(struct device *dev, int index);
>>> +struct phy *devm_phy_get(struct device *dev, int index);
>>> +struct phy *of_phy_get(struct device *dev, const char *phandle, int
>>> index);
>>> +struct phy *devm_of_phy_get(struct device *dev, const char *phandle,
>>> int index);
>>
>> Why do we need separate functions for dt and non-dt ? Couldn't it be
>> handled
>> internally by the framework ? So the PHY users can use same calls for dt
>> and non-dt, like in case of e.g. the regulators API ?
> 
> yeah. Indeed it makes sense to do it that way.
>
>> Also signatures of some functions are different now:
>>
>> extern struct phy *phy_get(struct device *dev, int index);
>> extern struct phy *devm_phy_get(struct device *dev, int index);
>> extern struct phy *of_phy_get(struct device *dev, int index);
>> extern struct phy *devm_of_phy_get(struct device *dev, int index);
> 
> My bad :-(
> 
>> BTW, I think "extern" could be dropped, does it have any significance in
>> function declarations in header files ?
> 
> it shouldn't have any effect I guess. It's harmless nevertheless.

Yup.

>>> +struct phy *of_phy_get_byname(struct device *dev, const char *string);
>>> +struct phy *devm_of_phy_get_byname(struct device *dev, const char
>>> *string);

>>> --- /dev/null
>>> +++ b/drivers/phy/phy-core.c
>>> @@ -0,0 +1,616 @@
>>
>>> +static struct phy *of_phy_lookup(struct device_node *node)
>>> +{
>>> +    struct phy *phy;
>>> +    struct device *dev;
>>> +    struct class_dev_iter iter;
>>> +
>>> +    class_dev_iter_init(&iter, phy_class, NULL, NULL);
>>
>> There is currently nothing preventing a call to this function before
>> phy_class is initialized. It happened during my tests, and the nice
>> stack dump showed that it was the PHY user attempting to get the PHY
>> before the framework got initialized. So either phy_core_init should
>> be a subsys_initcall or we need a better protection against phy_class
>> being NULL or ERR_PTR in more places.
> 
> Whats the initcall used in your PHY user? Looks like more people prefer having

It happened in a regular platform driver probe() callback.

> module_init and any issue because of it should be fixed in PHY providers and
> PHY users.

OK. In fact this uncovered some issues in the MIPI DSI interface driver
(some unexpected interrupts). But this should just be fixed in those drivers
anyway. Now the DSI interface driver needs to wait for the PHY to be
registered and ready, so the initialization will likely be changed regardless
the framework initializes in module_init or earlier.

>>> +    while ((dev = class_dev_iter_next(&iter))) {
>>> +        phy = container_of(dev, struct phy, dev);
>>> +        if (node != phy->of_node)
>>> +            continue;
>>> +
>>> +        class_dev_iter_exit(&iter);
>>> +        return phy;
>>> +    }
>>> +
>>> +    class_dev_iter_exit(&iter);
>>> +    return ERR_PTR(-EPROBE_DEFER);
>>> +}
>>
>>> +/**
>>> + * of_phy_get() - lookup and obtain a reference to a phy by phandle
>>> + * @dev: device that requests this phy
>>> + * @index: the index of the phy
>>> + *
>>> + * Returns the phy associated with the given phandle value,
>>> + * after getting a refcount to it or -ENODEV if there is no such phy or
>>> + * -EPROBE_DEFER if there is a phandle to the phy, but the device is
>>> + * not yet loaded.
>>> + */
>>> +struct phy *of_phy_get(struct device *dev, int index)
>>> +{
>>> +    int ret;
>>> +    struct phy *phy = NULL;
>>> +    struct phy_bind *phy_map = NULL;
>>> +    struct of_phandle_args args;
>>> +    struct device_node *node;
>>> +
>>> +    if (!dev->of_node) {
>>> +        dev_dbg(dev, "device does not have a device node entry\n");
>>> +        return ERR_PTR(-EINVAL);
>>> +    }
>>> +
>>> +    ret = of_parse_phandle_with_args(dev->of_node, "phys", "#phy-cells",
>>> +        index,&args);
>>> +    if (ret) {
>>> +        dev_dbg(dev, "failed to get phy in %s node\n",
>>> +            dev->of_node->full_name);
>>> +        return ERR_PTR(-ENODEV);
>>> +    }
>>> +
>>> +    phy = of_phy_lookup(args.np);
>>> +    if (IS_ERR(phy) || !try_module_get(phy->ops->owner)) {
>>> +        phy = ERR_PTR(-EPROBE_DEFER);
>>> +        goto err0;
>>> +    }
>>> +
>>> +    phy = phy->ops->of_xlate(phy,&args);
>>> +    if (IS_ERR(phy))
>>> +        goto err1;
>>> +
>>> +    phy_map = phy_bind(dev_name(dev), index, dev_name(&phy->dev));
>>> +    if (!IS_ERR(phy_map)) {
>>> +        phy_map->phy = phy;
>>> +        phy_map->auto_bind = true;
>>
>> Shouldn't it be done with the phy_bind_mutex held ? I guess an unlocked
>> version of the phy_bind functions is needed, so it can be used internally.
> 
> The locking is done inside phy_bind function. I'm not sure but I vaguely
> remember getting a dump stack (incorrect mutex ordering or something) when
> trying to have phy_bind_mutex here. I can check it again.

Yes, IIUC the locking needs to be reworked a bit so "phy_map" is not modified
without the mutex held.

>>> +    }
>>> +
>>> +    get_device(&phy->dev);
>>> +
>>> +err1:
>>> +    module_put(phy->ops->owner);
>>> +
>>> +err0:
>>> +    of_node_put(node);
>>> +
>>> +    return phy;
>>> +}
>>> +EXPORT_SYMBOL_GPL(of_phy_get);
>>
>>> +/**
>>> + * phy_create() - create a new phy
>>> + * @dev: device that is creating the new phy
>>> + * @label: label given to phy
>>> + * @of_node: device node of the phy
>>> + * @type: specifies the phy type
>>> + * @ops: function pointers for performing phy operations
>>> + * @priv: private data from phy driver
>>> + *
>>> + * Called to create a phy using phy framework.
>>> + */
>>> +struct phy *phy_create(struct device *dev, const char *label,
>>> +    struct device_node *of_node, int type, struct phy_ops *ops,
>>> +    void *priv)
>>> +{
>>> +    int ret;
>>> +    struct phy *phy;
>>> +    struct phy_bind *phy_bind;
>>> +    const char *devname = NULL;
>>> +
>>> +    if (!dev) {
>>> +        dev_err(dev, "no device provided for PHY\n");
>>> +        ret = -EINVAL;
>>> +        goto err0;
>>> +    }
>>> +
>>> +    if (!ops || !ops->of_xlate || !priv) {
>>> +        dev_err(dev, "no PHY ops/PHY data provided\n");
>>> +        ret = -EINVAL;
>>> +        goto err0;
>>> +    }
>>> +
>>> +    if (!phy_class)
>>> +        phy_core_init();
>>> +
>>> +    phy = kzalloc(sizeof(*phy), GFP_KERNEL);
>>> +    if (!phy) {
>>> +        ret = -ENOMEM;
>>> +        goto err0;
>>> +    }
>>> +
>>> +    devname = dev_name(dev);
>>
>> OK, a sort of serious issue here is that you can't call phy_create()
>> multiple times for same PHY provider device.
> 
> Ah.. right.

The other question I had was why we needed struct device object for each
PHY. And if it wouldn't be enough to have only struct device * in
struct phy, and the real device would be the PHY provider only. I might
be missing some reasons behind this decision though, as I have not been
following your work since the beginning.

>> device_add() further will fail as sysfs won't let you create multiple
>> objects with same name. So I would assume we need to add a new argument
>> to phy_create() or rename @type to e.g. @phy_id, which could be
>> concatenated with dev_name(dev) to create a unique name of the phy
>> device.
> 
> hmm.. ok
>>
>> And how is the PHY provider supposed to identify a PHY in its common PHY
>> ops, now when the struct phy port field is removed ? I have temporarily
>> (ab)used the type field in order to select proper registers within the
>> phy ops.
> 
> Can't it be handled in the PHY provider driver without using struct phy *?

You need to select registers/bit fields corresponding to a specific PHY,
since the phy ops are common. So we need to know exactly with which phy
we deal at the moment and I can't see any other option to figure out that
than by dereferencing struct phy *... ;)

> Moreover the PHY ops passes on the *phy to phy provider right? Not sure I

Yes, all you get in the ops is phy *. I wanted to avoid global variables
in the provider driver and be able to get to any provider private data
from phy * only.

If I use something like below for the PHY provider private data then from
struct phy * only I would need to get an index. This could be looked up by
iterating over phys[] array, but that seems an unnecessary complication.
Maybe there are better/easier ways to resolve it, without adding a new
field to struct phy.

struct phy_povider_priv {
	struct phy *phys[4];
	...
};

static in phy_power_on(struct phy *phy)
{
	struct phy_provider_priv *priv = dev_get_drvdata(&phy->dev);
	int phy_id = ...
	....
}
...
struct phy_ops ops = {
	.power_on = phy_power_on,
	...
};

> understand you here.
>
>>> +    device_initialize(&phy->dev);
>>> +
>>> +    phy->dev.class = phy_class;
>>> +    phy->dev.parent = dev;
>>> +    phy->label = label;
>>
>> What about duplicating the string here ? That would make the API a bit
>> more convenient and the callers wouldn't be required to keep all the
>> labels around.
> 
> you mean use dev_name? The device names are sometime more cryptic so preferred
> to have it like this.

No, I meant something like:

phy->label = kstrdup(label, GFP_KERNEL);

>>> +static int phy_core_init(void)
>>> +{
>>> +    if (phy_class)
>>> +        return 0;
>>> +
>>> +    phy_class = class_create(THIS_MODULE, "phy");
>>> +    if (IS_ERR(phy_class)) {
>>> +        pr_err("failed to create phy class -->  %ld\n",
>>> +            PTR_ERR(phy_class));
>>> +        return PTR_ERR(phy_class);
>>> +    }
>>> +
>>> +    phy_class->dev_release = phy_release;
>>> +    phy_class->dev_attrs = phy_dev_attrs;
>>> +
>>> +    return 0;
>>> +}
>>> +module_init(phy_core_init);
>>
>> Having this framework initialized before the PHY provider and consumer
>> drivers could save some unnecessary probe deferrals. I was wondering,
>> what is really an advantage of having it as module_init(), rather than
>> subsys_initcall() ?
> 
> I'm not sure of the exact reason but after the advent of EPROBE_DEFER, everyone
> recommends to use module_init only.

Ah...OK. We just need to ensure the resources are properly checked before
they are used then.

> Thanks for the detailed review.

Regards,
Sylwester

^ permalink raw reply

* Re: [PATCH v5 1/6] drivers: phy: add generic PHY framework
From: Kishon Vijay Abraham I @ 2013-04-04 11:11 UTC (permalink / raw)
  To: Sylwester Nawrocki
  Cc: balbi, gregkh, arnd, akpm, swarren, rob, netdev, davem, cesarb,
	linux-usb, linux-omap, linux-kernel, tony, grant.likely,
	rob.herring, b-cousson, linux, eballetbo, javier, mchehab,
	santosh.shilimkar, broonie, swarren, linux-doc,
	devicetree-discuss, linux-arm-kernel
In-Reply-To: <515D58F2.7030404@samsung.com>

Hi,

On Thursday 04 April 2013 04:11 PM, Sylwester Nawrocki wrote:
> Hi,
>
> On 04/04/2013 11:21 AM, Kishon Vijay Abraham I wrote:
>> On Thursday 04 April 2013 03:16 AM, Sylwester Nawrocki wrote:
>>> On 04/03/2013 02:53 PM, Kishon Vijay Abraham I wrote:
>
>>>> +4. Getting a reference to the PHY
>>>> +
>>>> +Before the controller can make use of the PHY, it has to get a
>>>> reference to
>>>> +it. This framework provides 6 APIs to get a reference to the PHY.
>>>> +
>>>> +struct phy *phy_get(struct device *dev, int index);
>>>> +struct phy *devm_phy_get(struct device *dev, int index);
>>>> +struct phy *of_phy_get(struct device *dev, const char *phandle, int
>>>> index);
>>>> +struct phy *devm_of_phy_get(struct device *dev, const char *phandle,
>>>> int index);
>>>
>>> Why do we need separate functions for dt and non-dt ? Couldn't it be
>>> handled
>>> internally by the framework ? So the PHY users can use same calls for dt
>>> and non-dt, like in case of e.g. the regulators API ?
>>
>> yeah. Indeed it makes sense to do it that way.
>>
>>> Also signatures of some functions are different now:
>>>
>>> extern struct phy *phy_get(struct device *dev, int index);
>>> extern struct phy *devm_phy_get(struct device *dev, int index);
>>> extern struct phy *of_phy_get(struct device *dev, int index);
>>> extern struct phy *devm_of_phy_get(struct device *dev, int index);
>>
>> My bad :-(
>>
>>> BTW, I think "extern" could be dropped, does it have any significance in
>>> function declarations in header files ?
>>
>> it shouldn't have any effect I guess. It's harmless nevertheless.
>
> Yup.
>
>>>> +struct phy *of_phy_get_byname(struct device *dev, const char *string);
>>>> +struct phy *devm_of_phy_get_byname(struct device *dev, const char
>>>> *string);
>
>>>> --- /dev/null
>>>> +++ b/drivers/phy/phy-core.c
>>>> @@ -0,0 +1,616 @@
>>>
>>>> +static struct phy *of_phy_lookup(struct device_node *node)
>>>> +{
>>>> +    struct phy *phy;
>>>> +    struct device *dev;
>>>> +    struct class_dev_iter iter;
>>>> +
>>>> +    class_dev_iter_init(&iter, phy_class, NULL, NULL);
>>>
>>> There is currently nothing preventing a call to this function before
>>> phy_class is initialized. It happened during my tests, and the nice
>>> stack dump showed that it was the PHY user attempting to get the PHY
>>> before the framework got initialized. So either phy_core_init should
>>> be a subsys_initcall or we need a better protection against phy_class
>>> being NULL or ERR_PTR in more places.
>>
>> Whats the initcall used in your PHY user? Looks like more people prefer having
>
> It happened in a regular platform driver probe() callback.
>
>> module_init and any issue because of it should be fixed in PHY providers and
>> PHY users.
>
> OK. In fact this uncovered some issues in the MIPI DSI interface driver
> (some unexpected interrupts). But this should just be fixed in those drivers
> anyway. Now the DSI interface driver needs to wait for the PHY to be
> registered and ready, so the initialization will likely be changed regardless
> the framework initializes in module_init or earlier.
>
>>>> +    while ((dev = class_dev_iter_next(&iter))) {
>>>> +        phy = container_of(dev, struct phy, dev);
>>>> +        if (node != phy->of_node)
>>>> +            continue;
>>>> +
>>>> +        class_dev_iter_exit(&iter);
>>>> +        return phy;
>>>> +    }
>>>> +
>>>> +    class_dev_iter_exit(&iter);
>>>> +    return ERR_PTR(-EPROBE_DEFER);
>>>> +}
>>>
>>>> +/**
>>>> + * of_phy_get() - lookup and obtain a reference to a phy by phandle
>>>> + * @dev: device that requests this phy
>>>> + * @index: the index of the phy
>>>> + *
>>>> + * Returns the phy associated with the given phandle value,
>>>> + * after getting a refcount to it or -ENODEV if there is no such phy or
>>>> + * -EPROBE_DEFER if there is a phandle to the phy, but the device is
>>>> + * not yet loaded.
>>>> + */
>>>> +struct phy *of_phy_get(struct device *dev, int index)
>>>> +{
>>>> +    int ret;
>>>> +    struct phy *phy = NULL;
>>>> +    struct phy_bind *phy_map = NULL;
>>>> +    struct of_phandle_args args;
>>>> +    struct device_node *node;
>>>> +
>>>> +    if (!dev->of_node) {
>>>> +        dev_dbg(dev, "device does not have a device node entry\n");
>>>> +        return ERR_PTR(-EINVAL);
>>>> +    }
>>>> +
>>>> +    ret = of_parse_phandle_with_args(dev->of_node, "phys", "#phy-cells",
>>>> +        index,&args);
>>>> +    if (ret) {
>>>> +        dev_dbg(dev, "failed to get phy in %s node\n",
>>>> +            dev->of_node->full_name);
>>>> +        return ERR_PTR(-ENODEV);
>>>> +    }
>>>> +
>>>> +    phy = of_phy_lookup(args.np);
>>>> +    if (IS_ERR(phy) || !try_module_get(phy->ops->owner)) {
>>>> +        phy = ERR_PTR(-EPROBE_DEFER);
>>>> +        goto err0;
>>>> +    }
>>>> +
>>>> +    phy = phy->ops->of_xlate(phy,&args);
>>>> +    if (IS_ERR(phy))
>>>> +        goto err1;
>>>> +
>>>> +    phy_map = phy_bind(dev_name(dev), index, dev_name(&phy->dev));
>>>> +    if (!IS_ERR(phy_map)) {
>>>> +        phy_map->phy = phy;
>>>> +        phy_map->auto_bind = true;
>>>
>>> Shouldn't it be done with the phy_bind_mutex held ? I guess an unlocked
>>> version of the phy_bind functions is needed, so it can be used internally.
>>
>> The locking is done inside phy_bind function. I'm not sure but I vaguely
>> remember getting a dump stack (incorrect mutex ordering or something) when
>> trying to have phy_bind_mutex here. I can check it again.
>
> Yes, IIUC the locking needs to be reworked a bit so "phy_map" is not modified
> without the mutex held.
>
>>>> +    }
>>>> +
>>>> +    get_device(&phy->dev);
>>>> +
>>>> +err1:
>>>> +    module_put(phy->ops->owner);
>>>> +
>>>> +err0:
>>>> +    of_node_put(node);
>>>> +
>>>> +    return phy;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(of_phy_get);
>>>
>>>> +/**
>>>> + * phy_create() - create a new phy
>>>> + * @dev: device that is creating the new phy
>>>> + * @label: label given to phy
>>>> + * @of_node: device node of the phy
>>>> + * @type: specifies the phy type
>>>> + * @ops: function pointers for performing phy operations
>>>> + * @priv: private data from phy driver
>>>> + *
>>>> + * Called to create a phy using phy framework.
>>>> + */
>>>> +struct phy *phy_create(struct device *dev, const char *label,
>>>> +    struct device_node *of_node, int type, struct phy_ops *ops,
>>>> +    void *priv)
>>>> +{
>>>> +    int ret;
>>>> +    struct phy *phy;
>>>> +    struct phy_bind *phy_bind;
>>>> +    const char *devname = NULL;
>>>> +
>>>> +    if (!dev) {
>>>> +        dev_err(dev, "no device provided for PHY\n");
>>>> +        ret = -EINVAL;
>>>> +        goto err0;
>>>> +    }
>>>> +
>>>> +    if (!ops || !ops->of_xlate || !priv) {
>>>> +        dev_err(dev, "no PHY ops/PHY data provided\n");
>>>> +        ret = -EINVAL;
>>>> +        goto err0;
>>>> +    }
>>>> +
>>>> +    if (!phy_class)
>>>> +        phy_core_init();
>>>> +
>>>> +    phy = kzalloc(sizeof(*phy), GFP_KERNEL);
>>>> +    if (!phy) {
>>>> +        ret = -ENOMEM;
>>>> +        goto err0;
>>>> +    }
>>>> +
>>>> +    devname = dev_name(dev);
>>>
>>> OK, a sort of serious issue here is that you can't call phy_create()
>>> multiple times for same PHY provider device.
>>
>> Ah.. right.
>
> The other question I had was why we needed struct device object for each
> PHY. And if it wouldn't be enough to have only struct device * in
> struct phy, and the real device would be the PHY provider only. I might

IMO the PHY framework shouldn't be modifying the properties of device of 
some other driver. Consider the case where the PHY provider implements 
other functions also (in addition to acting as a PHY) in which case the 
PHY provider entirely doesn't come under *PHY class* but only a part of 
it. There might be other reasons as well.

Thanks
Kishon



^ permalink raw reply

* [PATCH] bridge-utils: fix AF_LOCAL socket leaks
From: Devendra Naga @ 2013-04-04 11:36 UTC (permalink / raw)
  To: Stephen Hemminger, netdev; +Cc: Devendra Naga

valgrind reported the following leak on fc18 system:

valgrind -v --leak-check=full --track-fds=yes ./brctl show

==27307==
==27307== FILE DESCRIPTORS: 4 open at exit.
==27307== Open AF_UNIX socket 3: <unknown>
==27307==    at 0x397B2F3617: socket (in /usr/lib64/libc-2.16.so)
==27307==    by 0x403CF7: br_init (libbridge_init.c:35)
==27307==    by 0x4010BF: main (brctl.c:63)

fix it by calling br_shutdown

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 brctl/brctl.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/brctl/brctl.c b/brctl/brctl.c
index 46ca352..16080a7 100644
--- a/brctl/brctl.c
+++ b/brctl/brctl.c
@@ -43,6 +43,7 @@ int main(int argc, char *const* argv)
 		{ .name = "version", .val = 'V' },
 		{ 0 }
 	};
+	int ret;
 
 	while ((f = getopt_long(argc, argv, "Vh", options, NULL)) != EOF) 
 		switch(f) {
@@ -79,7 +80,10 @@ int main(int argc, char *const* argv)
 		return 1;
 	}
 
-	return cmd->func(argc, argv);
+	ret = cmd->func(argc, argv);
+	br_shutdown();
+
+	return ret;
 
 help:
 	help();
-- 
1.8.1.4

^ permalink raw reply related

* [net-next 01/13] ixgbe: Mask off check of frag_off as we only want fragment offset
From: Jeff Kirsher @ 2013-04-04 11:37 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1365075480-20183-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

We were incorrectly checking the entire frag_off field when we only wanted the
fragment offset.  As a result we were not pulling in TCP headers when the DNF
flag was set.

To correct that we will now check for frag off using the IP_OFFSET mask.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e56a3d1..5cee5de 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1337,7 +1337,7 @@ static unsigned int ixgbe_get_headlen(unsigned char *data,
 			return hdr.network - data;
 
 		/* record next protocol if header is present */
-		if (!hdr.ipv4->frag_off)
+		if (!(hdr.ipv4->frag_off & htons(IP_OFFSET)))
 			nexthdr = hdr.ipv4->protocol;
 	} else if (protocol == __constant_htons(ETH_P_IPV6)) {
 		if ((hdr.network - data) > (max_len - sizeof(struct ipv6hdr)))
-- 
1.7.11.7

^ permalink raw reply related

* [net-next 03/13] ixgbe: Drop check for PAGE_SIZE from ixgbe_xmit_frame_ring
From: Jeff Kirsher @ 2013-04-04 11:37 UTC (permalink / raw)
  To: davem; +Cc: Alexander Duyck, netdev, gospo, sassmann, Jeff Kirsher
In-Reply-To: <1365075480-20183-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>

The check for PAGE_SIZE is pointless now that the default configuration is to
allocate 32K for all buffers.  Since the Tx descriptor limit is 16K we can
just drop the check and always compare the descriptors to the maximum size
supported.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5cee5de..d49a148d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6425,9 +6425,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 	struct ixgbe_tx_buffer *first;
 	int tso;
 	u32 tx_flags = 0;
-#if PAGE_SIZE > IXGBE_MAX_DATA_PER_TXD
 	unsigned short f;
-#endif
 	u16 count = TXD_USE_COUNT(skb_headlen(skb));
 	__be16 protocol = skb->protocol;
 	u8 hdr_len = 0;
@@ -6439,12 +6437,9 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 	 *       + 1 desc for context descriptor,
 	 * otherwise try next time
 	 */
-#if PAGE_SIZE > IXGBE_MAX_DATA_PER_TXD
 	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
 		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
-#else
-	count += skb_shinfo(skb)->nr_frags;
-#endif
+
 	if (ixgbe_maybe_stop_tx(tx_ring, count + 3)) {
 		tx_ring->tx_stats.tx_busy++;
 		return NETDEV_TX_BUSY;
-- 
1.7.11.7

^ permalink raw reply related

* [net-next 00/13][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2013-04-04 11:37 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann

This series contains updates to ixgbe and igb.

For ixgbe (and igb), Alex fixes an issue where we were incorrectly checking
the entire frag_off field when we only wanted the fragment offset.  Alex
also provides a patch to drop the check for PAGE_SIZE in transmit since the
default configuration is to allocate 32k for all buffers.

Emil provides the third ixgbe patch with is a simple change to make the
calculation of eerd consistent between the read and write functions
by using | instead of + for IXGBE_EEPROM_RW_REG_START.

The remaining patches in the series are against igb, the largest being
my patch to cleanup code comments and whitespace to align the igb
driver with the networking style of code comments.  While cleaning up the
code comments, fixed several other whitespace/checkpatch.pl code
formatting issues.

Other notable igb patches are the added support for 100base-fx SFP,
added support for reading & exporting SFP data over i2c, and on EEE
capable devices, query the PHY to determine what the link partner
is advertising.

The following are changes since commit d66248326410ed0d3e813ebe974b3e6638df0717:
  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master

Akeem G. Abodunrin (5):
  igb: Support for 100base-fx SFP
  igb: Support to read and export SFF-8472/8079 data
  igb: Implement support to power sfp cage and turn on I2C
  igb: random code and comments fix
  igb: Fix sparse warnings on function pointers

Alexander Duyck (5):
  ixgbe: Mask off check of frag_off as we only want fragment offset
  ixgbe: Drop check for PAGE_SIZE from ixgbe_xmit_frame_ring
  igb: Mask off check of frag_off as we only want fragment offset
  igb: Pull adapter out of main path in igb_xmit_frame_ring
  igb: Use rx/tx_itr_setting when setting up initial value of itr

Emil Tantilov (1):
  ixgbe: don't do arithmetic operations on bitmasks

Jeff Kirsher (1):
  igb: Fix code comments and whitespace

Matthew Vick (1):
  igb: Enable EEE LP advertisement

 drivers/net/ethernet/intel/igb/e1000_82575.c    |  130 +--
 drivers/net/ethernet/intel/igb/e1000_82575.h    |    1 +
 drivers/net/ethernet/intel/igb/e1000_defines.h  |   34 +-
 drivers/net/ethernet/intel/igb/e1000_hw.h       |   53 +-
 drivers/net/ethernet/intel/igb/e1000_i210.c     |   93 +-
 drivers/net/ethernet/intel/igb/e1000_i210.h     |    4 +
 drivers/net/ethernet/intel/igb/e1000_mac.c      |  111 +--
 drivers/net/ethernet/intel/igb/e1000_mac.h      |   17 +-
 drivers/net/ethernet/intel/igb/e1000_mbx.c      |   11 +-
 drivers/net/ethernet/intel/igb/e1000_mbx.h      |   52 +-
 drivers/net/ethernet/intel/igb/e1000_nvm.c      |   25 +-
 drivers/net/ethernet/intel/igb/e1000_phy.c      |  258 ++---
 drivers/net/ethernet/intel/igb/e1000_regs.h     |   49 +-
 drivers/net/ethernet/intel/igb/igb.h            |  132 +--
 drivers/net/ethernet/intel/igb/igb_ethtool.c    |  299 ++++--
 drivers/net/ethernet/intel/igb/igb_hwmon.c      |   29 +-
 drivers/net/ethernet/intel/igb/igb_main.c       | 1187 ++++++++++++-----------
 drivers/net/ethernet/intel/igb/igb_ptp.c        |   57 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |    9 +-
 20 files changed, 1323 insertions(+), 1230 deletions(-)

-- 
1.7.11.7

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox