* Re: [PATCH 2/2] af_unix: If we don't care about credentials coallesce all messages
From: dingtianhong @ 2013-04-04 7:56 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David S. Miller, Sven Joachim, Greg Kroah-Hartman, linux-kernel,
stable, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
Eric Dumazet
In-Reply-To: <87d2ubhwiw.fsf_-_@xmission.com>
On 2013/4/4 10:14, Eric W. Biederman wrote:
>
> It was reported that the following LSB test case failed
> https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we
> were not coallescing unix stream messages when the application was
> expecting us to.
>
> The problem was that the first send was before the socket was accepted
> and thus sock->sk_socket was NULL in maybe_add_creds, and the second
> send after the socket was accepted had a non-NULL value for sk->socket
> and thus we could tell the credentials were not needed so we did not
> bother.
>
> The unnecessary credentials on the first message cause
> unix_stream_recvmsg to start verifying that all messages had the same
> credentials before coallescing and then the coallescing failed because
> the second message had no credentials.
>
> Ignoring credentials when we don't care in unix_stream_recvmsg fixes a
> long standing pessimization which would fail to coallesce messages when
> reading from a unix stream socket if the senders were different even if
> we did not care about their credentials.
>
> I have tested this and verified that the in the LSB test case mentioned
> above that the messages do coallesce now, while the were failing to
> coallesce without this change.
>
> Reported-by: Karel Srot <ksrot@redhat.com>
> Reported-by: Ding Tianhong <dingtianhong@huawei.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> net/unix/af_unix.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index f153a8d..2db702d 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1993,7 +1993,7 @@ again:
> if ((UNIXCB(skb).pid != siocb->scm->pid) ||
> (UNIXCB(skb).cred != siocb->scm->cred))
> break;
> - } else {
> + } else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
> /* Copy credentials */
> scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
> check_creds = 1;
>
As your opinion, I think the way is better:
if (test_bit(SOCK_PASSCRED, &sock->flags)) {
if (check_creds) {
/* Never glue messages from different writers */
if ((UNIXCB(skb).pid != siocb->scm->pid) ||
(UNIXCB(skb).cred != siocb->scm->cred))
break;
} else {
/* Copy credentials */
scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
check_creds = 1;
}
}
Ding
^ permalink raw reply
* [net-next PATCH V2] net: frag queue per hash bucket locking
From: Jesper Dangaard Brouer @ 2013-04-04 7:52 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller
Cc: Jesper Dangaard Brouer, netdev, Florian Westphal, Daniel Borkmann,
Hannes Frederic Sowa
In-Reply-To: <1365027060.12728.30.camel@localhost>
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduce the
readers-writer lock to a rebuild lock.
V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
need the spinlock _bh versions, as Netfilter currently does a
local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail
This patch is part of "net: frag performance followup"
http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:
Same test setup as previous:
(http://thread.gmane.org/gmane.linux.network/257155)
Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
Ethernet flow-control. A third interface is used for generating the
DoS attack (with trafgen).
Notice, I have changed the frag DoS generator script to be more
efficient/deadly. Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.
Test types summary (netperf UDP_STREAM):
Test-20G64K == 2x10G with 65K fragments
Test-20G3F == 2x10G with 3x fragments (3*1472 bytes)
Test-20G64K+DoS == Same as 20G64K with frag DoS
Test-20G3F+DoS == Same as 20G3F with frag DoS
Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS
Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS
When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression. BUT this
was caused by some unrelated change in-between. See tests below.
Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch
Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.
Performance table summary (in Mbit/s):
(#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ
---------- ------- ------- ---------- --------- -------- -------
(A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2
(B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2
(C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6
(D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2
(E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6
(F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3
Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.
I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried). But the other results does show
improvements. And test (E) "with _bh" version is slightly better.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 9 ++++++-
net/ipv4/inet_fragment.c | 60 ++++++++++++++++++++++++++++++++++++----------
2 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7cac9c5..c4f5183 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -50,10 +50,17 @@ struct inet_frag_queue {
*/
#define INETFRAGS_MAXDEPTH 128
+struct inet_frag_bucket {
+ struct hlist_head chain;
+ spinlock_t chain_lock;
+ u16 chain_len;
+};
+
struct inet_frags {
- struct hlist_head hash[INETFRAGS_HASHSZ];
+ struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
/* This rwlock is a global lock (seperate per IPv4, IPv6 and
* netfilter). Important to keep this on a seperate cacheline.
+ * Its primarily a rebuild protection rwlock.
*/
rwlock_t lock ____cacheline_aligned_in_smp;
int secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1206ca6..2bf15e8 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -52,20 +52,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
unsigned long now = jiffies;
int i;
+ /* Per bucket lock NOT needed here, due to write lock protection */
write_lock(&f->lock);
+
get_random_bytes(&f->rnd, sizeof(u32));
for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
struct hlist_node *n;
- hlist_for_each_entry_safe(q, n, &f->hash[i], list) {
+ hb = &f->hash[i];
+ hlist_for_each_entry_safe(q, n, &hb->chain, list) {
unsigned int hval = f->hashfn(q);
if (hval != i) {
+ struct inet_frag_bucket *hb_dest;
+
hlist_del(&q->list);
/* Relink to new hash chain. */
- hlist_add_head(&q->list, &f->hash[hval]);
+ hb_dest = &f->hash[hval];
+ hlist_add_head(&q->list, &hb_dest->chain);
}
}
}
@@ -78,9 +85,13 @@ void inet_frags_init(struct inet_frags *f)
{
int i;
- for (i = 0; i < INETFRAGS_HASHSZ; i++)
- INIT_HLIST_HEAD(&f->hash[i]);
+ for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb = &f->hash[i];
+ spin_lock_init(&hb->chain_lock);
+ INIT_HLIST_HEAD(&hb->chain);
+ hb->chain_len = 0;
+ }
rwlock_init(&f->lock);
f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -122,9 +133,19 @@ EXPORT_SYMBOL(inet_frags_exit_net);
static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
{
- write_lock(&f->lock);
+ struct inet_frag_bucket *hb;
+ unsigned int hash;
+
+ read_lock(&f->lock);
+ hash = f->hashfn(fq);
+ hb = &f->hash[hash];
+
+ spin_lock(&hb->chain_lock);
hlist_del(&fq->list);
- write_unlock(&f->lock);
+ hb->chain_len--;
+ spin_unlock(&hb->chain_lock);
+
+ read_unlock(&f->lock);
inet_frag_lru_del(fq);
}
@@ -226,27 +247,32 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
struct inet_frag_queue *qp_in, struct inet_frags *f,
void *arg)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *qp;
#ifdef CONFIG_SMP
#endif
unsigned int hash;
- write_lock(&f->lock);
+ read_lock(&f->lock); /* Protects against hash rebuild */
/*
* While we stayed w/o the lock other CPU could update
* the rnd seed, so we need to re-calculate the hash
* chain. Fortunatelly the qp_in can be used to get one.
*/
hash = f->hashfn(qp_in);
+ hb = &f->hash[hash];
+ spin_lock(&hb->chain_lock);
+
#ifdef CONFIG_SMP
/* With SMP race we have to recheck hash table, because
* such entry could be created on other cpu, while we
- * promoted read lock to write lock.
+ * released the hash bucket lock.
*/
- hlist_for_each_entry(qp, &f->hash[hash], list) {
+ hlist_for_each_entry(qp, &hb->chain, list) {
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
- write_unlock(&f->lock);
+ spin_unlock(&hb->chain_lock);
+ read_unlock(&f->lock);
qp_in->last_in |= INET_FRAG_COMPLETE;
inet_frag_put(qp_in, f);
return qp;
@@ -258,8 +284,10 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
atomic_inc(&qp->refcnt);
- hlist_add_head(&qp->list, &f->hash[hash]);
- write_unlock(&f->lock);
+ hlist_add_head(&qp->list, &hb->chain);
+ hb->chain_len++;
+ spin_unlock(&hb->chain_lock);
+ read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
@@ -300,17 +328,23 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
struct inet_frags *f, void *key, unsigned int hash)
__releases(&f->lock)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
int depth = 0;
- hlist_for_each_entry(q, &f->hash[hash], list) {
+ hb = &f->hash[hash];
+
+ spin_lock(&hb->chain_lock);
+ hlist_for_each_entry(q, &hb->chain, list) {
if (q->net == nf && f->match(q, key)) {
atomic_inc(&q->refcnt);
+ spin_unlock(&hb->chain_lock);
read_unlock(&f->lock);
return q;
}
depth++;
}
+ spin_unlock(&hb->chain_lock);
read_unlock(&f->lock);
if (depth <= INETFRAGS_MAXDEPTH)
^ permalink raw reply related
* Re: [PATCH 1/2] Revert "af_unix: dont send SCM_CREDENTIAL when dest socket is NULL"
From: dingtianhong @ 2013-04-04 7:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David S. Miller, Sven Joachim, Greg Kroah-Hartman, linux-kernel,
stable, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
Eric Dumazet
In-Reply-To: <87li8zhwkw.fsf_-_@xmission.com>
On 2013/4/4 10:13, Eric W. Biederman wrote:
>
> This reverts commit 14134f6584212d585b310ce95428014b653dfaf6.
>
> The problem that the above patch was meant to address is that af_unix
> messages are not being coallesced because we are sending unnecesarry
> credentials. Not sending credentials in maybe_add_creds totally
> breaks unconnected unix domain sockets that wish to send credentails
> to other sockets.
>
thanks for check the question and make a fix solution, but I still doubt that if unconnected unix
domain socket wish to send credentails to oher sockets, why dont set
SOCK_PASSCRED on sock->flags, I think the user need to decide the param
and shouldnt send creds by default way.
Ding
> In practice this break some versions of udev because they receive a
> message and the sending uid is bogus so they drop the message.
>
> Cc: stable@vger.kernel.org
> Reported-by: Sven Joachim <svenjoac@gmx.de>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> net/unix/af_unix.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 971282b..f153a8d 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -1412,8 +1412,8 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
> if (UNIXCB(skb).cred)
> return;
> if (test_bit(SOCK_PASSCRED, &sock->flags) ||
> - (other->sk_socket &&
> - test_bit(SOCK_PASSCRED, &other->sk_socket->flags))) {
> + !other->sk_socket ||
> + test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
> UNIXCB(skb).pid = get_pid(task_tgid(current));
> UNIXCB(skb).cred = get_current_cred();
> }
>
^ permalink raw reply
* Re: [PATCH v5 1/6] drivers: phy: add generic PHY framework
From: Felipe Balbi @ 2013-04-04 7:15 UTC (permalink / raw)
To: Stephen Warren
Cc: broonie-yzvPICuk2AATkU/dhu1WVueM+bqZidxxQQ4Iyu8u01E,
mchehab-H+wXaHxf7aLQT0dZR+AlfA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-lFZ/pmaqli7XmaaqVzeoHQ, Kishon Vijay Abraham I,
javier-0uQlZySMnqxg9hUCZPvPmw, cesarb-PWySMVKUnqmsTnJN9+BGXg,
eballetbo-Re5JQEeQqe8AvxtiuMwx3w,
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
rob.herring-bsGFqQB8/DxBDgjK7y7TUQ,
swarren-DDmLM1+adcrQT0dZR+AlfA,
sylvester.nawrocki-Re5JQEeQqe8AvxtiuMwx3w,
linux-omap-u79uwXL29TY76Z2rM5mHXA,
linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
linux-usb-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, balbi-l0cyMroinI0,
santosh.shilimkar-l0cyMroinI0, netdev-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <515CC123.4060402-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 494 bytes --]
Hi,
On Wed, Apr 03, 2013 at 05:54:11PM -0600, Stephen Warren wrote:
> struct phy {
> struct device *dev;
> struct module *owner;
> int (*init)(struct phy *phy);
> int (*exit)(struct phy *phy);
> int (*suspend)(struct phy *phy);
> int (*resume)(struct phy *phy);
I wonder whether it makes sense to provide ->suspend/->resume callbacks
here. We already have dev_pm_ops providing those methods and we can just
wrap pm_runtime_* calls to be called by consumers.
--
balbi
[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
[-- Attachment #2: Type: text/plain, Size: 192 bytes --]
_______________________________________________
devicetree-discuss mailing list
devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
https://lists.ozlabs.org/listinfo/devicetree-discuss
^ permalink raw reply
* Re: [PATCH] af_unix: fix build warning
From: dingtianhong @ 2013-04-04 7:00 UTC (permalink / raw)
To: Rami Rosen; +Cc: David S. Miller, Eric Dumazet, Netdev, Li Zefan
In-Reply-To: <CAKoUArk67UyqAC26t7JCfL_LZWHAO1qB=dH=chaXHd-jsxrPfg@mail.gmail.com>
On 2013/4/3 20:27, Rami Rosen wrote:
> Hi,
> Which version of gcc are you using ?
> A patch like yours, for the same method, unix_bind(), was sent in the
> past, and Stephen Hemminger noted the
> with gcc 4.7 the warning you got does not occur.
> see:
> http://permalink.gmane.org/gmane.linux.network/247807
>
> Best Regards,
>
> Rami Rosen
> http://ramirose.wix.com/ramirosen
>
ok,thanks
>
> On Wed, Apr 3, 2013 at 12:31 PM, dingtianhong <dingtianhong@huawei.com> wrote:
>> net/unix/af_unix.c: In function ‘unix_bind’:
>> net/unix/af_unix.c:892: warning: ‘path.mnt’ may be used uninitialized in this function
>> net/unix/af_unix.c:892: warning: ‘path.dentry’ may be used uninitialized in this function
>>
>> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>> ---
>> net/unix/af_unix.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index 971282b..3ccc049 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -889,7 +889,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>> atomic_set(&addr->refcnt, 1);
>>
>> if (sun_path[0]) {
>> - struct path path;
>> + struct path path = {};
>> umode_t mode = S_IFSOCK |
>> (SOCK_INODE(sock)->i_mode & ~current_umask());
>> err = unix_mknod(sun_path, mode, &path);
>> --
>> 1.8.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply
* Re: [net-next.git 1/7] stmmac: review napi gro support
From: Giuseppe CAVALLARO @ 2013-04-04 6:20 UTC (permalink / raw)
To: Ben Hutchings; +Cc: netdev, Eric Dumazet
In-Reply-To: <1365004913.2897.5.camel@bwh-desktop.uk.solarflarecom.com>
On 4/3/2013 6:01 PM, Ben Hutchings wrote:
> On Wed, 2013-04-03 at 07:41 +0200, Giuseppe CAVALLARO wrote:
>> This patch is to:
>> o use napi_gro_flush() before calling __napi_complete()
>
> napi_complete() already does that, and some other important things. Why
> do you think it makes sense to call __napi_complete() directly?
yes Ben you are right. In fact Eric already alerted me on this.
It is my fault, I'll remove this in my next set.
>
> Ben.
>
>> o turn on NETIF_F_GRO by default
>>
>> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
>> ---
>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +++--
>> 1 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 6b26d31..8b69e3b 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -2046,7 +2046,8 @@ static int stmmac_poll(struct napi_struct *napi, int budget)
>>
>> work_done = stmmac_rx(priv, budget);
>> if (work_done < budget) {
>> - napi_complete(napi);
>> + napi_gro_flush(napi, false);
>> + __napi_complete(napi);
>> stmmac_enable_dma_irq(priv);
>> }
>> return work_done;
>> @@ -2586,7 +2587,7 @@ struct stmmac_priv *stmmac_dvr_probe(struct device *device,
>>
>> ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
>> NETIF_F_RXCSUM;
>> - ndev->features |= ndev->hw_features | NETIF_F_HIGHDMA;
>> + ndev->features |= ndev->hw_features | NETIF_F_HIGHDMA | NETIF_F_GRO;
>> ndev->watchdog_timeo = msecs_to_jiffies(watchdog);
>> #ifdef STMMAC_VLAN_TAG_USED
>> /* Both mac100 and gmac support receive VLAN tag detection */
>
^ permalink raw reply
* Re: [net-next.git 1/7] stmmac: review napi gro support
From: Giuseppe CAVALLARO @ 2013-04-04 6:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1364996374.13853.15.camel@edumazet-glaptop>
On 4/3/2013 3:39 PM, Eric Dumazet wrote:
> On Wed, 2013-04-03 at 09:41 +0200, Giuseppe CAVALLARO wrote:
>
>> Hmm, I'm in trouble on this :-). Indeed I can understand the (fatal)
>> race and why napi_complete should be used. Sorry! So my fault and this
>> patch has to be discarded. I don't understand why I have not seen any
>> problems while running/stressing on SMP system. Have you got any idea?
>>
>
> So because you don't hit the race on your machine and your tests, we can
> remove all the needed spinlock() and various hard irq masking, and
> introduce all sort of races ?
No, no this was not my intention. If fact, I asked to discard the patch
;-) ... it is clear to me the race that can happen in that case.
I'll rework the patch to only add the GRO in the feature.
peppe
>
> Try to use a combination of two NICS, and you'll hit the bug even on UP.
>
> There is a single poll_list per cpu, and we insert new elements in this
> list under hard irq.
>
> So deleting elements _must_ be done under the protection of hard irq
> masking.
>
>
>
>
>
^ permalink raw reply
* Re: [net-next.git 3/7] stmmac: review private structure fields
From: Giuseppe CAVALLARO @ 2013-04-04 6:11 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Eric Dumazet, netdev
In-Reply-To: <1365005352.2897.12.camel@bwh-desktop.uk.solarflarecom.com>
On 4/3/2013 6:09 PM, Ben Hutchings wrote:
> On Wed, 2013-04-03 at 09:33 +0200, Giuseppe CAVALLARO wrote:
>> On 4/3/2013 9:21 AM, Eric Dumazet wrote:
>>> On Wed, 2013-04-03 at 07:41 +0200, Giuseppe CAVALLARO wrote:
>>>> recently many new supports have been added in the stmmac driver w/o taking care
>>>> about where each new field had to be placed inside the private structure for
>>>> guaranteeing the best cache usage.
>>>> This is what I wanted in the beginning, so this patch reorganizes all the fields
>>>> in order to keep adjacent fields for cache effect.
>>>> I have also tried to optimize them by using pahole.
>>>>
>>>> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
>>>> ---
>>>> drivers/net/ethernet/stmicro/stmmac/stmmac.h | 70 +++++++++++++-------------
>>>> 1 files changed, 35 insertions(+), 35 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>>>> index 75f997b..8aa28c5 100644
>>>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>>>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
>>>> @@ -35,36 +35,45 @@
>>>>
>>>> struct stmmac_priv {
>>>> /* Frequently used values are kept adjacent for cache effect */
>>>> - struct dma_desc *dma_tx ____cacheline_aligned; /* Basic TX desc */
>>>> - struct dma_extended_desc *dma_etx; /* Extended TX descriptor */
>>>> - dma_addr_t dma_tx_phy;
>>>> - struct sk_buff **tx_skbuff;
>>>> - dma_addr_t *tx_skbuff_dma;
>>>> + struct dma_extended_desc *dma_etx;
>>>> + struct dma_desc *dma_tx ____cacheline_aligned_in_smp;
>>>> + struct sk_buff **tx_skbuff ____cacheline_aligned_in_smp;
>>>
>>> dma_tx & tx_skbuff are readonly, why put them in separate cache lines ?
>>
>> I put tx_skbuff in a separate cache line because, when we use extended
>> descriptors, the driver works with dma_etx instead of dma_tx.
>> So my idea was to have both dma_etx, dma_tx and tx_skbuff aligned in
>> any case.
> [...]
>
> It's generally not that important to put fields at the beginning of a
> cache line. (If you've measured with typical systems using stmmac and
> found an advantage, then I accept that you have a good reason to do
> this. But that advantage is unlikely to be specific to SMP systems.)
That is the point. I had seen an improvement when testing on SH4
platforms if these pointers were at the beginning of a cache line.
> I would use ____cacheline_aligned_in_smp to separate fields that are
> likely to be changed on different CPUs, so that the cache line doesn't
> bounce between their caches. Fields that are rarely modified (such as
> these pointers), or are used together on the same CPU should be packed
> together into a small number of cache lines.
Thx Ben for this explanation. Let me do some other tests on SMP ARM.
I'll rework this patch trying to balance the "abuse" of
____cacheline_aligned_in_smp and the best performances (I can re-test
on ARM and SH4).
peppe
>
> Ben.
>
^ permalink raw reply
* Re: [net-next.git 2/7] stmmac: review barriers
From: Giuseppe CAVALLARO @ 2013-04-04 6:06 UTC (permalink / raw)
To: Shiraz HASHIM; +Cc: netdev@vger.kernel.org, Deepak Sikri, sergei.shtylyov
In-Reply-To: <20130403085106.GA2039@localhost.localdomain>
Ciao Shiraz!
On 4/3/2013 10:51 AM, Shiraz HASHIM wrote:
> Hi Giuseppe,
>
> On Wed, Apr 03, 2013 at 01:41:24PM +0800, Giuseppe CAVALLARO wrote:
>> In all my tests performed on SH4 and ARM A9 platforms, I've never met problems
>> that can be fixed by using memory barriers. In the past there was some issues
>> on SMP ARM but fixed by reviewing xmit spinlock.
>
> The problem which was addressed was not because of SMP IMO. It was
> rather due to the fact that the write to the GMAC descriptor (which
> happens to be in normal memory) has to be ordered with respect to GMAC
> DMA as observer. Isn't it ?
Hmm yes you are right, now I remember that this was a code reordering
issue especially when we had looked at the commit:
stmmac: add memory barriers at appropriate places
eb0dc4bb2e22c04964d
>> Further barriers have been added in the commits too: 8e83989106562326bf
>>
>> This patch is to use the smp_wbm instead of wbm because the driver
> ^^^^^^^^^^^^^^^^^^^^^^
> Perhaps you meant smp_wmb and wmb
sure.
from the commit:
stmmac: Fix for nfs hang on multiple reboot
8e83989106562326bf
I had not understand if the problem was related to the SMP or to the
code ordering.
>
>> runs on UP systems. Then, IMO it could make sense to only maintain the barriers
>> just in places where we touch the dma owner bits (that is the
>> only real critical path as we had seen and fixed in the commit:
>> eb0dc4bb2e22c04964d).
>
> Replacing wmb by smp_wmb may not be a good idea as we need to order
> the store transaction to the descriptor with respect to GMAC DMA and
> using smp_* version would just be compiler barrier in uniprocessor
> systems.
Yes this what I wanted although the main point remains pending.
On my side, on SH4 (UP) and ARM (SMP) with several different
compiler and flags I have never seen problems and no barriers
are needed.
Especially in the commit "stmmac: Fix for nfs hang on multiple reboot"
the description of the problem looks to be quite obscure and I cannot
find any "particular" relation with extra barrier.
In fact, if we can demonstrate that barriers are needed no problem to
keep them in the code. Otherwise I prefer to remove them.
What do you think?
Cheers
peppe
>> Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
>> Cc: Deepak Sikri <deepak.sikri@st.com>
>> Cc: Shiraz Hashim <shiraz.hashim@st.com>
>> ---
>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 +++------
>> 1 files changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index 8b69e3b..c92dcbc 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -1797,15 +1797,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>> priv->tx_skbuff[entry] = NULL;
>> priv->hw->desc->prepare_tx_desc(desc, 0, len, csum_insertion,
>> priv->mode);
>> - wmb();
>> + smp_wmb();
>> priv->hw->desc->set_tx_owner(desc);
>> - wmb();
>
> Since it is a loop, shouldn't we ensure that the ownership of a tx
> descriptor is set before next descriptor in chain is programmed ?
>
>> }
>>
>> /* Finalize the latest segment. */
>> priv->hw->desc->close_tx_desc(desc);
>>
>> - wmb();
>> /* According to the coalesce parameter the IC bit for the latest
>> * segment could be reset and the timer re-started to invoke the
>> * stmmac_tx function. This approach takes care about the fragments.
>> @@ -1821,9 +1819,9 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>> } else
>> priv->tx_count_frames = 0;
>>
>> + smp_wmb();
>
> Please reconsider, may be keeping wmb is better.
>
>> /* To avoid raise condition */
>> priv->hw->desc->set_tx_owner(first);
>> - wmb();
>
> Not sure about this, perhaps can be removed.
>
>>
>> priv->cur_tx++;
>>
>> @@ -1899,9 +1897,8 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv)
>>
>> RX_DBG(KERN_INFO "\trefill entry #%d\n", entry);
>> }
>> - wmb();
>> + smp_wmb();
>> priv->hw->desc->set_rx_owner(p);
>> - wmb();
>
> Similarly this is a part of a loop, we need to see if set rx owner
> should be reflected before next descriptor program.
>
> --
> regards
> Shiraz
>
^ permalink raw reply
* [net-next] stmmac: modified pcs mode support for SGMII
From: Byungho An @ 2013-04-04 5:57 UTC (permalink / raw)
To: netdev
Cc: 'Giuseppe CAVALLARO', '김국진',
cpgs
This patch modifies the pcs mode support for SGMII. Even though
SGMII does auto-negotiation with phy, it needs stmmac_init_phy and
stmmac_mdio_register function for initializing phy.
Signed-off-by: Byungho An <bh74.an@samsung.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 6b26d31..3ac9bd7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1504,7 +1504,8 @@ static int stmmac_open(struct net_device *dev)
stmmac_check_ether_addr(priv);
- if (!priv->pcs) {
+ if (priv->pcs != STMMAC_PCS_RGMII && priv->pcs != STMMAC_PCS_TBI &&
+ priv->pcs != STMMAC_PCS_RTBI) {
ret = stmmac_init_phy(dev);
if (ret) {
pr_err("%s: Cannot attach to PHY (error: %d)\n",
@@ -1607,7 +1608,8 @@ static int stmmac_open(struct net_device *dev)
/* Using PCS we cannot dial with the phy registers at this stage
* so we do not support extra feature like EEE.
*/
- if (!priv->pcs)
+ if (priv->pcs != STMMAC_PCS_RGMII && priv->pcs != STMMAC_PCS_TBI &&
+ priv->pcs != STMMAC_PCS_RTBI)
priv->eee_enabled = stmmac_eee_init(priv);
stmmac_init_tx_coalesce(priv);
@@ -2637,7 +2639,8 @@ struct stmmac_priv *stmmac_dvr_probe(struct device
*device,
stmmac_check_pcs_mode(priv);
- if (!priv->pcs) {
+ if (priv->pcs != STMMAC_PCS_RGMII && priv->pcs != STMMAC_PCS_TBI &&
+ priv->pcs != STMMAC_PCS_RTBI) {
/* MDIO bus Registration */
ret = stmmac_mdio_register(ndev);
if (ret < 0) {
@@ -2677,7 +2680,8 @@ int stmmac_dvr_remove(struct net_device *ndev)
priv->hw->dma->stop_tx(priv->ioaddr);
stmmac_set_mac(priv->ioaddr, false);
- if (!priv->pcs)
+ if (priv->pcs != STMMAC_PCS_RGMII && priv->pcs != STMMAC_PCS_TBI &&
+ priv->pcs != STMMAC_PCS_RTBI)
stmmac_mdio_unregister(ndev);
netif_carrier_off(ndev);
unregister_netdev(ndev);
--
1.7.10.4
^ permalink raw reply related
* [PATCH 3/2] scm: Stop passing struct cred
From: Eric W. Biederman @ 2013-04-04 3:28 UTC (permalink / raw)
To: David S. Miller
Cc: Sven Joachim, Greg Kroah-Hartman, linux-kernel, Ding Tianhong,
Eric Dumazet, Andy Lutomirski, Karel Srot, netdev, Eric Dumazet
In-Reply-To: <87d2ubhwiw.fsf_-_@xmission.com>
Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.
Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
Included in this patchset because there are trivial dependencies,
and since we are sort of arguing about this anyway. This definitely is
not for stable.
include/net/af_unix.h | 3 ++-
include/net/scm.h | 16 ++++++----------
net/core/scm.c | 16 ----------------
net/unix/af_unix.c | 16 ++++++++--------
4 files changed, 16 insertions(+), 35 deletions(-)
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 0a996a3..a8836e8 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -29,7 +29,8 @@ struct unix_address {
struct unix_skb_parms {
struct pid *pid; /* Skb credentials */
- const struct cred *cred;
+ kuid_t uid;
+ kgid_t gid;
struct scm_fp_list *fp; /* Passed files */
#ifdef CONFIG_SECURITY_NETWORK
u32 secid; /* Security ID */
diff --git a/include/net/scm.h b/include/net/scm.h
index 975cca0..5a4c6a9 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -26,7 +26,6 @@ struct scm_fp_list {
struct scm_cookie {
struct pid *pid; /* Skb credentials */
- const struct cred *cred;
struct scm_fp_list *fp; /* Passed files */
struct scm_creds creds; /* Skb credentials */
#ifdef CONFIG_SECURITY_NETWORK
@@ -51,23 +50,18 @@ static __inline__ void unix_get_peersec_dgram(struct socket *sock, struct scm_co
#endif /* CONFIG_SECURITY_NETWORK */
static __inline__ void scm_set_cred(struct scm_cookie *scm,
- struct pid *pid, const struct cred *cred)
+ struct pid *pid, kuid_t uid, kgid_t gid)
{
scm->pid = get_pid(pid);
- scm->cred = cred ? get_cred(cred) : NULL;
scm->creds.pid = pid_vnr(pid);
- scm->creds.uid = cred ? cred->euid : INVALID_UID;
- scm->creds.gid = cred ? cred->egid : INVALID_GID;
+ scm->creds.uid = uid;
+ scm->creds.gid = gid;
}
static __inline__ void scm_destroy_cred(struct scm_cookie *scm)
{
put_pid(scm->pid);
scm->pid = NULL;
-
- if (scm->cred)
- put_cred(scm->cred);
- scm->cred = NULL;
}
static __inline__ void scm_destroy(struct scm_cookie *scm)
@@ -81,8 +75,10 @@ static __inline__ int scm_send(struct socket *sock, struct msghdr *msg,
struct scm_cookie *scm, bool forcecreds)
{
memset(scm, 0, sizeof(*scm));
+ scm->creds.uid = INVALID_UID;
+ scm->creds.gid = INVALID_GID;
if (forcecreds)
- scm_set_cred(scm, task_tgid(current), current_cred());
+ scm_set_cred(scm, task_tgid(current), current_euid(), current_egid());
unix_get_peersec_dgram(sock, scm);
if (msg->msg_controllen <= 0)
return 0;
diff --git a/net/core/scm.c b/net/core/scm.c
index 2dc6cda..83b2b38 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -187,22 +187,6 @@ int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *p)
p->creds.uid = uid;
p->creds.gid = gid;
-
- if (!p->cred ||
- !uid_eq(p->cred->euid, uid) ||
- !gid_eq(p->cred->egid, gid)) {
- struct cred *cred;
- err = -ENOMEM;
- cred = prepare_creds();
- if (!cred)
- goto error;
-
- cred->uid = cred->euid = uid;
- cred->gid = cred->egid = gid;
- if (p->cred)
- put_cred(p->cred);
- p->cred = cred;
- }
break;
}
default:
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 2db702d..f5594b5 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1340,7 +1340,6 @@ static void unix_destruct_scm(struct sk_buff *skb)
struct scm_cookie scm;
memset(&scm, 0, sizeof(scm));
scm.pid = UNIXCB(skb).pid;
- scm.cred = UNIXCB(skb).cred;
if (UNIXCB(skb).fp)
unix_detach_fds(&scm, skb);
@@ -1391,8 +1390,8 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
int err = 0;
UNIXCB(skb).pid = get_pid(scm->pid);
- if (scm->cred)
- UNIXCB(skb).cred = get_cred(scm->cred);
+ UNIXCB(skb).uid = scm->creds.uid;
+ UNIXCB(skb).gid = scm->creds.gid;
UNIXCB(skb).fp = NULL;
if (scm->fp && send_fds)
err = unix_attach_fds(scm, skb);
@@ -1409,13 +1408,13 @@ static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool sen
static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
const struct sock *other)
{
- if (UNIXCB(skb).cred)
+ if (UNIXCB(skb).pid)
return;
if (test_bit(SOCK_PASSCRED, &sock->flags) ||
!other->sk_socket ||
test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
UNIXCB(skb).pid = get_pid(task_tgid(current));
- UNIXCB(skb).cred = get_current_cred();
+ current_euid_egid(&UNIXCB(skb).uid, &UNIXCB(skb).gid);
}
}
@@ -1819,7 +1818,7 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
siocb->scm = &tmp_scm;
memset(&tmp_scm, 0, sizeof(tmp_scm));
}
- scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+ scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).uid, UNIXCB(skb).gid);
unix_set_secdata(siocb->scm, skb);
if (!(flags & MSG_PEEK)) {
@@ -1991,11 +1990,12 @@ again:
if (check_creds) {
/* Never glue messages from different writers */
if ((UNIXCB(skb).pid != siocb->scm->pid) ||
- (UNIXCB(skb).cred != siocb->scm->cred))
+ !uid_eq(UNIXCB(skb).uid, siocb->scm->creds.uid) ||
+ !gid_eq(UNIXCB(skb).gid, siocb->scm->creds.gid))
break;
} else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
/* Copy credentials */
- scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
+ scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).uid, UNIXCB(skb).gid);
check_creds = 1;
}
--
1.7.5.4
^ permalink raw reply related
* Re: [net] ixgbe: fix registration order of driver and DCA nofitication
From: David Miller @ 2013-04-04 3:25 UTC (permalink / raw)
To: joe; +Cc: jeffrey.t.kirsher, jakub.kicinski, netdev, gospo, sassmann,
stable
In-Reply-To: <1365045072.5248.1.camel@joe-AO722>
From: Joe Perches <joe@perches.com>
Date: Wed, 03 Apr 2013 20:11:12 -0700
> On Wed, 2013-04-03 at 19:50 -0700, Jeff Kirsher wrote:
>> From: Jakub Kicinski <jakub.kicinski@intel.com>
>>
>> ixgbe_notify_dca cannot be called before driver registration
>> because it expects driver's klist_devices to be allocated and
>> initialized. While on it make sure debugfs files are removed
>> when registration fails.
>
> Why not take out a bunch of these #ifdef CONFIG_DEBUG_FS at
> the same time? Something like:
Because it's a bug fix targetted at 'net' and thus must be a minimal
fix rather than something targetted at 'net-next' where we could do
cleanups like that.
^ permalink raw reply
* Re: [net] ixgbe: fix registration order of driver and DCA nofitication
From: Joe Perches @ 2013-04-04 3:11 UTC (permalink / raw)
To: Jeff Kirsher; +Cc: davem, Jakub Kicinski, netdev, gospo, sassmann, stable
In-Reply-To: <1365043854-8443-1-git-send-email-jeffrey.t.kirsher@intel.com>
On Wed, 2013-04-03 at 19:50 -0700, Jeff Kirsher wrote:
> From: Jakub Kicinski <jakub.kicinski@intel.com>
>
> ixgbe_notify_dca cannot be called before driver registration
> because it expects driver's klist_devices to be allocated and
> initialized. While on it make sure debugfs files are removed
> when registration fails.
Why not take out a bunch of these #ifdef CONFIG_DEBUG_FS at
the same time? Something like:
drivers/net/ethernet/intel/ixgbe/ixgbe.h | 5 +++++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 17 +++++++----------
2 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index a8e10cf..ca93238 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -740,6 +740,11 @@ extern void ixgbe_dbg_adapter_init(struct ixgbe_adapter *adapter);
extern void ixgbe_dbg_adapter_exit(struct ixgbe_adapter *adapter);
extern void ixgbe_dbg_init(void);
extern void ixgbe_dbg_exit(void);
+#else
+static inline void ixgbe_dbg_adapter_init(struct ixgbe_adapter *adapter) {}
+static inline void ixgbe_dbg_adapter_exit(struct ixgbe_adapter *adapter) {}
+static inline void ixgbe_dbg_init(void) {}
+static inline void ixgbe_dbg_exit(void) {}
#endif /* CONFIG_DEBUG_FS */
static inline struct netdev_queue *txring_txq(const struct ixgbe_ring *ring)
{
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e56a3d1..06cd8cd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7575,9 +7575,7 @@ skip_sriov:
e_err(probe, "failed to allocate sysfs resources\n");
#endif /* CONFIG_IXGBE_HWMON */
-#ifdef CONFIG_DEBUG_FS
ixgbe_dbg_adapter_init(adapter);
-#endif /* CONFIG_DEBUG_FS */
return 0;
@@ -7613,9 +7611,7 @@ static void ixgbe_remove(struct pci_dev *pdev)
struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
struct net_device *netdev = adapter->netdev;
-#ifdef CONFIG_DEBUG_FS
ixgbe_dbg_adapter_exit(adapter);
-#endif /*CONFIG_DEBUG_FS */
set_bit(__IXGBE_DOWN, &adapter->state);
cancel_work_sync(&adapter->service_task);
@@ -7878,16 +7874,19 @@ static int __init ixgbe_init_module(void)
pr_info("%s - version %s\n", ixgbe_driver_string, ixgbe_driver_version);
pr_info("%s\n", ixgbe_copyright);
-#ifdef CONFIG_DEBUG_FS
ixgbe_dbg_init();
-#endif /* CONFIG_DEBUG_FS */
+
+ ret = pci_register_driver(&ixgbe_driver);
+ if (ret) {
+ ixgbe_dbg_exit();
+ return ret;
+ }
#ifdef CONFIG_IXGBE_DCA
dca_register_notify(&dca_notifier);
#endif
- ret = pci_register_driver(&ixgbe_driver);
- return ret;
+ return 0;
}
module_init(ixgbe_init_module);
@@ -7905,9 +7904,7 @@ static void __exit ixgbe_exit_module(void)
#endif
pci_unregister_driver(&ixgbe_driver);
-#ifdef CONFIG_DEBUG_FS
ixgbe_dbg_exit();
-#endif /* CONFIG_DEBUG_FS */
rcu_barrier(); /* Wait for completion of call_rcu()'s */
}
^ permalink raw reply related
* [net] ixgbe: fix registration order of driver and DCA nofitication
From: Jeff Kirsher @ 2013-04-04 2:50 UTC (permalink / raw)
To: davem; +Cc: Jakub Kicinski, netdev, gospo, sassmann, stable, Jeff Kirsher
From: Jakub Kicinski <jakub.kicinski@intel.com>
ixgbe_notify_dca cannot be called before driver registration
because it expects driver's klist_devices to be allocated and
initialized. While on it make sure debugfs files are removed
when registration fails.
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index db5611a..79f4a26 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7922,12 +7922,19 @@ static int __init ixgbe_init_module(void)
ixgbe_dbg_init();
#endif /* CONFIG_DEBUG_FS */
+ ret = pci_register_driver(&ixgbe_driver);
+ if (ret) {
+#ifdef CONFIG_DEBUG_FS
+ ixgbe_dbg_exit();
+#endif /* CONFIG_DEBUG_FS */
+ return ret;
+ }
+
#ifdef CONFIG_IXGBE_DCA
dca_register_notify(&dca_notifier);
#endif
- ret = pci_register_driver(&ixgbe_driver);
- return ret;
+ return 0;
}
module_init(ixgbe_init_module);
--
1.7.11.7
^ permalink raw reply related
* [PATCH 2/2] af_unix: If we don't care about credentials coallesce all messages
From: Eric W. Biederman @ 2013-04-04 2:14 UTC (permalink / raw)
To: David S. Miller
Cc: Sven Joachim, Greg Kroah-Hartman, linux-kernel, stable,
Ding Tianhong, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
Eric Dumazet
In-Reply-To: <87li8zhwkw.fsf_-_@xmission.com>
It was reported that the following LSB test case failed
https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we
were not coallescing unix stream messages when the application was
expecting us to.
The problem was that the first send was before the socket was accepted
and thus sock->sk_socket was NULL in maybe_add_creds, and the second
send after the socket was accepted had a non-NULL value for sk->socket
and thus we could tell the credentials were not needed so we did not
bother.
The unnecessary credentials on the first message cause
unix_stream_recvmsg to start verifying that all messages had the same
credentials before coallescing and then the coallescing failed because
the second message had no credentials.
Ignoring credentials when we don't care in unix_stream_recvmsg fixes a
long standing pessimization which would fail to coallesce messages when
reading from a unix stream socket if the senders were different even if
we did not care about their credentials.
I have tested this and verified that the in the LSB test case mentioned
above that the messages do coallesce now, while the were failing to
coallesce without this change.
Reported-by: Karel Srot <ksrot@redhat.com>
Reported-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
net/unix/af_unix.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f153a8d..2db702d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1993,7 +1993,7 @@ again:
if ((UNIXCB(skb).pid != siocb->scm->pid) ||
(UNIXCB(skb).cred != siocb->scm->cred))
break;
- } else {
+ } else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
/* Copy credentials */
scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
check_creds = 1;
--
1.7.5.4
^ permalink raw reply related
* [PATCH 1/2] Revert "af_unix: dont send SCM_CREDENTIAL when dest socket is NULL"
From: Eric W. Biederman @ 2013-04-04 2:13 UTC (permalink / raw)
To: David S. Miller
Cc: Sven Joachim, Greg Kroah-Hartman, linux-kernel, stable,
Ding Tianhong, Eric Dumazet, Andy Lutomirski, Karel Srot, netdev,
Eric Dumazet
In-Reply-To: <878v4zjei0.fsf@xmission.com>
This reverts commit 14134f6584212d585b310ce95428014b653dfaf6.
The problem that the above patch was meant to address is that af_unix
messages are not being coallesced because we are sending unnecesarry
credentials. Not sending credentials in maybe_add_creds totally
breaks unconnected unix domain sockets that wish to send credentails
to other sockets.
In practice this break some versions of udev because they receive a
message and the sending uid is bogus so they drop the message.
Cc: stable@vger.kernel.org
Reported-by: Sven Joachim <svenjoac@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
net/unix/af_unix.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 971282b..f153a8d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1412,8 +1412,8 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
if (UNIXCB(skb).cred)
return;
if (test_bit(SOCK_PASSCRED, &sock->flags) ||
- (other->sk_socket &&
- test_bit(SOCK_PASSCRED, &other->sk_socket->flags))) {
+ !other->sk_socket ||
+ test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
UNIXCB(skb).pid = get_pid(task_tgid(current));
UNIXCB(skb).cred = get_current_cred();
}
--
1.7.5.4
^ permalink raw reply related
* [PATCH -next] decnet: remove duplicated include from dn_table.c
From: Wei Yongjun @ 2013-04-04 1:33 UTC (permalink / raw)
To: davem, tgraf, shemminger, sasha.levin, akpm
Cc: yongjun_wei, linux-decnet-user, netdev
From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
---
net/decnet/dn_table.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index b15c1d1..86e3807 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -19,7 +19,6 @@
#include <linux/sockios.h>
#include <linux/init.h>
#include <linux/skbuff.h>
-#include <net/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/proc_fs.h>
#include <linux/netdevice.h>
^ permalink raw reply related
* [PATCH -next] net_cls: remove duplicated include from cls_api.c
From: Wei Yongjun @ 2013-04-04 1:21 UTC (permalink / raw)
To: jhs, davem; +Cc: yongjun_wei, netdev
From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
---
net/sched/cls_api.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 5c81b26..8e118af 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -22,7 +22,6 @@
#include <linux/skbuff.h>
#include <linux/init.h>
#include <linux/kmod.h>
-#include <net/netlink.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <net/net_namespace.h>
^ permalink raw reply related
* Powered By Google.
From: Google Incorporation ® @ 2013-04-03 21:47 UTC (permalink / raw)
[-- Attachment #1: Type: text/plain, Size: 207 bytes --]
Dear Google User,
You have been selected as a winner for using Google services. Find attached email with more details.
Congratulations,
Matt Brittin.
CEO Google UK.
©2013 Google Incorporation's
[-- Attachment #2: Google.pdf --]
[-- Type: application/pdf, Size: 1676251 bytes --]
^ permalink raw reply
* Re: [PATCH v2] MPLS: Add limited GSO support
From: Simon Horman @ 2013-04-04 0:55 UTC (permalink / raw)
To: Jesse Gross; +Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev
In-Reply-To: <CAEP_g=9cnm39i7rBk=Pi_1e6dFDLMa6oVFuO3PufEhODCVBqnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, Apr 03, 2013 at 05:44:17PM -0700, Jesse Gross wrote:
> On Wed, Apr 3, 2013 at 5:28 PM, Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org> wrote:
> > On Wed, Apr 03, 2013 at 04:51:38PM -0700, Jesse Gross wrote:
> >> On Wed, Apr 3, 2013 at 2:11 AM, Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org> wrote:
> >> > In the case where a non-MPLS packet is recieved and an MPLS stack is
> >> > added it may well be the case that the original skb is GSO but the
> >> > NIC used for transmit does not support GSO of MPLS packets.
> >> >
> >> > The aim of this code is to provide GSO in software for MPLS packets
> >> > whose skbs are GSO.
> >> >
> >> > When an implementation adds an MPLS stack to a non-MPLS packet it should do
> >> > the following to skb metadata:
> >> >
> >> > * Set skb_mac_header(skb)->protocol to the new MPLS ethertype.
> >> > That is, either ETH_P_MPLS_UC or ETH_P_MPLS_MC.
> >> >
> >> > * Leave skb->protocol as the old non-MPLS ethertype.
> >> >
> >> > * Set skb->encapsulation = 1.
> >> >
> >> > This may not strictly be necessary as I believe that checking
> >> > skb_mac_header(skb)->protocol and skb->protocol should be necessary and
> >> > sufficient.
> >> >
> >> > However, it does seem to fit nicely with the current implementation of
> >> > dev_hard_start_xmit() where the more expensive check of
> >> > skb_mac_header(skb)->protocol may be guarded by an existing check of
> >> > skb->encapsulation.
> >> >
> >> > One aspect of this patch that I am unsure about is the modification I have
> >> > made to skb_segment(). This seems to be necessary as checskum accelearation
> >> > may no longer be possible as the packet has changed to be MPLS from some
> >> > other packet type which may have been supported by the hardware in use.
> >> >
> >> > I will post a patch, "[PATCH v3.24] datapath: Add basic MPLS support to
> >> > kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
> >> > That patch sets the above requirements in
> >> > datapath/actions.c:set_ethertype() and was used to exercise the MPLS GSO
> >> > code. The datapath patch is against the Open vSwtich tree but it is
> >> > intended that it be added to the Open vSwtich code present in the mainline
> >> > Linux kernel at some point.
> >> >
> >> > Suggested by Jesse Gross. Based heavily on "v4 GRE: Add TCP segmentation
> >> > offload for GRE" by Pravin B Shelar.
> >> >
> >> > Cc: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> >> > Cc: Pravin B Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> >> > Signed-off-by: Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>
> >>
> >> MPLS is very similar to both the Ethernet header and vlans in that GSO
> >> only requires replication without any modification. That means that
> >> if we look at the mac_len as containing all three then we can just
> >> copy it without any special knowledge. I don't know that we carefully
> >> maintain mac_len in all places but you are already doing that in your
> >> MPLS patches.
> >
> > At least for the cases that I am aware of I think that mac_len is
> > predictable. But I'm a little unsure of what you are getting at here.
>
> If you have the MAC len then you don't need any new MPLS code here at
> all; just replicate the whole thing onto each packet.
The MAC len is set to cover everything up to the top of the MPLS stack.
So it seems that something needs to be done to account for the length
of the MPLS stack.
Are you suggesting that if Open vSwtich set up the MAC len to extend
to the end of the MPLS stack then mpls_gro_segment() would not be necessary?
^ permalink raw reply
* linux-next: manual merge of the net-next tree with the wireless tree
From: Stephen Rothwell @ 2013-04-04 0:44 UTC (permalink / raw)
To: David Miller, netdev
Cc: linux-next, linux-kernel, Joe Perches, Gabor Juhos,
John W. Linville
[-- Attachment #1: Type: text/plain, Size: 1589 bytes --]
Hi all,
Today's linux-next merge of the net-next tree got a conflict in
drivers/net/wireless/rt2x00/rt2x00pci.c between commit 69a2bac8984c
("rt2x00: rt2x00pci: fix build error on Ralink RT3x5x SoCs") from the
wireless tree and commit 1f9061d27d3d ("drivers:net: dma_alloc_coherent:
use __GFP_ZERO instead of memset(, 0)") from the net-next tree.
The former moved the code modified by the latter into a different file,
so I added the following merge fix patch and can carry the fix as
necessary (no action is required).
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 4 Apr 2013 11:42:05 +1100
Subject: [PATCH] drivers:net: fix up for code movement from rt2x00pci.c
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
drivers/net/wireless/rt2x00/rt2x00mmio.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/wireless/rt2x00/rt2x00mmio.c b/drivers/net/wireless/rt2x00/rt2x00mmio.c
index d84a680..9fe9a36 100644
--- a/drivers/net/wireless/rt2x00/rt2x00mmio.c
+++ b/drivers/net/wireless/rt2x00/rt2x00mmio.c
@@ -123,12 +123,10 @@ static int rt2x00pci_alloc_queue_dma(struct rt2x00_dev *rt2x00dev,
*/
addr = dma_alloc_coherent(rt2x00dev->dev,
queue->limit * queue->desc_size,
- &dma, GFP_KERNEL);
+ &dma, GFP_KERNEL | __GFP_ZERO);
if (!addr)
return -ENOMEM;
- memset(addr, 0, queue->limit * queue->desc_size);
-
/*
* Initialize all queue entries to contain valid addresses.
*/
--
1.8.1
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related
* Re: [PATCH v2] MPLS: Add limited GSO support
From: Jesse Gross @ 2013-04-04 0:44 UTC (permalink / raw)
To: Simon Horman
Cc: dev@openvswitch.org, netdev, Jarno Rajahalme, Pravin B Shelar
In-Reply-To: <20130404002809.GF29678@verge.net.au>
On Wed, Apr 3, 2013 at 5:28 PM, Simon Horman <horms@verge.net.au> wrote:
> On Wed, Apr 03, 2013 at 04:51:38PM -0700, Jesse Gross wrote:
>> On Wed, Apr 3, 2013 at 2:11 AM, Simon Horman <horms@verge.net.au> wrote:
>> > In the case where a non-MPLS packet is recieved and an MPLS stack is
>> > added it may well be the case that the original skb is GSO but the
>> > NIC used for transmit does not support GSO of MPLS packets.
>> >
>> > The aim of this code is to provide GSO in software for MPLS packets
>> > whose skbs are GSO.
>> >
>> > When an implementation adds an MPLS stack to a non-MPLS packet it should do
>> > the following to skb metadata:
>> >
>> > * Set skb_mac_header(skb)->protocol to the new MPLS ethertype.
>> > That is, either ETH_P_MPLS_UC or ETH_P_MPLS_MC.
>> >
>> > * Leave skb->protocol as the old non-MPLS ethertype.
>> >
>> > * Set skb->encapsulation = 1.
>> >
>> > This may not strictly be necessary as I believe that checking
>> > skb_mac_header(skb)->protocol and skb->protocol should be necessary and
>> > sufficient.
>> >
>> > However, it does seem to fit nicely with the current implementation of
>> > dev_hard_start_xmit() where the more expensive check of
>> > skb_mac_header(skb)->protocol may be guarded by an existing check of
>> > skb->encapsulation.
>> >
>> > One aspect of this patch that I am unsure about is the modification I have
>> > made to skb_segment(). This seems to be necessary as checskum accelearation
>> > may no longer be possible as the packet has changed to be MPLS from some
>> > other packet type which may have been supported by the hardware in use.
>> >
>> > I will post a patch, "[PATCH v3.24] datapath: Add basic MPLS support to
>> > kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
>> > That patch sets the above requirements in
>> > datapath/actions.c:set_ethertype() and was used to exercise the MPLS GSO
>> > code. The datapath patch is against the Open vSwtich tree but it is
>> > intended that it be added to the Open vSwtich code present in the mainline
>> > Linux kernel at some point.
>> >
>> > Suggested by Jesse Gross. Based heavily on "v4 GRE: Add TCP segmentation
>> > offload for GRE" by Pravin B Shelar.
>> >
>> > Cc: Jesse Gross <jesse@nicira.com>
>> > Cc: Pravin B Shelar <pshelar@nicira.com>
>> > Signed-off-by: Simon Horman <horms@verge.net.au>
>>
>> MPLS is very similar to both the Ethernet header and vlans in that GSO
>> only requires replication without any modification. That means that
>> if we look at the mac_len as containing all three then we can just
>> copy it without any special knowledge. I don't know that we carefully
>> maintain mac_len in all places but you are already doing that in your
>> MPLS patches.
>
> At least for the cases that I am aware of I think that mac_len is
> predictable. But I'm a little unsure of what you are getting at here.
If you have the MAC len then you don't need any new MPLS code here at
all; just replicate the whole thing onto each packet.
^ permalink raw reply
* Re: [PATCH v2] MPLS: Add limited GSO support
From: Simon Horman @ 2013-04-04 0:28 UTC (permalink / raw)
To: Jesse Gross; +Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org, netdev
In-Reply-To: <CAEP_g=_Y0hN2DvyO4JitMrpQvN=TmTO5Pk28DpvF6Ya09AeTcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Wed, Apr 03, 2013 at 04:51:38PM -0700, Jesse Gross wrote:
> On Wed, Apr 3, 2013 at 2:11 AM, Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org> wrote:
> > In the case where a non-MPLS packet is recieved and an MPLS stack is
> > added it may well be the case that the original skb is GSO but the
> > NIC used for transmit does not support GSO of MPLS packets.
> >
> > The aim of this code is to provide GSO in software for MPLS packets
> > whose skbs are GSO.
> >
> > When an implementation adds an MPLS stack to a non-MPLS packet it should do
> > the following to skb metadata:
> >
> > * Set skb_mac_header(skb)->protocol to the new MPLS ethertype.
> > That is, either ETH_P_MPLS_UC or ETH_P_MPLS_MC.
> >
> > * Leave skb->protocol as the old non-MPLS ethertype.
> >
> > * Set skb->encapsulation = 1.
> >
> > This may not strictly be necessary as I believe that checking
> > skb_mac_header(skb)->protocol and skb->protocol should be necessary and
> > sufficient.
> >
> > However, it does seem to fit nicely with the current implementation of
> > dev_hard_start_xmit() where the more expensive check of
> > skb_mac_header(skb)->protocol may be guarded by an existing check of
> > skb->encapsulation.
> >
> > One aspect of this patch that I am unsure about is the modification I have
> > made to skb_segment(). This seems to be necessary as checskum accelearation
> > may no longer be possible as the packet has changed to be MPLS from some
> > other packet type which may have been supported by the hardware in use.
> >
> > I will post a patch, "[PATCH v3.24] datapath: Add basic MPLS support to
> > kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
> > That patch sets the above requirements in
> > datapath/actions.c:set_ethertype() and was used to exercise the MPLS GSO
> > code. The datapath patch is against the Open vSwtich tree but it is
> > intended that it be added to the Open vSwtich code present in the mainline
> > Linux kernel at some point.
> >
> > Suggested by Jesse Gross. Based heavily on "v4 GRE: Add TCP segmentation
> > offload for GRE" by Pravin B Shelar.
> >
> > Cc: Jesse Gross <jesse-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> > Cc: Pravin B Shelar <pshelar-l0M0P4e3n4LQT0dZR+AlfA@public.gmane.org>
> > Signed-off-by: Simon Horman <horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>
>
> MPLS is very similar to both the Ethernet header and vlans in that GSO
> only requires replication without any modification. That means that
> if we look at the mac_len as containing all three then we can just
> copy it without any special knowledge. I don't know that we carefully
> maintain mac_len in all places but you are already doing that in your
> MPLS patches.
At least for the cases that I am aware of I think that mac_len is
predictable. But I'm a little unsure of what you are getting at here.
> The other piece at that point is getting the inner protocol. I'm
> worried that using skb->protocol for this will break things that try
> to use it for filtering. It also means that depending on whether an
> MPLS packet is locally sourced or not skb->protocol may be different
> because we won't always be able to find the inner header. I think
> this will require a careful definition of that field to make it
> consistent.
Yes, I agree. I was hoping that posting the patch to netdev would
result in some analysis of this.
Another idea I had would be to a new element to struct sk_buff to
explicitly hold the encapsulated protocol for (cases like) MPLS.
But it seems that it would be nicer to re-use the protocol element
if possible.
^ permalink raw reply
* Re: [PATCH] MPLS: Add limited GSO support
From: Simon Horman @ 2013-04-04 0:17 UTC (permalink / raw)
To: Ben Hutchings; +Cc: dev, netdev, Jesse Gross, Pravin B Shelar
In-Reply-To: <1365004780.2897.3.camel@bwh-desktop.uk.solarflarecom.com>
On Wed, Apr 03, 2013 at 04:59:40PM +0100, Ben Hutchings wrote:
> I don't know anything about MPLS so this is a pretty superficial review.
>
> On Wed, 2013-04-03 at 14:24 +0900, Simon Horman wrote:
> [...]
> > --- a/include/linux/netdev_features.h
> > +++ b/include/linux/netdev_features.h
> > @@ -43,6 +43,7 @@ enum {
> > NETIF_F_FSO_BIT, /* ... FCoE segmentation */
> > NETIF_F_GSO_GRE_BIT, /* ... GRE with TSO */
> > NETIF_F_GSO_UDP_TUNNEL_BIT, /* ... UDP TUNNEL with TSO */
> > + NETIF_F_GSO_MPLS_BIT, /* ... MPLS segmentation */
> > /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
> > NETIF_F_GSO_UDP_TUNNEL_BIT,
>
> You need to change NETIF_F_GSO_LAST as well.
Thanks, I have fixed that in v2.
>
> [...]
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> [...]
> > @@ -2789,12 +2791,17 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
> > }
> > #endif
> >
> > -/* Keeps track of mac header offset relative to skb->head.
> > - * It is useful for TSO of Tunneling protocol. e.g. GRE.
> > - * For non-tunnel skb it points to skb_mac_header() and for
> > - * tunnel skb it points to outer mac header. */
> > struct skb_gso_cb {
> > + /* Keeps track of mac header offset relative to skb->head.
> > + * It is useful for TSO of Tunneling protocol. e.g. GRE.
> > + * For non-tunnel skb it points to skb_mac_header() and for
> > + * tunnel skb it points to outer mac header. */
> > int mac_offset;
> > +
> > + /* Keeps track of the ethernet type of an encapsualted
> [...]
>
> Typo: 'encapsualted' should be 'encapsulated'.
Thanks, I will fix that in v3.
^ permalink raw reply
* Re: this cpu documentation
From: Randy Dunlap @ 2013-04-04 0:09 UTC (permalink / raw)
To: Christoph Lameter
Cc: Eric Dumazet, RongQing Li, Shan Wei, netdev, Tejun Heo, srostedt
In-Reply-To: <0000013dd1a20ebf-4a76fb06-4b9d-492e-9d77-4b3f43aceca7-000000@email.amazonses.com>
On 04/03/13 13:41, Christoph Lameter wrote:
>
> From: Christoph Lameter <cl@linux.com>
> Subject: this_cpu: Add documentation
>
> Document the rationale and the way to use this_cpu operations.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> Index: linux/Documentation/this_cpu_ops
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux/Documentation/this_cpu_ops 2013-04-03 15:25:41.424846306 -0500
> @@ -0,0 +1,194 @@
> +this_cpu operations
> +-------------------
> +
> +this_cpu operations are a way of optimizing access to per cpu variables
> +associated with the *currently* executing processor
> +through the use of segment registers (or a dedicated register where the cpu
> +permanently stored the beginning of the per cpu area for a specific
> +processor).
> +
> +The this_cpu operations add an per cpu variable offset to the processor
add a per
> +specific percpu base and encode that operation in the instruction operating
> +on the per cpu variable.
> +
> +This mean there are no atomicity issues between the calculation
means
> +of the offset and the operation on the data. Therefore it is not necessary
> +to disable preempt or interrupts to ensure that the processor is not changed
> +between the calculation of the address and the operation on the data.
> +
> +Read-modify-write operations are of particular interest. Frequently
> +processors have special lower latency instructions that can operate without
> +the typical synchronization overhead but still provide some sort of relaxed
> +atomicity guarantee. The x86 for example can execute RMV instructions like
RMW ??
> +inc/dec/cmpxchg without the lock prefix and the associated latency penalty.
> +
> +Access to the variable without the lock prefix is not synchronized but
> +synchronization is not necessary since we are dealing with per cpu data
> +specific to the currently executing processor. Only the current processor
> +should be accessing that variable and therefore there are no concurency
concurrency
> +issues with other processors in the system.
> +
> +On x86 the fs: or the gs: segment registers contain the basis of the per cpu area. It is
base
> +then possible to simply use the segment override to relocate a per cpu relative address
> +to the proper per cpu area for the processor. So the relocation to the per cpu base
> +is encoded in the instruction via a segment register prefix.
> +
> +For example:
> +
> + DEFINE_PER_CPU(int, x);
> + int z;
> +
> + z = this_cpu_read(x);
> +
> +results in a single instruction
> +
> + mov ax, gs:[x]
> +
> +instead of a sequence of calculation of the address and then a fetch from
> +that address which occurs with the percpu operations. Before this_cpu_ops
> +such sequence also required preempt disable/enable to prevent the Os from
OS or O/S or kernel
> +moving the thread to a different processor while the calculation is performed.
> +
> +
> +The main use of the this_cpu operations has been to optimize counter operations.
> +
> +
> + this_cpu_inc(x)
> +
> +results in the following single instruction (no lock prefix!)
> +
> + inc gs:[x]
> +
> +
> +instead of the following operations required if there is no segment register.
> +
> + int *y;
> + int cpu;
> +
> + cpu = get_cpu();
> + y = per_cpu_ptr(&x, cpu);
> + (*y)++;
> + put_cpu();
> +
> +
> +Note that these operations can only be used on percpu data that is reserved for
> +a specific processor. Without disabling preemption in the surrounding code
> +this_cpu_inc() will only guarantee that one of the percpu counters is correctly
> +incremented. However, there is no guarantee that the OS will not move the process
> +directly before or after the this_cpu instruction is executed. In general this
> +means that the value of the individual counters for each processor are
> +meaningless. The sum of all the per cpu counters is the only value that is of
> +interest.
> +
> +Per cpu variables are used for performance reasons. Bouncing cache lines can
> +be avoided if multiple processors concurrently go through the same code paths.
> +Since each processor has its own per cpu variables no concurrent cacheline
> +updates take place. The price that has to be paid for this optimization is
> +the need to add up the per cpu counters when the value of the counter is
> +needed.
> +
> +
> +Special operations:
> +-------------------
> +
> + y = this_cpu_ptr(&x)
> +
> +Takes the offset of a per cpu variable (&x !) and returns the address of the
> +per cpu variable that belongs to the currently executing processor.
> +this_cpu_ptr avoids multiple steps that the common get_cpu/put_cpu sequence
> +requires. No processor number is available. Instead the offset of the local\
drop ending backslash
> +per cpu area is simply added to the percpu offset.
> +
> +
> +
> +Per cpu variables and offsets
> +-----------------------------
> +
> +Per cpu variables have *offsets* to the beginning of the percpu area. They do
> +not have addresses although they look like that in the code. Offsets
> +cannot be directly dereferenced. The offset must be added to a base pointer of
> +a percpu area of a processor in order to form a valid address.
> +
> +Therefore the use of x or &x outside of the context of per cpu operations
> +is invalid and will generally be treated like a NULL pointer dereference.
> +
> +In the context of per cpu operations
> +
> + x is a per cpu variable. Most this_cpu operations take a cpu variable.
> +
> + &x is the *offset* a per cpu variable. this_cpu_ptr() takes the offset
> + of a per cpu variable which makes this look a bit strange.
> +
> +
> +
> +Operations on a field of a per cpu structure
> +--------------------------------------------
> +
> +Lets say we have a percpu structure
Let's
> +
> + struct s {
> + int n,m;
> + };
> +
> + DEFINE_PER_CPU(struct s, p);
> +
> +
> +Operations on these fields are straightforward
> +
> + this_cpu_inc(p.m)
> +
> + z = this_cpu_cmpxchg(p.m, 0, 1);
> +
> +
> +If we have an offset to struct s:
> +
> + struct s __percpu *ps = &p;
> +
> + z = this_cpu_dec(ps->m);
> +
> + z = this_cpu_inc_return(ps->n);
> +
> +
> +The calculation of the pointer may require the use of this_cpu_ptr() if we
> +do not make use of this_cpu ops later to manipulate fields:
> +
> + struct s *pp;
> +
> + pp = this_cpu_ptr(&p);
> +
> + pp->m--
add ;
> +
> + z = pp->n++
add ;
> +
> +
> +Variants of this_cpu ops
> +-------------------------
> +
> +this_cpu ops are interupt safe. Some architecture do not support these per
interrupt
> +cpu local operations. In that case the operation must be replaced by code
> +that disables interrupts, then does the operations that are guaranteed to be
> +atomic and then reenable interrupts. Doing so is expensive. If there are
> +other reasons why the scheduler cannot change the processor we are executing
> +on then there is no reason to disable interrupts. For that purpose
> +the __this_cpu operations are provided. F.e.
E.g. or For example:
> +
> + __this_cpu_inc(x)
> +
> +Will increment x and will not fallback to code that disables interrupts on
> +platforms that cannot accomplish atomicity through address relocation and
> +an RMV operation in the same instruction.
RMW ?
> +
> +
> +
> +&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
> +--------------------------------------------
> +
> +The first operation takes the offset and forms an address and then adds
> +the offset of the n field.
> +
> +The second one first adds the two offsets and then does the relocation.
> +IMHO the second form looks cleaner and has an easier time with ().
> +
> +
> +Christoph Lameter, April 3rd, 2013
--
~Randy
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox