* [PATCH 04/14] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
This is needed to allow network softirq packet processing to make
use of PF_MEMALLOC.
Currently softirq context cannot use PF_MEMALLOC because it is not
associated with a task and therefore has no task flags to fiddle
with. As a result, the gfp to alloc flag mapping ignores the task
flags when in interrupt (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. We basically borrow the task flags from whatever process
happens to be preempted by the softirq.
So we modify the gfp to alloc flags mapping to not exclude task flags
in softirq context, and modify the softirq code to save, clear and
restore the PF_MEMALLOC flag.
The save and clear ensure that the preempted task's PF_MEMALLOC flag
doesn't leak into the softirq. The restore ensures that a softirq's
PF_MEMALLOC flag cannot leak back into the preempted process.
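
The save, clear and restore sequence described above can be sketched in
userspace as follows. This is only a model under stated assumptions:
struct task_model, PF_OTHER and run_softirq_model() are hypothetical names
invented for illustration, and the flag values are illustrative. Only
tsk_restore_flags() mirrors the helper added by this patch.

```c
#include <assert.h>

/* Illustrative flag bits; only the relative behaviour matters here. */
#define PF_MEMALLOC 0x00000800UL
#define PF_OTHER    0x00000001UL	/* hypothetical unrelated flag */

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	unsigned long flags;
};

/* Same logic as the tsk_restore_flags() helper added by the patch. */
static void tsk_restore_flags(struct task_model *p,
			      unsigned long pflags, unsigned long mask)
{
	p->flags &= ~mask;
	p->flags |= pflags & mask;
}

/*
 * Model of the save/clear/restore sequence in __do_softirq().
 * Returns the flags the softirq handler observes on entry.
 */
static unsigned long run_softirq_model(struct task_model *cur,
				       int handler_sets_memalloc)
{
	unsigned long pflags, seen;

	assert(cur);
	pflags = cur->flags;		/* save */
	cur->flags &= ~PF_MEMALLOC;	/* clear */
	seen = cur->flags;

	/* The handler may set PF_MEMALLOC for its own allocations. */
	if (handler_sets_memalloc)
		cur->flags |= PF_MEMALLOC;

	tsk_restore_flags(cur, pflags, PF_MEMALLOC);	/* restore */
	return seen;
}
```

In this model a preempted task that had PF_MEMALLOC set gets it back
afterwards, while a handler that sets the flag itself leaves no trace in
the preempted task's flags.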
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/sched.h | 7 +++++++
kernel/softirq.c | 3 +++
mm/page_alloc.c | 5 ++++-
3 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..791536c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1869,6 +1869,13 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif
+static inline void tsk_restore_flags(struct task_struct *p,
+ unsigned long pflags, unsigned long mask)
+{
+ p->flags &= ~mask;
+ p->flags |= pflags & mask;
+}
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p,
const struct cpumask *new_mask);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index fca82c3..f773afe 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -210,6 +210,8 @@ asmlinkage void __do_softirq(void)
__u32 pending;
int max_restart = MAX_SOFTIRQ_RESTART;
int cpu;
+ unsigned long pflags = current->flags;
+ current->flags &= ~PF_MEMALLOC;
pending = local_softirq_pending();
account_system_vtime(current);
@@ -265,6 +267,7 @@ restart:
account_system_vtime(current);
__local_bh_enable(SOFTIRQ_OFFSET);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
}
#ifndef __ARCH_HAS_DO_SOFTIRQ
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fd18c..31e0eb2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2060,7 +2060,10 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
if (gfp_mask & __GFP_MEMALLOC)
alloc_flags |= ALLOC_NO_WATERMARKS;
- else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
+ else if (!in_irq() && (current->flags & PF_MEMALLOC))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ else if (!in_interrupt() &&
+ unlikely(test_thread_flag(TIF_MEMDIE)))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
--
1.7.3.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 05/14] mm: Ignore mempolicies when using ALLOC_NO_WATERMARK
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
The reserve is proportionally distributed over all !highmem zones
in the system. So we need to allow an emergency allocation access to
all zones. In order to do that we need to break out of any mempolicy
boundaries we might have.
In my opinion that does not break mempolicies as those are user
oriented and not system oriented. That is, system allocations are
not guaranteed to be within mempolicy boundaries. For instance IRQs
do not even have a mempolicy.
So breaking out of mempolicy boundaries for 'rare' emergency
allocations, which are always system allocations (as opposed to user
allocations), is ok.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 31e0eb2..17c8f93 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2140,6 +2140,13 @@ rebalance:
/* Allocate without watermarks if the context allows */
if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ /*
+ * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
+ * the allocation is high priority and these type of
+ * allocations are system rather than user orientated
+ */
+ zonelist = node_zonelist(numa_node_id(), gfp_mask);
+
page = __alloc_pages_high_priority(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
--
1.7.3.4
* [PATCH 06/14] net: Introduce sk_allocation() to allow addition of GFP flags depending on the individual socket
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Introduce sk_allocation(). This function allows socket-specific flags
to be injected into each socket-related allocation. It is only used on
allocation paths that may be required for writing pages back to
network storage.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/net/sock.h | 5 +++++
net/ipv4/tcp.c | 3 ++-
net/ipv4/tcp_output.c | 13 +++++++------
net/ipv6/tcp_ipv6.c | 12 +++++++++---
4 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..a4d5e61 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -586,6 +586,11 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}
+static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
+{
+ return gfp_mask;
+}
+
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..67f4a6d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -698,7 +698,8 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
/* The TCP header must be at least 32-bit aligned. */
size = ALIGN(size, 4);
- skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
+ skb = alloc_skb_fclone(size + sk->sk_prot->max_header,
+ sk_allocation(sk, gfp));
if (skb) {
if (sk_wmem_schedule(sk, skb->truesize)) {
/*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..87b98f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2324,7 +2324,7 @@ void tcp_send_fin(struct sock *sk)
/* Socket is locked, keep trying until memory is available. */
for (;;) {
skb = alloc_skb_fclone(MAX_TCP_HEADER,
- sk->sk_allocation);
+ sk_allocation(sk, GFP_KERNEL));
if (skb)
break;
yield();
@@ -2350,7 +2350,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority)
struct sk_buff *skb;
/* NOTE: No TCP options attached and we never retransmit this. */
- skb = alloc_skb(MAX_TCP_HEADER, priority);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, priority));
if (!skb) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTFAILED);
return;
@@ -2423,7 +2423,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
s_data_desired = cvp->s_data_desired;
- skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1, GFP_ATOMIC);
+ skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15 + s_data_desired, 1,
+ sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return NULL;
@@ -2719,7 +2720,7 @@ void tcp_send_ack(struct sock *sk)
* tcp_transmit_skb() will set the ownership to this
* sock.
*/
- buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ buff = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (buff == NULL) {
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
@@ -2734,7 +2735,7 @@ void tcp_send_ack(struct sock *sk)
/* Send it off, this clears delayed acks for us. */
TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
+ tcp_transmit_skb(sk, buff, 0, sk_allocation(sk, GFP_ATOMIC));
}
/* This routine sends a packet with an out of date sequence
@@ -2754,7 +2755,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
struct sk_buff *skb;
/* We don't queue it, tcp_transmit_skb() sets ownership. */
- skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
+ skb = alloc_skb(MAX_TCP_HEADER, sk_allocation(sk, GFP_ATOMIC));
if (skb == NULL)
return -1;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3c9fa61..e9f2c84 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -584,7 +584,8 @@ static int tcp_v6_md5_do_add(struct sock *sk, const struct in6_addr *peer,
} else {
/* reallocate new list if current one is full. */
if (!tp->md5sig_info) {
- tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info), GFP_ATOMIC);
+ tp->md5sig_info = kzalloc(sizeof(*tp->md5sig_info),
+ sk_allocation(sk, GFP_ATOMIC));
if (!tp->md5sig_info) {
kfree(newkey);
return -ENOMEM;
@@ -597,7 +598,8 @@ static int tcp_v6_md5_do_add(struct sock *sk, const struct in6_addr *peer,
}
if (tp->md5sig_info->alloced6 == tp->md5sig_info->entries6) {
keys = kmalloc((sizeof (tp->md5sig_info->keys6[0]) *
- (tp->md5sig_info->entries6 + 1)), GFP_ATOMIC);
+ (tp->md5sig_info->entries6 + 1)),
+ sk_allocation(sk, GFP_ATOMIC));
if (!keys) {
tcp_free_md5sig_pool();
@@ -721,7 +723,8 @@ static int tcp_v6_parse_md5_keys (struct sock *sk, char __user *optval,
struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_info *p;
- p = kzalloc(sizeof(struct tcp_md5sig_info), GFP_KERNEL);
+ p = kzalloc(sizeof(struct tcp_md5sig_info),
+ sk_allocation(sk, GFP_KERNEL));
if (!p)
return -ENOMEM;
@@ -1071,6 +1074,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
struct tcphdr *th = tcp_hdr(skb);
u32 seq = 0, ack_seq = 0;
struct tcp_md5sig_key *key = NULL;
+ gfp_t gfp_mask = GFP_ATOMIC;
if (th->rst)
return;
@@ -1082,6 +1086,8 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
if (sk)
key = tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr);
#endif
+ if (sk)
+ gfp_mask = sk_allocation(sk, gfp_mask);
if (th->ack)
seq = ntohl(th->ack_seq);
--
1.7.3.4
* [PATCH 07/14] netvm: Allow the use of __GFP_MEMALLOC by specific sockets
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Allow specific sockets to be tagged SOCK_MEMALLOC and use
__GFP_MEMALLOC for their allocations. These sockets will be able to go
below watermarks and allocate from the emergency reserve. Such sockets
are to be used to service the VM (in other words, to swap over). They
must be handled kernel side; exposing such a socket to userspace is a
bug.
There is a risk that the reserves will be depleted, so for now the
administrator is responsible for increasing min_free_kbytes as
necessary to prevent deadlock for their workloads.
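
How tagging a socket changes the mask returned by sk_allocation() can be
sketched in userspace like this. It is a model only: every name ending in
_model or _MODEL is hypothetical, and the bit values are made up for the
sketch rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_model_t;

/* Hypothetical bit values, chosen only for this sketch. */
#define GFP_ATOMIC_MODEL   0x0020u
#define GFP_MEMALLOC_MODEL 0x2000u

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	gfp_model_t sk_allocation;
	bool memalloc;		/* stands in for the SOCK_MEMALLOC flag bit */
};

/* Tag the socket and remember the extra GFP bit in its allocation mask. */
static void sk_set_memalloc_model(struct sock_model *sk)
{
	assert(sk);
	sk->memalloc = true;
	sk->sk_allocation |= GFP_MEMALLOC_MODEL;
}

/* Undo the tagging. */
static void sk_clear_memalloc_model(struct sock_model *sk)
{
	assert(sk);
	sk->memalloc = false;
	sk->sk_allocation &= ~GFP_MEMALLOC_MODEL;
}

/* Only the MEMALLOC bit of the socket's mask is injected into gfp_mask. */
static gfp_model_t sk_allocation_model(const struct sock_model *sk,
				       gfp_model_t gfp_mask)
{
	return gfp_mask | (sk->sk_allocation & GFP_MEMALLOC_MODEL);
}
```

Because only the MEMALLOC bit is masked in, the caller's choice of
GFP_ATOMIC or GFP_KERNEL is preserved and untagged sockets see no change.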
[a.p.zijlstra@chello.nl: Original patches]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/net/sock.h | 5 ++++-
net/core/sock.c | 22 ++++++++++++++++++++++
2 files changed, 26 insertions(+), 1 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index a4d5e61..583df68 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -554,6 +554,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_MEMALLOC, /* VM depends on this socket for swapping */
SOCK_TIMESTAMPING_TX_HARDWARE, /* %SOF_TIMESTAMPING_TX_HARDWARE */
SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SOF_TIMESTAMPING_TX_SOFTWARE */
SOCK_TIMESTAMPING_RX_HARDWARE, /* %SOF_TIMESTAMPING_RX_HARDWARE */
@@ -588,7 +589,7 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
- return gfp_mask;
+ return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}
static inline void sk_acceptq_removed(struct sock *sk)
@@ -718,6 +719,8 @@ extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
extern int sk_stream_error(struct sock *sk, int flags, int err);
extern void sk_stream_kill_queues(struct sock *sk);
+extern void sk_set_memalloc(struct sock *sk);
+extern void sk_clear_memalloc(struct sock *sk);
extern int sk_wait_data(struct sock *sk, long *timeo);
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..2e3b69b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -221,6 +221,28 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);
+/**
+ * sk_set_memalloc - sets %SOCK_MEMALLOC
+ * @sk: socket to set it on
+ *
+ * Set %SOCK_MEMALLOC on a socket for access to emergency reserves.
+ * It's the responsibility of the admin to adjust min_free_kbytes
+ * to meet the requirements
+ */
+void sk_set_memalloc(struct sock *sk)
+{
+ sock_set_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation |= __GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_set_memalloc);
+
+void sk_clear_memalloc(struct sock *sk)
+{
+ sock_reset_flag(sk, SOCK_MEMALLOC);
+ sk->sk_allocation &= ~__GFP_MEMALLOC;
+}
+EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+
#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
EXPORT_SYMBOL_GPL(net_cls_subsys_id);
--
1.7.3.4
* [PATCH 08/14] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Change the skb allocation API to indicate RX usage and use this to fall
back to the PFMEMALLOC reserve when needed. SKBs allocated from the
reserve are tagged in skb->pfmemalloc. If an SKB is allocated from
the reserve and the socket is later found to be unrelated to page
reclaim, the packet is dropped so that the memory remains available
for page reclaim. Network protocols are expected to recover from this
packet loss.
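
The two-step fallback and the later drop decision can be modelled in
userspace roughly as below. This is a sketch under stated assumptions, not
the kernel code: struct pools, pool_take(), alloc_reserve_model() and
deliver_model() are hypothetical names, and the allocator is reduced to
two counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical allocator state: objects left above and below the watermark. */
struct pools {
	int normal_free;
	int reserve_free;
};

/* Take one object; the reserve is only touched when explicitly allowed. */
static bool pool_take(struct pools *p, bool allow_reserve)
{
	if (p->normal_free > 0) {
		p->normal_free--;
		return true;
	}
	if (allow_reserve && p->reserve_free > 0) {
		p->reserve_free--;
		return true;
	}
	return false;
}

/*
 * Modelled on __kmalloc_reserve(): try without the reserve first, and
 * retry with it only if the caller is entitled. *pfmemalloc reports
 * whether the reserve path was taken.
 */
static bool alloc_reserve_model(struct pools *p, bool entitled,
				bool *pfmemalloc)
{
	bool used_reserve = false;
	bool ok;

	assert(p);
	ok = pool_take(p, false);	/* like adding __GFP_NOMEMALLOC */
	if (!ok && entitled) {
		used_reserve = true;
		ok = pool_take(p, true);
	}
	if (pfmemalloc)
		*pfmemalloc = used_reserve;
	return ok;
}

/* Modelled on the sk_filter() hunk: reserve-backed skbs only reach
 * SOCK_MEMALLOC sockets. */
static int deliver_model(bool skb_pfmemalloc, bool sock_memalloc)
{
	if (skb_pfmemalloc && !sock_memalloc)
		return -ENOMEM;
	return 0;
}
```

Trying the ordinary allocation first means the reserve is only consumed
under real memory pressure, and the returned flag lets the receive path
decide later whether the packet deserved it.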
[a.p.zijlstra@chello.nl: Ideas taken from various patches]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/gfp.h | 3 ++
include/linux/skbuff.h | 19 ++++++++--
include/net/sock.h | 6 +++
mm/internal.h | 3 --
net/core/filter.c | 8 ++++
net/core/skbuff.c | 95 ++++++++++++++++++++++++++++++++++++++++--------
net/core/sock.c | 4 ++
7 files changed, 116 insertions(+), 22 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 38acdc7..11588cdf 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -375,6 +375,9 @@ void drain_local_pages(void *dummy);
extern gfp_t gfp_allowed_mask;
+/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
+bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
+
extern void pm_restrict_gfp_mask(void);
extern void pm_restore_gfp_mask(void);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bd383c..639a372 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -413,6 +413,7 @@ struct sk_buff {
#ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8 ndisc_nodetype:2;
#endif
+ __u8 pfmemalloc:1;
__u8 ooo_okay:1;
kmemcheck_bitfield_end(flags2);
@@ -451,6 +452,15 @@ struct sk_buff {
#include <asm/system.h>
+#define SKB_ALLOC_FCLONE 0x01
+#define SKB_ALLOC_RX 0x02
+
+/* Returns true if the skb was allocated from PFMEMALLOC reserves */
+static inline bool skb_pfmemalloc(struct sk_buff *skb)
+{
+ return unlikely(skb->pfmemalloc);
+}
+
/*
* skb might have a dst pointer attached, refcounted or not.
* _skb_refdst low order bit is set if refcount was _not_ taken
@@ -508,7 +518,7 @@ extern void kfree_skb(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
@@ -518,7 +528,7 @@ static inline struct sk_buff *alloc_skb(unsigned int size,
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
+ return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
}
extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1550,7 +1560,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
gfp_t gfp_mask)
{
- struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+ struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
@@ -1607,7 +1618,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
*/
static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
{
- return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+ return alloc_pages_node(NUMA_NO_NODE, gfp_mask | __GFP_MEMALLOC, 0);
}
/**
diff --git a/include/net/sock.h b/include/net/sock.h
index 583df68..cf3f102 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -587,6 +587,12 @@ static inline int sock_flag(struct sock *sk, enum sock_flags flag)
return test_bit(flag, &sk->sk_flags);
}
+extern atomic_t memalloc_socks;
+static inline int sk_memalloc_socks(void)
+{
+ return atomic_read(&memalloc_socks);
+}
+
static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
{
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
diff --git a/mm/internal.h b/mm/internal.h
index a520f3b..d071d380 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -193,9 +193,6 @@ static inline struct page *mem_map_next(struct page *iter,
#define __paginginit __init
#endif
-/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
-bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
-
/* Memory initialisation debug and verification */
enum mminit_level {
MMINIT_WARNING,
diff --git a/net/core/filter.c b/net/core/filter.c
index 36f975f..4ccf6f4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -80,6 +80,14 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
int err;
struct sk_filter *filter;
+ /*
+ * If the skb was allocated from pfmemalloc reserves, only
+ * allow SOCK_MEMALLOC sockets to use it as this socket is
+ * helping free memory
+ */
+ if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
+ return -ENOMEM;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 387703f..4ce6d75 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -147,6 +147,43 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
BUG();
}
+
+/*
+ * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
+ * the caller if emergency pfmemalloc reserves are being used. If it is and
+ * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
+ * may be used. Otherwise, the packet data may be discarded until enough
+ * memory is free
+ */
+#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
+ __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
+void *__kmalloc_reserve(size_t size, gfp_t flags, int node, unsigned long ip,
+ bool *pfmemalloc)
+{
+ void *obj;
+ bool ret_pfmemalloc = false;
+
+ /*
+ * Try a regular allocation, when that fails and we're not entitled
+ * to the reserves, fail.
+ */
+ obj = kmalloc_node_track_caller(size,
+ flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ node);
+ if (obj || !(gfp_pfmemalloc_allowed(flags)))
+ goto out;
+
+ /* Try again but now we are using pfmemalloc reserves */
+ ret_pfmemalloc = true;
+ obj = kmalloc_node_track_caller(size, flags, node);
+
+out:
+ if (pfmemalloc)
+ *pfmemalloc = ret_pfmemalloc;
+
+ return obj;
+}
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -157,8 +194,10 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* __alloc_skb - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
- * @fclone: allocate from fclone cache instead of head cache
- * and allocate a cloned (child) skb
+ * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
+ * instead of head cache and allocate a cloned (child) skb.
+ * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ * allocations in case the data is required for writeback
* @node: numa node to allocate memory on
*
* Allocate a new &sk_buff. The returned buffer has no headroom and a
@@ -169,14 +208,19 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ bool pfmemalloc;
+
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
- cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
+ gfp_mask |= __GFP_MEMALLOC;
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
@@ -185,8 +229,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
prefetchw(skb);
size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
- gfp_mask, node);
+ data = kmalloc_reserve(size + sizeof(struct skb_shared_info),
+ gfp_mask, node, &pfmemalloc);
if (!data)
goto nodata;
prefetchw(data + size);
@@ -197,6 +241,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
* the tail pointer in struct sk_buff!
*/
memset(skb, 0, offsetof(struct sk_buff, tail));
+ skb->pfmemalloc = pfmemalloc;
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
@@ -213,7 +258,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(&shinfo->dataref, 1);
kmemcheck_annotate_variable(shinfo->destructor_arg);
- if (fclone) {
+ if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff *child = skb + 1;
atomic_t *fclone_ref = (atomic_t *) (child + 1);
@@ -223,6 +268,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
atomic_set(fclone_ref, 1);
child->fclone = SKB_FCLONE_UNAVAILABLE;
+ child->pfmemalloc = pfmemalloc;
}
out:
return skb;
@@ -251,7 +297,8 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
{
struct sk_buff *skb;
- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
+ SKB_ALLOC_RX, NUMA_NO_NODE);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -542,6 +589,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
#endif
+ new->pfmemalloc = old->pfmemalloc;
new->protocol = old->protocol;
new->mark = old->mark;
new->skb_iif = old->skb_iif;
@@ -701,6 +749,9 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
n->fclone = SKB_FCLONE_CLONE;
atomic_inc(fclone_ref);
} else {
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+
n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
if (!n)
return NULL;
@@ -737,6 +788,13 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
}
+static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
+{
+ if (skb_pfmemalloc((struct sk_buff *)skb))
+ return SKB_ALLOC_RX;
+ return 0;
+}
+
/**
* skb_copy - create private copy of an sk_buff
* @skb: buffer to copy
@@ -758,7 +816,8 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
{
int headerlen = skb_headroom(skb);
unsigned int size = (skb_end_pointer(skb) - skb->head) + skb->data_len;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n)
return NULL;
@@ -792,7 +851,8 @@ EXPORT_SYMBOL(skb_copy);
struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
{
unsigned int size = skb_end_pointer(skb) - skb->head;
- struct sk_buff *n = alloc_skb(size, gfp_mask);
+ struct sk_buff *n = __alloc_skb(size, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);
if (!n)
goto out;
@@ -889,7 +949,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
goto adjust_others;
}
- data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
+ if (skb_pfmemalloc(skb))
+ gfp_mask |= __GFP_MEMALLOC;
+ data = kmalloc_reserve(size + sizeof(struct skb_shared_info), gfp_mask,
+ NUMA_NO_NODE, NULL);
if (!data)
goto nodata;
@@ -997,8 +1060,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
/*
* Allocate the copy buffer
*/
- struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom,
- gfp_mask);
+ struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
+ gfp_mask, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);
int oldheadroom = skb_headroom(skb);
int head_copy_len, head_copy_off;
int off;
@@ -2659,8 +2723,9 @@ struct sk_buff *skb_segment(struct sk_buff *skb, u32 features)
skb_release_head_state(nskb);
__skb_push(nskb, doffset);
} else {
- nskb = alloc_skb(hsize + doffset + headroom,
- GFP_ATOMIC);
+ nskb = __alloc_skb(hsize + doffset + headroom,
+ GFP_ATOMIC, skb_alloc_rx_flag(skb),
+ NUMA_NO_NODE);
if (unlikely(!nskb))
goto err;
diff --git a/net/core/sock.c b/net/core/sock.c
index 2e3b69b..07e1292 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -221,6 +221,8 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
EXPORT_SYMBOL(sysctl_optmem_max);
+atomic_t memalloc_socks __read_mostly;
+
/**
* sk_set_memalloc - sets %SOCK_MEMALLOC
* @sk: socket to set it on
@@ -233,6 +235,7 @@ void sk_set_memalloc(struct sock *sk)
{
sock_set_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation |= __GFP_MEMALLOC;
+ atomic_inc(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_set_memalloc);
@@ -240,6 +243,7 @@ void sk_clear_memalloc(struct sock *sk)
{
sock_reset_flag(sk, SOCK_MEMALLOC);
sk->sk_allocation &= ~__GFP_MEMALLOC;
+ atomic_dec(&memalloc_socks);
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
--
1.7.3.4
* [PATCH 09/14] netvm: Propagate page->pfmemalloc to skb
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
The skb->pfmemalloc flag is set to true if and only if the PFMEMALLOC
reserves were used during the slab allocation of data in __alloc_skb.
If the packet is fragmented, it is possible that pages will be
allocated from the PFMEMALLOC reserve without propagating this
information to the skb. This patch propagates page->pfmemalloc from
pages allocated for fragments to the skb.
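
The propagation rule can be illustrated with a small userspace model.
The types and the function name below are hypothetical stand-ins for
struct page, struct sk_buff and skb_fill_page_desc(); only the "sticky"
behaviour of the flag is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FRAGS_MODEL 4	/* hypothetical small limit for the sketch */

/* Hypothetical stand-in for struct page. */
struct page_model {
	bool pfmemalloc;
};

/* Hypothetical stand-in for struct sk_buff and its fragment array. */
struct skb_model {
	bool pfmemalloc;
	const struct page_model *frags[MAX_FRAGS_MODEL];
};

/*
 * Attach a page as fragment i. If the page came from the reserves, the
 * skb is marked as well; the flag is never cleared by later fragments.
 */
static void fill_page_desc_model(struct skb_model *skb, int i,
				 const struct page_model *page)
{
	assert(skb && page);
	assert(i >= 0 && i < MAX_FRAGS_MODEL);

	if (page->pfmemalloc)
		skb->pfmemalloc = true;
	skb->frags[i] = page;
}
```

A single reserve-backed fragment is enough to mark the whole skb, and a
later ordinary fragment does not clear the mark.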
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/skbuff.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 639a372..c7da15f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1142,6 +1142,8 @@ static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
{
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ if (page->pfmemalloc)
+ skb->pfmemalloc = true;
frag->page = page;
frag->page_offset = off;
frag->size = size;
--
1.7.3.4
* [PATCH 10/14] netvm: Set PF_MEMALLOC as appropriate during SKB processing
From: Mel Gorman @ 2011-10-06 12:41 UTC
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
In order to make sure pfmemalloc packets receive all memory
needed to proceed, ensure processing of pfmemalloc SKBs happens
under PF_MEMALLOC. This is limited to a subset of protocols that
are expected to be used for writing to swap. Taps are not allowed to
use PF_MEMALLOC as these are expected to communicate with userspace
processes which could be paged out.
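
The receive-side policy described above (skip taps, restrict protocols)
can be sketched in userspace as follows. The function and enum names are
hypothetical, the EtherType constants are defined locally in host byte
order, and the model ignores everything else __netif_receive_skb() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EtherType values in host byte order, defined locally for the sketch. */
enum {
	ETHERTYPE_IP_MODEL    = 0x0800,
	ETHERTYPE_ARP_MODEL   = 0x0806,
	ETHERTYPE_8021Q_MODEL = 0x8100,
	ETHERTYPE_IPV6_MODEL  = 0x86DD
};

enum rx_verdict_model { RX_DELIVER_MODEL, RX_DROP_MODEL };

/* Protocols allowed to consume reserve-backed skbs, as listed in the patch. */
static bool pfmemalloc_protocol_model(uint16_t ethertype)
{
	switch (ethertype) {
	case ETHERTYPE_ARP_MODEL:
	case ETHERTYPE_IP_MODEL:
	case ETHERTYPE_IPV6_MODEL:
	case ETHERTYPE_8021Q_MODEL:
		return true;
	default:
		return false;
	}
}

/*
 * Decide what happens to a received skb. *run_taps reports whether the
 * packet would be shown to taps; reserve-backed skbs never are.
 */
static enum rx_verdict_model rx_verdict(bool skb_pfmemalloc,
					uint16_t ethertype, bool *run_taps)
{
	assert(run_taps);
	*run_taps = !skb_pfmemalloc;

	if (skb_pfmemalloc && !pfmemalloc_protocol_model(ethertype))
		return RX_DROP_MODEL;
	return RX_DELIVER_MODEL;
}
```

Ordinary skbs are unaffected by either check; only reserve-backed skbs
lose tap delivery and are dropped for protocols outside the list.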
[a.p.zijlstra@chello.nl: Ideas taken from various patches]
[jslaby@suse.cz: Lock imbalance fix]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/net/sock.h | 5 +++++
net/core/dev.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
net/core/sock.c | 16 ++++++++++++++++
3 files changed, 65 insertions(+), 4 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index cf3f102..09813fc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -669,8 +669,13 @@ static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *s
return 0;
}
+extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
+ if (skb_pfmemalloc(skb))
+ return __sk_backlog_rcv(sk, skb);
+
return sk->sk_backlog_rcv(sk, skb);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index b10ff0a..fd9deb1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3107,6 +3107,23 @@ void netdev_rx_handler_unregister(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
+/*
+ * Limit the use of PFMEMALLOC reserves to those protocols that implement
+ * the special handling of PFMEMALLOC skbs.
+ */
+static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
+{
+ switch (skb->protocol) {
+ case __constant_htons(ETH_P_ARP):
+ case __constant_htons(ETH_P_IP):
+ case __constant_htons(ETH_P_IPV6):
+ case __constant_htons(ETH_P_8021Q):
+ return true;
+ default:
+ return false;
+ }
+}
+
static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
@@ -3116,15 +3133,28 @@ static int __netif_receive_skb(struct sk_buff *skb)
bool deliver_exact = false;
int ret = NET_RX_DROP;
__be16 type;
+ unsigned long pflags = current->flags;
if (!netdev_tstamp_prequeue)
net_timestamp_check(skb);
trace_netif_receive_skb(skb);
+ /*
+ * PFMEMALLOC skbs are special, they should
+ * - be delivered to SOCK_MEMALLOC sockets only
+ * - stay away from userspace
+ * - have bounded memory usage
+ *
+ * Use PF_MEMALLOC as this saves us from propagating the allocation
+ * context down to all allocation sites.
+ */
+ if (skb_pfmemalloc(skb))
+ current->flags |= PF_MEMALLOC;
+
/* if we've gotten here through NAPI, check netpoll */
if (netpoll_receive_skb(skb))
- return NET_RX_DROP;
+ goto out;
if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
@@ -3155,6 +3185,9 @@ another_round:
}
#endif
+ if (skb_pfmemalloc(skb))
+ goto skip_taps;
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (!ptype->dev || ptype->dev == skb->dev) {
if (pt_prev)
@@ -3163,13 +3196,17 @@ another_round:
}
}
+skip_taps:
#ifdef CONFIG_NET_CLS_ACT
skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
if (!skb)
- goto out;
+ goto unlock;
ncls:
#endif
+ if (skb_pfmemalloc(skb) && !skb_pfmemalloc_protocol(skb))
+ goto drop;
+
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler) {
if (pt_prev) {
@@ -3178,7 +3215,7 @@ ncls:
}
switch (rx_handler(&skb)) {
case RX_HANDLER_CONSUMED:
- goto out;
+ goto unlock;
case RX_HANDLER_ANOTHER:
goto another_round;
case RX_HANDLER_EXACT:
@@ -3220,6 +3257,7 @@ ncls:
if (pt_prev) {
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
} else {
+drop:
atomic_long_inc(&skb->dev->rx_dropped);
kfree_skb(skb);
/* Jamal, now you will not able to escape explaining
@@ -3228,8 +3266,10 @@ ncls:
ret = NET_RX_DROP;
}
-out:
+unlock:
rcu_read_unlock();
+out:
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return ret;
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 07e1292..0f28a9b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -247,6 +247,22 @@ void sk_clear_memalloc(struct sock *sk)
}
EXPORT_SYMBOL_GPL(sk_clear_memalloc);
+int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ int ret;
+ unsigned long pflags = current->flags;
+
+ /* these should have been dropped before queueing */
+ BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
+
+ current->flags |= PF_MEMALLOC;
+ ret = sk->sk_backlog_rcv(sk, skb);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
+
+ return ret;
+}
+EXPORT_SYMBOL(__sk_backlog_rcv);
+
#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
int net_cls_subsys_id = -1;
EXPORT_SYMBOL_GPL(net_cls_subsys_id);
--
1.7.3.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
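For readers following the series, the save/set/restore handling of PF_MEMALLOC in the patch above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct sketch_task, sketch_receive() and the SKETCH_* constants are hypothetical stand-ins, not kernel definitions. Only the body of sketch_restore_flags() mirrors tsk_restore_flags() from earlier in the series. The point it tries to show is why the early "return NET_RX_DROP" became "goto out": every exit has to pass through the restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel definitions; the values are illustrative. */
#define SKETCH_PF_MEMALLOC 0x00000800UL
#define SKETCH_RX_SUCCESS  0
#define SKETCH_RX_DROP     1

struct sketch_task {
	unsigned long flags;
};

/* Same logic as tsk_restore_flags(): only the bits in mask are restored. */
static void sketch_restore_flags(struct sketch_task *p,
				 unsigned long pflags, unsigned long mask)
{
	p->flags &= ~mask;
	p->flags |= pflags & mask;
}

/*
 * Mimics the shape of the patched receive path: save the flags, raise the
 * reserve flag for pfmemalloc packets, and funnel every exit through "out".
 * *seen_flags records what the task flags were while the packet was handled.
 */
static int sketch_receive(struct sketch_task *cur, bool pfmemalloc,
			  bool drop_early, unsigned long *seen_flags)
{
	unsigned long pflags;
	int ret = SKETCH_RX_DROP;

	assert(cur != NULL && seen_flags != NULL);
	pflags = cur->flags;

	if (pfmemalloc)
		cur->flags |= SKETCH_PF_MEMALLOC;
	*seen_flags = cur->flags;

	if (drop_early)
		goto out;	/* a plain return here would leak the flag */

	ret = SKETCH_RX_SUCCESS;
out:
	sketch_restore_flags(cur, pflags, SKETCH_PF_MEMALLOC);
	return ret;
}
```

Because the restore is masked, a task that already had the flag set before the call keeps it afterwards, and a task that did not have it never ends up with it.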
* [PATCH 11/14] mm: Micro-optimise slab to avoid a function call
From: Mel Gorman @ 2011-10-06 12:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Getting and putting objects in SLAB currently requires a function call
but the bulk of the work is related to PFMEMALLOC reserves which are
only consumed when network-backed storage is critical. Use an inline
function to determine if the function call is required.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/slab.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/mm/slab.c b/mm/slab.c
index 25f69ec..31276f9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -117,6 +117,8 @@
#include <linux/memory.h>
#include <linux/prefetch.h>
+#include <net/sock.h>
+
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
@@ -985,7 +987,7 @@ static void check_ac_pfmemalloc(struct kmem_cache *cachep,
ac->pfmemalloc = false;
}
-static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
gfp_t flags, bool force_refill)
{
int i;
@@ -1032,7 +1034,20 @@ static void *ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
return objp;
}
-static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+static inline void *ac_get_obj(struct kmem_cache *cachep,
+ struct array_cache *ac, gfp_t flags, bool force_refill)
+{
+ void *objp;
+
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_get_obj(cachep, ac, flags, force_refill);
+ else
+ objp = ac->entry[--ac->avail];
+
+ return objp;
+}
+
+static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
void *objp)
{
struct slab *slabp;
@@ -1045,6 +1060,15 @@ static void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
set_obj_pfmemalloc(&objp);
}
+ return objp;
+}
+
+static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
+ void *objp)
+{
+ if (unlikely(sk_memalloc_socks()))
+ objp = __ac_put_obj(cachep, ac, objp);
+
ac->entry[ac->avail++] = objp;
}
--
1.7.3.4
^ permalink raw reply related
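The inline wrapper idea in the SLAB patch above can be sketched outside the kernel roughly as follows. This is a simplified model, not the mm/slab.c code: sketch_memalloc_socks_active is a hypothetical boolean standing in for sk_memalloc_socks(), and the slow path here only counts calls, whereas the real __ac_get_obj() also inspects pfmemalloc state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_AC_SIZE 8

struct sketch_array_cache {
	unsigned int avail;
	void *entry[SKETCH_AC_SIZE];
};

/* Hypothetical stand-in for sk_memalloc_socks(). */
static bool sketch_memalloc_socks_active;

/* Counts how often the out-of-line path was taken. */
static unsigned long sketch_slow_calls;

/* Out-of-line slow path; only reached when memalloc sockets exist. */
static void *sketch_slow_get(struct sketch_array_cache *ac)
{
	sketch_slow_calls++;
	return ac->entry[--ac->avail];
}

/* Inline fast path: a single test, then a plain pop from the array. */
static inline void *sketch_get(struct sketch_array_cache *ac)
{
	assert(ac->avail > 0);
	if (sketch_memalloc_socks_active)
		return sketch_slow_get(ac);
	return ac->entry[--ac->avail];
}

/* Inline push; the caller must leave room in the array. */
static inline void sketch_put(struct sketch_array_cache *ac, void *objp)
{
	assert(ac->avail < SKETCH_AC_SIZE);
	ac->entry[ac->avail++] = objp;
}
```

In the common case, with no SOCK_MEMALLOC socket in use, the function call is skipped entirely and the caller pays only for the flag test.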
* [PATCH 12/14] nbd: Set SOCK_MEMALLOC for access to PFMEMALLOC reserves
From: Mel Gorman @ 2011-10-06 12:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC
reserves so pages backed by NBD, particularly if swap related, can
be cleaned to prevent the machine being deadlocked. It is still
possible that the PFMEMALLOC reserves get depleted resulting in
deadlock but this can be resolved by the administrator by increasing
min_free_kbytes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
drivers/block/nbd.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index f533f33..ca7cd81 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -156,6 +156,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
struct msghdr msg;
struct kvec iov;
sigset_t blocked, oldset;
+ unsigned long pflags = current->flags;
if (unlikely(!sock)) {
printk(KERN_ERR "%s: Attempted %s on closed socket in sock_xmit\n",
@@ -168,8 +169,9 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
siginitsetinv(&blocked, sigmask(SIGKILL));
sigprocmask(SIG_SETMASK, &blocked, &oldset);
+ current->flags |= PF_MEMALLOC;
do {
- sock->sk->sk_allocation = GFP_NOIO;
+ sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
iov.iov_base = buf;
iov.iov_len = size;
msg.msg_name = NULL;
@@ -215,6 +217,7 @@ static int sock_xmit(struct nbd_device *lo, int send, void *buf, int size,
} while (size > 0);
sigprocmask(SIG_SETMASK, &oldset, NULL);
+ tsk_restore_flags(current, pflags, PF_MEMALLOC);
return result;
}
@@ -405,6 +408,8 @@ static int nbd_do_it(struct nbd_device *lo)
BUG_ON(lo->magic != LO_MAGIC);
+ sk_set_memalloc(lo->sock->sk);
+
lo->pid = current->pid;
ret = sysfs_create_file(&disk_to_dev(lo->disk)->kobj, &pid_attr.attr);
if (ret) {
--
1.7.3.4
^ permalink raw reply related
* [PATCH 13/14] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: Mel Gorman @ 2011-10-06 12:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance. This patch will throttle direct
reclaimers if half the PF_MEMALLOC reserves are in use as the system
is at risk of hanging.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index be1ac8d..d502217 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -639,6 +639,7 @@ typedef struct pglist_data {
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
+ wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17c8f93..d0685b9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4302,6 +4302,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b55699c..ca5ca02 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2219,6 +2219,49 @@ out:
return 0;
}
+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat, int high_zoneidx)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+
+ for (i = 0; i <= high_zoneidx; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ return (free_pages > pfmemalloc_reserve / 2) ? true : false;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ DEFINE_WAIT(wait);
+
+ /* Kernel threads such as kjournald should not be throttled */
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ if (pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx))
+ return;
+
+ /* Throttle */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2237,6 +2280,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};
+ throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+ /*
+ * Do not enter reclaim if fatal signal is pending. 1 is returned so
+ * that the page allocator does not consider triggering OOM
+ */
+ if (fatal_signal_pending(current))
+ return 1;
+
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
@@ -2609,6 +2661,12 @@ loop_again:
}
}
+
+ /* Wake throttled direct reclaimers if low watermark is met */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat, MAX_NR_ZONES - 1))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
@@ -2728,6 +2786,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
/*
+ * There is a potential race between when kswapd checks its
+ * watermarks and a process gets throttled. There is also
+ * a potential race if processes get throttled, kswapd wakes,
+ * a large process exits thereby balancing the zones that causes
+ * kswapd to miss a wakeup. If kswapd is going to sleep, no
+ * process should be sleeping on pfmemalloc_wait so wake them
+ * now if necessary. If necessary, processes will wake kswapd
+ * and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait))
+ wake_up(&pgdat->pfmemalloc_wait);
+
+ /*
* vmstat counters are not perfectly accurate and the estimated
* value for counters such as NR_FREE_PAGES can deviate from the
* true value by nr_online_cpus * threshold. To avoid the zone
--
1.7.3.4
^ permalink raw reply related
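The "half the reserves" check from the throttling patch above can be worked through in userspace with made-up numbers. In this sketch struct sketch_zone and its fields are hypothetical; only the arithmetic (sum the min watermarks and the free pages up to high_zoneidx, then compare free pages against half of the reserve) follows pfmemalloc_watermark_ok().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the two per-zone values the check reads. */
struct sketch_zone {
	unsigned long min_wmark_pages;
	unsigned long free_pages;
};

/* Returns true when more than half of the summed min watermark is still free. */
static bool sketch_pfmemalloc_watermark_ok(const struct sketch_zone *zones,
					   size_t nr_zones, size_t high_zoneidx)
{
	unsigned long reserve = 0;
	unsigned long free_pages = 0;
	size_t i;

	assert(zones != NULL && high_zoneidx < nr_zones);
	for (i = 0; i <= high_zoneidx; i++) {
		reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].free_pages;
	}

	return free_pages > reserve / 2;
}
```

With two zones whose min watermarks are 100 and 300 pages, the summed reserve is 400, so a direct reclaimer would be throttled once the combined free pages drop to 200 or fewer.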
* [PATCH 14/14] mm: Account for the number of times direct reclaimers get throttled
From: Mel Gorman @ 2011-10-06 12:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Neil Brown,
Peter Zijlstra, Mel Gorman
In-Reply-To: <1317904910-14095-1-git-send-email-mgorman@suse.de>
Under significant pressure when writing back to network-backed storage,
direct reclaimers may get throttled. This is expected to be a
short-lived event and the processes get woken up again but processes do
get stalled. This patch counts how many times such stalling occurs. It's
up to the administrator whether to reduce these stalls by increasing
min_free_kbytes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/vm_event_item.h | 1 +
mm/vmscan.c | 1 +
mm/vmstat.c | 1 +
3 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03b90cdc..652e5f3 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -29,6 +29,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
+ PGSCAN_DIRECT_THROTTLE,
#ifdef CONFIG_NUMA
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca5ca02..6cb1aee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2258,6 +2258,7 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
return;
/* Throttle */
+ count_vm_event(PGSCAN_DIRECT_THROTTLE);
wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx));
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d52b13d..96070b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -740,6 +740,7 @@ const char * const vmstat_text[] = {
TEXTS_FOR_ZONES("pgsteal")
TEXTS_FOR_ZONES("pgscan_kswapd")
TEXTS_FOR_ZONES("pgscan_direct")
+ "pgscan_direct_throttle",
#ifdef CONFIG_NUMA
"zone_reclaim_failed",
--
1.7.3.4
^ permalink raw reply related
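On a kernel carrying the accounting patch above, the new counter would be expected to appear as a "pgscan_direct_throttle" line in /proc/vmstat, next to the other vmstat_text entries. A minimal sketch of picking that value out of vmstat-style text follows. sketch_vmstat_lookup() is a hypothetical helper and the sample text in the check is invented; reading the file itself is left out so the parsing can be tried on any machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan "name value" lines and store the value for an exact name match.
 * Returns 0 on success and -1 if the name is absent or has no number.
 */
static int sketch_vmstat_lookup(const char *text, const char *name,
				unsigned long long *value)
{
	const char *line = text;
	size_t name_len;

	assert(text != NULL && name != NULL && value != NULL);
	name_len = strlen(name);

	while (*line) {
		const char *eol = strchr(line, '\n');

		/* Require a space after the name so prefixes do not match. */
		if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ') {
			if (sscanf(line + name_len, "%llu", value) == 1)
				return 0;
			return -1;
		}
		if (!eol)
			break;
		line = eol + 1;
	}

	return -1;
}
```

Sampling the value twice and taking the difference gives the number of throttling events in the interval, which is the figure an administrator would weigh when deciding whether to raise min_free_kbytes.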
* [PATCH 0/3] iproute2: various usage and doc fixes
From: Petr Sabata @ 2011-10-06 12:45 UTC (permalink / raw)
To: netdev; +Cc: Petr Sabata
Petr Sabata (3):
iproute2: ss - fix missing parameters
iproute2: lnstat - fix typos
iproute2: arpd - fix usage and manpage options
man/man8/arpd.8 | 4 ++--
man/man8/ss.8 | 4 ++--
misc/arpd.c | 2 +-
misc/lnstat.c | 4 ++--
misc/ss.c | 6 ++++--
5 files changed, 11 insertions(+), 9 deletions(-)
--
1.7.6.4
^ permalink raw reply
* [PATCH 2/3] iproute2: lnstat - fix typos
From: Petr Sabata @ 2011-10-06 12:45 UTC (permalink / raw)
To: netdev; +Cc: Petr Sabata
In-Reply-To: <1317905134-23175-1-git-send-email-contyk@redhat.com>
Signed-off-by: Petr Sabata <contyk@redhat.com>
---
misc/lnstat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/misc/lnstat.c b/misc/lnstat.c
index 32ab6a4..bd19cc1 100644
--- a/misc/lnstat.c
+++ b/misc/lnstat.c
@@ -45,7 +45,7 @@ static struct option opts[] = {
{ "file", 1, NULL, 'f' },
{ "help", 0, NULL, 'h' },
{ "interval", 1, NULL, 'i' },
- { "key", 1, NULL, 'k' },
+ { "keys", 1, NULL, 'k' },
{ "subject", 1, NULL, 's' },
{ "width", 1, NULL, 'w' },
};
@@ -61,7 +61,7 @@ static int usage(char *name, int exit_code)
fprintf(stderr, "\t-V --version\t\tPrint Version of Program\n");
fprintf(stderr, "\t-c --count <count>\t"
"Print <count> number of intervals\n");
- fprintf(stderr, "\t-d --dumpt\t\t"
+ fprintf(stderr, "\t-d --dump\t\t"
"Dump list of available files/keys\n");
fprintf(stderr, "\t-f --file <file>\tStatistics file to use\n");
fprintf(stderr, "\t-h --help\t\tThis help message\n");
--
1.7.6.4
^ permalink raw reply related
* [PATCH 1/3] iproute2: ss - fix missing parameters
From: Petr Sabata @ 2011-10-06 12:45 UTC (permalink / raw)
To: netdev; +Cc: Petr Sabata
In-Reply-To: <1317905134-23175-1-git-send-email-contyk@redhat.com>
Signed-off-by: Petr Sabata <contyk@redhat.com>
---
man/man8/ss.8 | 4 ++--
misc/ss.c | 6 ++++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/man/man8/ss.8 b/man/man8/ss.8
index b309df2..0b9a8c4 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -85,12 +85,12 @@ Display Unix domain sockets (alias for -f unix).
Display sockets of type FAMILY.
Currently the following families are supported: unix, inet, inet6, link, netlink.
.TP
-.B \-A QUERY, \-\-query=QUERY
+.B \-A QUERY, \-\-query=QUERY, \-\-socket=QUERY
List of socket tables to dump, separated by commas. The following identifiers
are understood: all, inet, tcp, udp, raw, unix, packet, netlink, unix_dgram,
unix_stream, packet_raw, packet_dgram.
.TP
-.B \-D FILE
+.B \-D FILE, \-\-diag=FILE
Do not display anything, just dump raw information about TCP sockets to FILE after applying filters. If FILE is - stdout is used.
.TP
.B \-F FILE, \-\-filter=FILE
diff --git a/misc/ss.c b/misc/ss.c
index 1597ff9..253f8d3 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2416,9 +2416,10 @@ static void _usage(FILE *dest)
" -x, --unix display only Unix domain sockets\n"
" -f, --family=FAMILY display sockets of type FAMILY\n"
"\n"
-" -A, --query=QUERY\n"
+" -A, --query=QUERY, --socket=QUERY\n"
" QUERY := {all|inet|tcp|udp|raw|unix|packet|netlink}[,QUERY]\n"
"\n"
+" -D, --diag=FILE Dump raw information about TCP sockets to FILE\n"
" -F, --filter=FILE read filter information from FILE\n"
" FILTER := [ state TCP-STATE ] [ EXPRESSION ]\n"
);
@@ -2486,8 +2487,9 @@ static const struct option long_opts[] = {
{ "packet", 0, 0, '0' },
{ "family", 1, 0, 'f' },
{ "socket", 1, 0, 'A' },
+ { "query", 1, 0, 'A' },
{ "summary", 0, 0, 's' },
- { "diag", 0, 0, 'D' },
+ { "diag", 1, 0, 'D' },
{ "filter", 1, 0, 'F' },
{ "version", 0, 0, 'V' },
{ "help", 0, 0, 'h' },
--
1.7.6.4
^ permalink raw reply related
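The effect of the long_opts change in the ss patch above can be tried in isolation with getopt_long(). The sketch below is not the ss code: struct sketch_opts and sketch_parse() are hypothetical, the table contains only the three entries under discussion, and it assumes a libc that provides getopt_long() (glibc or a BSD libc). It shows two long names mapping to the same short option, and "diag" taking a required argument (has_arg set to 1).

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

struct sketch_opts {
	const char *query;	/* argument of -A, --query or --socket */
	const char *diag_file;	/* argument of -D or --diag */
	int query_seen;		/* how many times any -A spelling was given */
};

/* Two long names share 'A'; "diag" now requires an argument. */
static const struct option sketch_long_opts[] = {
	{ "socket", 1, 0, 'A' },
	{ "query",  1, 0, 'A' },
	{ "diag",   1, 0, 'D' },
	{ 0, 0, 0, 0 }		/* terminator required by getopt_long() */
};

/* Returns 0 on success and -1 on an unknown option or a missing argument. */
static int sketch_parse(int argc, char **argv, struct sketch_opts *out)
{
	int ch;

	assert(out != NULL);
	out->query = NULL;
	out->diag_file = NULL;
	out->query_seen = 0;
	optind = 1;		/* start scanning at the first argument */

	while ((ch = getopt_long(argc, argv, "A:D:", sketch_long_opts, NULL)) != -1) {
		switch (ch) {
		case 'A':
			out->query = optarg;
			out->query_seen++;
			break;
		case 'D':
			out->diag_file = optarg;
			break;
		default:
			return -1;
		}
	}

	return 0;
}
```

With has_arg left at 0, as in the old "diag" entry, getopt_long() would reject "--diag=FILE" because the option is declared as taking no argument.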
* [PATCH 3/3] iproute2: arpd - fix usage and manpage options
From: Petr Sabata @ 2011-10-06 12:45 UTC (permalink / raw)
To: netdev; +Cc: Petr Sabata
In-Reply-To: <1317905134-23175-1-git-send-email-contyk@redhat.com>
Signed-off-by: Petr Sabata <contyk@redhat.com>
---
man/man8/arpd.8 | 4 ++--
misc/arpd.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/man/man8/arpd.8 b/man/man8/arpd.8
index d172600..37b6ba4 100644
--- a/man/man8/arpd.8
+++ b/man/man8/arpd.8
@@ -4,7 +4,7 @@
arpd \- userspace arp daemon.
.SH SYNOPSIS
-Usage: arpd [ -lk ] [ -a N ] [ -b dbase ] [ -f file ] [ interfaces ]
+Usage: arpd [ -lkh? ] [ -a N ] [ -b dbase ] [ -B number ] [ -f file ] [ -n time ] [ -R rate ] [ interfaces ]
.SH DESCRIPTION
The
@@ -34,7 +34,7 @@ Suppress sending broadcast queries by kernel. It takes sense together with optio
-n <TIME>
Timeout of negative cache. When resolution fails arpd suppresses further attempts to resolve for this period. It makes sense only together with option -k This timeout should not be too much longer than boot time of a typical host not supporting gratuitous ARP. Default value is 60 seconds.
.TP
--r <RATE>
+-R <RATE>
Maximal steady rate of broadcasts sent by arpd in packets per second. Default value is 1.
.TP
-B <NUMBER>
diff --git a/misc/arpd.c b/misc/arpd.c
index 128c49d..124d3fb 100644
--- a/misc/arpd.c
+++ b/misc/arpd.c
@@ -94,7 +94,7 @@ int broadcast_burst = 3000;
void usage(void)
{
fprintf(stderr,
-"Usage: arpd [ -lk ] [ -a N ] [ -b dbase ] [ -f file ] [ interfaces ]\n");
+"Usage: arpd [ -lkh? ] [ -a N ] [ -b dbase ] [ -B number ] [ -f file ] [ -n time ] [ -R rate ] [ interfaces ]\n");
exit(1);
}
--
1.7.6.4
^ permalink raw reply related
* [net-next] cs89x0: Move the driver into the Cirrus dir
From: Jeff Kirsher @ 2011-10-06 13:14 UTC (permalink / raw)
To: davem; +Cc: Jeff Kirsher, netdev, gospo, sassmann, Russell Nelson,
Andrew Morton
The cs89x0 driver was initially placed in apple/ when it
should have been placed in cirrus/. This resolves the
issue by moving the driver and fixing up the respective
Kconfig(s) and Makefile(s).
Thanks to Sascha for reporting the issue.
CC: Russell Nelson <nelson@crynwr.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/apple/Kconfig | 22 +---------------------
drivers/net/ethernet/apple/Makefile | 1 -
drivers/net/ethernet/cirrus/Kconfig | 22 +++++++++++++++++++++-
drivers/net/ethernet/cirrus/Makefile | 1 +
drivers/net/ethernet/{apple => cirrus}/cs89x0.c | 0
drivers/net/ethernet/{apple => cirrus}/cs89x0.h | 0
6 files changed, 23 insertions(+), 23 deletions(-)
rename drivers/net/ethernet/{apple => cirrus}/cs89x0.c (100%)
rename drivers/net/ethernet/{apple => cirrus}/cs89x0.h (100%)
diff --git a/drivers/net/ethernet/apple/Kconfig b/drivers/net/ethernet/apple/Kconfig
index 59d5c26..90ad2c1 100644
--- a/drivers/net/ethernet/apple/Kconfig
+++ b/drivers/net/ethernet/apple/Kconfig
@@ -5,8 +5,7 @@
config NET_VENDOR_APPLE
bool "Apple devices"
default y
- depends on (PPC_PMAC && PPC32) || MAC || ISA || EISA || MACH_IXDP2351 \
- || ARCH_IXDP2X01 || MACH_MX31ADS || MACH_QQ2440
+ depends on (PPC_PMAC && PPC32) || MACE || MAC
---help---
If you have a network (Ethernet) card belonging to this class, say Y
and read the Ethernet-HOWTO, available from
@@ -75,23 +74,4 @@ config MACMACE
say Y and read the Ethernet-HOWTO, available from
<http://www.tldp.org/docs.html#howto>.
-config CS89x0
- tristate "CS89x0 support"
- depends on (ISA || EISA || MACH_IXDP2351 \
- || ARCH_IXDP2X01 || MACH_MX31ADS || MACH_QQ2440)
- ---help---
- Support for CS89x0 chipset based Ethernet cards. If you have a
- network (Ethernet) card of this type, say Y and read the
- Ethernet-HOWTO, available from
- <http://www.tldp.org/docs.html#howto> as well as
- <file:Documentation/networking/cs89x0.txt>.
-
- To compile this driver as a module, choose M here. The module
- will be called cs89x0.
-
-config CS89x0_NONISA_IRQ
- def_bool y
- depends on CS89x0 != n
- depends on MACH_IXDP2351 || ARCH_IXDP2X01 || MACH_MX31ADS || MACH_QQ2440
-
endif # NET_VENDOR_APPLE
diff --git a/drivers/net/ethernet/apple/Makefile b/drivers/net/ethernet/apple/Makefile
index 9d30086..0d3a591 100644
--- a/drivers/net/ethernet/apple/Makefile
+++ b/drivers/net/ethernet/apple/Makefile
@@ -5,5 +5,4 @@
obj-$(CONFIG_MACE) += mace.o
obj-$(CONFIG_BMAC) += bmac.o
obj-$(CONFIG_MAC89x0) += mac89x0.o
-obj-$(CONFIG_CS89x0) += cs89x0.o
obj-$(CONFIG_MACMACE) += macmace.o
diff --git a/drivers/net/ethernet/cirrus/Kconfig b/drivers/net/ethernet/cirrus/Kconfig
index e9386ef..6cbb81c 100644
--- a/drivers/net/ethernet/cirrus/Kconfig
+++ b/drivers/net/ethernet/cirrus/Kconfig
@@ -5,7 +5,8 @@
config NET_VENDOR_CIRRUS
bool "Cirrus devices"
default y
- depends on ARM && ARCH_EP93XX
+ depends on ISA || EISA || MACH_IXDP2351 || ARCH_IXDP2X01 \
+ || MACH_MX31ADS || MACH_QQ2440 || (ARM && ARCH_EP93XX)
---help---
If you have a network (Ethernet) card belonging to this class, say Y
and read the Ethernet-HOWTO, available from
@@ -18,6 +19,25 @@ config NET_VENDOR_CIRRUS
if NET_VENDOR_CIRRUS
+config CS89x0
+ tristate "CS89x0 support"
+ depends on (ISA || EISA || MACH_IXDP2351 \
+ || ARCH_IXDP2X01 || MACH_MX31ADS || MACH_QQ2440)
+ ---help---
+ Support for CS89x0 chipset based Ethernet cards. If you have a
+ network (Ethernet) card of this type, say Y and read the
+ Ethernet-HOWTO, available from
+ <http://www.tldp.org/docs.html#howto> as well as
+ <file:Documentation/networking/cs89x0.txt>.
+
+ To compile this driver as a module, choose M here. The module
+ will be called cs89x0.
+
+config CS89x0_NONISA_IRQ
+ def_bool y
+ depends on CS89x0 != n
+ depends on MACH_IXDP2351 || ARCH_IXDP2X01 || MACH_MX31ADS || MACH_QQ2440
+
config EP93XX_ETH
tristate "EP93xx Ethernet support"
depends on ARM && ARCH_EP93XX
diff --git a/drivers/net/ethernet/cirrus/Makefile b/drivers/net/ethernet/cirrus/Makefile
index 9905ea2..14bd77e 100644
--- a/drivers/net/ethernet/cirrus/Makefile
+++ b/drivers/net/ethernet/cirrus/Makefile
@@ -2,4 +2,5 @@
# Makefile for the Cirrus network device drivers.
#
+obj-$(CONFIG_CS89x0) += cs89x0.o
obj-$(CONFIG_EP93XX_ETH) += ep93xx_eth.o
diff --git a/drivers/net/ethernet/apple/cs89x0.c b/drivers/net/ethernet/cirrus/cs89x0.c
similarity index 100%
rename from drivers/net/ethernet/apple/cs89x0.c
rename to drivers/net/ethernet/cirrus/cs89x0.c
diff --git a/drivers/net/ethernet/apple/cs89x0.h b/drivers/net/ethernet/cirrus/cs89x0.h
similarity index 100%
rename from drivers/net/ethernet/apple/cs89x0.h
rename to drivers/net/ethernet/cirrus/cs89x0.h
--
1.7.6.4
^ permalink raw reply related
* Re: loopback IP alias breaks tftp?
From: Josh Boyer @ 2011-10-06 13:23 UTC (permalink / raw)
To: Julian Anastasov, Joel Sing; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1110060009020.1959@ja.ssi.bg>
On Thu, Oct 06, 2011 at 12:18:44AM +0300, Julian Anastasov wrote:
>
> Hello,
>
> On Wed, 5 Oct 2011, Josh Boyer wrote:
>
> > Hi All,
> >
> > We've had a report [1] of a change in behavior when trying to use an IP
> > alias to tftp from a loopback device. Apparently the steps outlined in
> > the bug worked in 2.6.35, and broke somewhere before 2.6.38.6.
> >
> > I can confirm the steps fail on a 3.0 based kernel and I'm trying to do
> > a git bisect to find the commit involved, but I thought I would send
> > this along to see if anyone might have an idea. (Also, I'm not really
> > sure how valid of a usecase this was to begin with.)
>
> What about commit 9fc3bbb4a752f108cf096d96640f3b548bbbce6c ?
>
> ipv4/route.c: respect prefsrc for local routes
>
> http://marc.info/?t=129412232500001&r=1&w=2
>
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=739534
Yep. That is exactly what my git bisect said too.
So now we have a change in behavior since that commit for the use case
described in the bug above, but I'm unsure if that use case was ever
really valid or if the commit had unintentional side effects.
Joel (or anyone else) can you take a look and comment?
josh
^ permalink raw reply
* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-06 13:57 UTC (permalink / raw)
To: Thadeu Lima de Souza Cascardo
Cc: Benjamin Herrenschmidt, Yevgeny Petrilin, netdev@vger.kernel.org,
Eli Cohen, linuxppc-dev
In-Reply-To: <20111005081502.GB2681@mtldesk30>
On Wed, Oct 05, 2011 at 10:15:02AM +0200, Eli Cohen wrote:
How about this patch - can you give it a try?
From dee60547aa9e35a02835451d9e694cd80dd3072f Mon Sep 17 00:00:00 2001
From: Eli Cohen <eli@mellanox.co.il>
Date: Thu, 6 Oct 2011 15:50:02 +0200
Subject: [PATCH] mlx4_en: Fix blue flame on powerpc
The source buffer used for copying into the blue flame register is already in
big endian. However, when copying to the device on powerpc, the endianness is
swapped so the data reaches the device in little endian, which is wrong. On
x86-based platforms no swapping occurs, so it reaches the device with the correct
endianness. Fix this by calling le32_to_cpu() on the buffer. On LE systems there
is no change; on BE there will be a swap.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
---
drivers/net/mlx4/en_tx.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
index 16337fb..3743acc 100644
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -601,6 +601,16 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt)
{
+ int i;
+ __le32 *psrc = (__le32 *)src;
+
+ /*
+ * the buffer is already in big endian. For little endian machines that's
+ * fine. For big endian machines we must swap since the chipset swaps again
+ */
+ for (i = 0; i < bytecnt / 4; ++i)
+ psrc[i] = le32_to_cpu(psrc[i]);
+
__iowrite64_copy(dst, src, bytecnt / 8);
}
--
1.7.7.rc0.70.g82660
> On Tue, Oct 04, 2011 at 05:26:20PM -0300, Thadeu Lima de Souza Cascardo wrote:
>
> I believe we have an endianness problem here. The source buffer is in
> big endian - in x86 archs, it will reach the pci device unswapped since
> both x86 and pci are little endian. In ppc, it will be swapped by the
> chipset so it will reach the device in little endian which is wrong.
> So, in mlx4_bf_copy, you could loop over the buffer and swap32 all
> the dwords before the call to __iowrite64_copy. Of course we should
> fix this in an arch independent manner. Let me know if this works for
> you.
>
> > On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote:
> > > On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote:
> > >
> > > .../...
> > >
> > > > > Can you also send me the output of ethtool -i?
> > > > > It seems that there is a problem with write combining on Power processors, we will check this issue.
> > > > >
> > > > > Yevgeny
> > > >
> > > > Hello, Yevgeny.
> > > >
> > > > You will find the output of ethtool -i below.
> > > >
> > > > I am copying Ben and powerpc list, in case this is an issue with Power
> > > > processors. They can provide us some more insight into this.
> > >
> > > May I get some background please ? :-)
> > >
> > > I'm not aware of a specific issue with write combining but I'd need to
> > > know more about what you are doing and the code to do it to comment on
> > > whether it should work or not.
> > >
> > > Cheers,
> > > Ben.
> > >
> > >
> >
> > Hello, Ben.
> >
> > Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has
> > added blue frame support, that does not require writing to the device
> > memory to indicate a new packet (the doorbell register as it is called).
> >
> > Well, the ring is getting full with no interrupt or packets transmitted.
> > I simply added a write to the doorbell register and it works for me.
> > Yevgeny says this is not the right fix, claiming there is a problem with
> > write combining on POWER. The code uses memory barriers, so I don't know
> > why there is any problem.
> >
> > I am posting the code here to show better what the situation is.
> > Yevgeny can tell more about the device and the driver.
> >
> > The code below is the driver as of now, including a diff with what I
> > changed and had resulted OK for me. Before the blue frame support, the
> > only code executed was the else part. I can't tell much what the device
> > should be seeing and doing after the blue frame part of the code is
> > executed. But it does send the packet if I write to the doorbell
> > register.
> >
> > Yevgeny, can you tell us what the device should be doing and why you
> > think this is a problem on POWER? Is it possible that this is simply a
> > problem with the firmware version?
> >
> > Thanks,
> > Cascardo.
> >
> > ---
> > if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
> > !vlan_tag) {
> > *(u32 *) (&tx_desc->ctrl.vlan_tag) |=
> > ring->doorbell_qpn;
> > op_own |= htonl((bf_index & 0xffff) << 8);
> > /* Ensure new descriptor hits memory
> > * before setting ownership of this descriptor to HW */
> > wmb();
> > tx_desc->ctrl.owner_opcode = op_own;
> >
> > wmb();
> >
> > mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned
> > long *) &tx_desc->ctrl,
> > desc_size);
> >
> > wmb();
> >
> > ring->bf.offset ^= ring->bf.buf_size;
> > } else {
> > /* Ensure new descriptor hits memory
> > * before setting ownership of this descriptor to HW */
> > wmb();
> > tx_desc->ctrl.owner_opcode = op_own;
> > - wmb();
> > - writel(ring->doorbell_qpn, ring->bf.uar->map +
> > MLX4_SEND_DOORBELL);
> > }
> >
> > + wmb();
> > + writel(ring->doorbell_qpn, ring->bf.uar->map +
> > MLX4_SEND_DOORBELL);
> > +
> > ---
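The ordering the quoted code relies on (descriptor contents first, then the ownership word, then the doorbell) can be modelled in user space. This is only a sketch under stated assumptions: `sim_desc`, `sim_dev` and `sim_post_tx` are hypothetical names, the "device" is ordinary memory, and C11 release stores stand in for `wmb()` plus `writel()`. They are not equivalent to the kernel primitives, which also order MMIO accesses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified TX descriptor: payload plus ownership word. */
struct sim_desc {
	uint32_t payload;
	_Atomic uint32_t owner_opcode;
};

/* Hypothetical device model with a single doorbell register. */
struct sim_dev {
	_Atomic uint32_t doorbell;
};

/*
 * Post one descriptor:
 *  1. fill in the descriptor body,
 *  2. publish ownership with a release store, so the body is visible
 *     before the ownership word,
 *  3. ring the doorbell with a second release store, so the ownership
 *     word is visible before the doorbell write.
 */
static void sim_post_tx(struct sim_desc *desc, struct sim_dev *dev,
			uint32_t payload, uint32_t op_own, uint32_t qpn)
{
	desc->payload = payload;
	atomic_store_explicit(&desc->owner_opcode, op_own,
			      memory_order_release);
	atomic_store_explicit(&dev->doorbell, qpn, memory_order_release);
}
```

The model says nothing about whether the blue frame copy alone should trigger transmission on the real device; it only illustrates the store ordering that the barriers in the quoted code are meant to enforce.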
* Re: [PATCH net] mscan: zero accidentally copied register content
From: Oliver Hartkopp @ 2011-10-06 14:01 UTC (permalink / raw)
To: Wolfram Sang, Wolfgang Grandegger; +Cc: Linux Netdev List, Andre Naujoks
In-Reply-To: <20111006092456.GB1974@pengutronix.de>
On 10/06/11 11:24, Wolfram Sang wrote:
>
>> Why do you want to change 16-bit accesses in general? They are faster
>> than two 8 bit accesses.
>
> Yup, was thinking the same.
Ah, I did not get this from your code example

	if (can_dlc & 1)
		*payload = in_be16() & mask;

which probably does the same as Wolfgang's more obvious suggestion

	if (frame->can_dlc & 1)
		frame->data[frame->can_dlc - 1] = in_8(data);

:-)

As my patch could be done without real testing, as I did not change the
register access and only fixed the result ...

	if (frame->can_dlc & 1)
		frame->data[frame->can_dlc] = 0;

... it would be nice if e.g. Wolfgang could send his patch after some testing,
as I currently don't have access to my MPC5200 hardware here.
Tnx & best regards,
Oliver
* Re: [PATCH net] mscan: zero accidentally copied register content
From: Wolfram Sang @ 2011-10-06 14:09 UTC (permalink / raw)
To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, Linux Netdev List, Andre Naujoks
In-Reply-To: <4E8DB4C9.30109@hartkopp.net>
> Ah, i did not get this from your code example
>
> if (can_dlc & 1)
> *payload = in_be16() & mask;
Sorry, I thought it was obvious that I was using pseudo_code here :)
> ... it would be nice if e.g. Wolfgang could send his patch after some testing,
> as i currently don't have access to my MPC5200 hardware here.
What about Andre?
Regards,
Wolfram
--
Pengutronix e.K. | Wolfram Sang |
Industrial Linux Solutions | http://www.pengutronix.de/ |
* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: David Laight @ 2011-10-06 14:10 UTC (permalink / raw)
To: Eli Cohen, Thadeu Lima de Souza Cascardo
Cc: Eli Cohen, Yevgeny Petrilin, linuxppc-dev, netdev
In-Reply-To: <20111006135759.GH2681@mtldesk30>
> static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt) {
> +	int i;
> +	__le32 *psrc = (__le32 *)src;
> +
> +	/*
> +	 * the buffer is already in big endian. For little endian machines that's
> +	 * fine. For big endian machines we must swap since the chipset swaps again
> +	 */
> +	for (i = 0; i < bytecnt / 4; ++i)
> +		psrc[i] = le32_to_cpu(psrc[i]);
> +
>  	__iowrite64_copy(dst, src, bytecnt / 8);
>  }
That code looks horrid...
1) I'm not sure the caller expects the buffer to be corrupted.
2) It contains a lot of memory cycles.
3) It looked from the calls that this code is copying descriptors,
so the transfer length is probably 1 or 2 words - so the loop
is inefficient.
4) ppc doesn't have a fast byteswap instruction (very new gcc might
use the byteswapping memory access for the le32_to_cpu() though),
so it would be better getting the byteswap done inside
__iowrite64_copy() - since that is probably requesting a byteswap
anyway.
OTOH I'm not at all clear about the 64bit xfers....
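Points 1) and 4) could be addressed by swapping while copying instead of rewriting the source buffer first. A minimal sketch, assuming a plain memory destination: `swab32_sketch` and `copy_swab32_sketch` are hypothetical helpers, not part of mlx4 or the kernel I/O API, a real MMIO target would need the proper I/O accessors, and the sketch swaps unconditionally where the quoted `le32_to_cpu()` only swaps on big endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable 32-bit byte swap. */
static uint32_t swab32_sketch(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/*
 * Copy bytecnt bytes (must be a multiple of 4) from src to dst,
 * byte-swapping each 32-bit word on the way. src is const, so the
 * caller's descriptor is left untouched.
 */
static void copy_swab32_sketch(uint32_t *dst, const uint32_t *src,
			       size_t bytecnt)
{
	size_t i;

	assert(bytecnt % 4 == 0);
	for (i = 0; i < bytecnt / 4; i++)
		dst[i] = swab32_sketch(src[i]);
}
```

This makes a single pass over the data and leaves the descriptor in memory as the caller wrote it.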
* Re: [PATCH net] mscan: zero accidentally copied register content
From: Andre Naujoks @ 2011-10-06 14:14 UTC (permalink / raw)
To: Wolfram Sang; +Cc: Oliver Hartkopp, Wolfgang Grandegger, Linux Netdev List
In-Reply-To: <20111006140904.GD1974@pengutronix.de>
Hi.
2011/10/6 Wolfram Sang <w.sang@pengutronix.de>
>
> > Ah, i did not get this from your code example
> >
> > if (can_dlc & 1)
> > *payload = in_be16() & mask;
>
> Sorry, I thought it was obvious that I was using pseudo_code here :)
>
> > ... it would be nice if e.g. Wolfgang could send his patch after some testing,
> > as i currently don't have access to my MPC5200 hardware here.
>
> What about Andre?
I am currently trying to test the first version Oliver sent, but I am
having some problems with my hardware and I don't know what they are yet.
I will give Wolfgang's suggestion a try after my problems are sorted out here.
Regards
Andre
>
> Regards,
>
> Wolfram
>
> --
> Pengutronix e.K. | Wolfram Sang |
> Industrial Linux Solutions | http://www.pengutronix.de/ |
>
* Re: [PATCH net] mscan: zero accidentally copied register content
From: Oliver Hartkopp @ 2011-10-06 14:33 UTC (permalink / raw)
To: Wolfgang Grandegger; +Cc: Wolfram Sang, Linux Netdev List, Andre Naujoks
In-Reply-To: <4E8D7065.8040905@grandegger.com>
On 10/06/11 11:09, Wolfgang Grandegger wrote:
> On 10/06/2011 09:02 AM, Oliver Hartkopp wrote:
>>
>> I think if one would like to rework the 16bit register access (which is used
>> in the rx path /and/ in the tx path also) this should go via net-next after
>> some discussion and testing.
>
> Why do you want to change 16-bit accesses in general? They are faster
> than two 8 bit accesses.
>
>> IMHO this fix is small and clear and especially not risky. I wonder if
>> reworking the 16 bit register access is worth the effort?
>
> I would prefer:
>
> if (!(frame->can_id & CAN_RTR_FLAG)) {
> void __iomem *data = &regs->rx.dsr1_0;
> u16 *payload = (u16 *)frame->data;
>
> for (i = 0; i < frame->can_dlc / 2; i++) {
> *payload++ = in_be16(data);
> data += 2 + _MSCAN_RESERVED_DSR_SIZE;
> }
> /* copy remaining byte */
> if (frame->can_dlc & 1)
> frame->data[frame->can_dlc - 1] = in_8(data);
> }
Besides the fact that Andre is going to test this idea from Wolfgang now, are
you really sure that it must be

	in_8(data)

and not

	in_8(data+1)

???

And that data definitely points to the right place?

I would prefer to be really cautious with these big endian 16 bit registers!

Therefore my fix with

+	/* zero accidentally copied register content at odd DLCs */
+	if (frame->can_dlc & 1)
+		frame->data[frame->can_dlc] = 0;

only repairing the result looks much more defensive to me.
Regards,
Oliver
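The two variants can be compared in user space against a simulated register file. This is a sketch under assumptions, not the mscan driver: the `sim_*` and `rx_copy_*` names are hypothetical, the register gap of 2 bytes is assumed, registers are modelled as a byte array with the high byte of each 16-bit register at the lower address, and payload bytes are stored explicitly so the result does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: gap in bytes between consecutive 16-bit data registers. */
#define SIM_RESERVED_DSR_SIZE 2

/* Simulated big endian 16-bit read: high byte sits at the lower address. */
static uint16_t sim_in_be16(const uint8_t *reg)
{
	return (uint16_t)((reg[0] << 8) | reg[1]);
}

/* Simulated 8-bit read at the register's base address. */
static uint8_t sim_in_8(const uint8_t *reg)
{
	return reg[0];
}

/* Store a 16-bit value high byte first, independent of host endianness. */
static void put_be16(uint8_t *dst, uint16_t val)
{
	dst[0] = (uint8_t)(val >> 8);
	dst[1] = (uint8_t)(val & 0xff);
}

/* Variant 1: read (dlc + 1) / 2 words, then zero the surplus byte. */
static void rx_copy_then_zero(uint8_t data[8], const uint8_t *regs,
			      unsigned dlc)
{
	unsigned i;

	assert(dlc <= 8);
	for (i = 0; i < (dlc + 1) / 2; i++) {
		put_be16(&data[2 * i], sim_in_be16(regs));
		regs += 2 + SIM_RESERVED_DSR_SIZE;
	}
	if (dlc & 1)
		data[dlc] = 0;
}

/* Variant 2: read dlc / 2 words, then fetch the last byte on its own. */
static void rx_copy_with_tail(uint8_t data[8], const uint8_t *regs,
			      unsigned dlc)
{
	unsigned i;

	assert(dlc <= 8);
	for (i = 0; i < dlc / 2; i++) {
		put_be16(&data[2 * i], sim_in_be16(regs));
		regs += 2 + SIM_RESERVED_DSR_SIZE;
	}
	if (dlc & 1)
		data[dlc - 1] = sim_in_8(regs);
}
```

In this model the remaining payload byte is the one at the register's base address, so `sim_in_8(regs)` without an offset matches the high byte of the 16-bit read. Whether the same holds on real hardware depends on the actual register layout.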
* Re: [PATCH net] mscan: zero accidentally copied register content
From: Andre Naujoks @ 2011-10-06 15:03 UTC (permalink / raw)
To: Oliver Hartkopp; +Cc: Wolfgang Grandegger, Wolfram Sang, Linux Netdev List
In-Reply-To: <4E8DBC34.80607@hartkopp.net>
2011/10/6 Oliver Hartkopp <socketcan@hartkopp.net>:
> On 10/06/11 11:09, Wolfgang Grandegger wrote:
>
>> On 10/06/2011 09:02 AM, Oliver Hartkopp wrote:
>>>
>>> I think if one would like to rework the 16bit register access (which is used
>>> in the rx path /and/ in the tx path also) this should go via net-next after
>>> some discussion and testing.
>>
>> Why do you want to change 16-bit accesses in general? They are faster
>> than two 8 bit accesses.
>>
>>> IMHO this fix is small and clear and especially not risky. I wonder if
>>> reworking the 16 bit register access is worth the effort?
>>
>> I would prefer:
>>
>> if (!(frame->can_id & CAN_RTR_FLAG)) {
>> void __iomem *data = &regs->rx.dsr1_0;
>> u16 *payload = (u16 *)frame->data;
>>
>> for (i = 0; i < frame->can_dlc / 2; i++) {
>> *payload++ = in_be16(data);
>> data += 2 + _MSCAN_RESERVED_DSR_SIZE;
>> }
>> /* copy remaining byte */
>> if (frame->can_dlc & 1)
>> frame->data[frame->can_dlc - 1] = in_8(data);
>> }
>
>
> Besides the fact that Andre is going to test this idea from Wolfgang now, are
> you really sure that it must be
>
> in_8(data)
>
> and not
>
> in_8(data+1)
>
> ???
>
> And that data definitely points to the right place?
>
> I would prefer to be really cautious with these big endian 16 bit registers!
>
> Therefore my fix with
>
> + /* zero accidentally copied register content at odd DLCs */
> + if (frame->can_dlc & 1)
> + frame->data[frame->can_dlc] = 0;
>
> only repairing the result looks much more defensive to me.
First things first: both ways seem to work correctly, at least on the
MPC5200 I have here.
But I am with Oliver on this one. His solution looks much simpler, and
endianness errors are not possible. If saving a few CPU cycles is worth it,
on the other hand, then Wolfgang's version is probably preferable. I
don't have access to this kind of hardware on a little endian machine
to test it, though.
Greetings
Andre
>
> Regards,
> Oliver
>
* Re: A new 40G Network driver ready to submit to the kernel tree
From: David @ 2011-10-06 15:06 UTC (permalink / raw)
To: Joyce Yu - System Software; +Cc: netdev@vger.kernel.org
In-Reply-To: <4E8CE4A0.4060202@oracle.com>
On 2011-10-6, at 7:13, Joyce Yu - System Software <joyce.yu@oracle.com> wrote:
>
>
> I have a new 40G Network driver ready to submit to the kernel tree. The driver has been ported to latest linux-3.0-rc5 and net-2.6-353e5c9 tree. The driver versions for 2.6.18 and 2.6.32 based kernel have been fully tested and released to the customer. Shall I just send the driverxx.c and driverxx.h for net-2.6-353e5c9 and linux-3.0-rc5 to this alias?
The net tree and the linux tree only take bug fixes. New features and new drivers should go in through the net-next tree, which is located on GitHub.
Shan Wei
>
> Thanks,
> Joyce