[PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
@ 2020-12-10  8:08 SeongJae Park
  2020-12-10  8:08 ` [PATCH v2 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
  2020-12-10 14:09 ` [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' Eric Dumazet
  0 siblings, 2 replies; 4+ messages in thread
From: SeongJae Park @ 2020-12-10  8:08 UTC (permalink / raw)
  To: davem
  Cc: SeongJae Park, kuba, kuznet, edumazet, fw, paulmck, netdev, rcu,
	linux-kernel

From: SeongJae Park <sjpark@amazon.de>

On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase.  As a result, memory pressure occurs.

In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster.  It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop.  It takes
about 2 minutes.  On 40 CPU cores, 70GB DRAM machine, it reduced about
15GB of available memory in total.  Note that the issue don't reproduce
on every machine.  On my 6 CPU cores machine, the problem didn't
reproduce.

'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects.  They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'.

Therefore, 'fqdir_work_fn()' called frequently under the workload and
made the contention for 'rcu_barrier()' high.  In more detail, the
global mutex, 'rcu_state.barrier_mutex' became the bottleneck.

I tried making 'fqdir_work_fn()' batched and confirmed it works.  The
following patch is for the change.  I think this is the right solution
for point fix of this issue, but someone might blame different parts.

1. User: Frequent 'unshare()' calls
From some point of view, such frequent 'unshare()' calls might seem only
insane.

2. Global mutex in 'rcu_barrier()'
Because of the global mutex, 'rcu_barrier()' callers could wait long
even after the callbacks started before the call finished.  Therefore,
similar issues could happen in another 'rcu_barrier()' usages.  Maybe we
can use some wait queue like mechanism to notify the waiters when the
desired time came.

I personally believe applying the point fix for now and making
'rcu_barrier()' improvement in longterm make sense.  If I'm missing
something or you have different opinion, please feel free to let me
know.

Patch History
-------------

Changes from v1
(https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
- Keep xmas tree variable ordering (Jakub Kicinski)
- Add more numbers (Eric Dumazet)
- Use 'llist_for_each_entry_safe()' (Eric Dumazet)

SeongJae Park (1):
  net/ipv4/inet_fragment: Batch fqdir destroy works

 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works
  2020-12-10  8:08 [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' SeongJae Park
@ 2020-12-10  8:08 ` SeongJae Park
  2020-12-10 14:09 ` [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' Eric Dumazet
  1 sibling, 0 replies; 4+ messages in thread
From: SeongJae Park @ 2020-12-10  8:08 UTC (permalink / raw)
  To: davem
  Cc: SeongJae Park, kuba, kuznet, edumazet, fw, paulmck, netdev, rcu,
	linux-kernel

From: SeongJae Park <sjpark@amazon.de>

In 'fqdir_exit()', a work for destroy of the 'fqdir' is enqueued.  The
work function, 'fqdir_work_fn()', calls 'rcu_barrier()'.  In case of
intensive 'fqdir_exit()' (e.g., frequent 'unshare()' systemcalls), this
increased contention could result in unacceptably high latency of
'rcu_barrier()'.  This commit avoids such contention by doing the
destroy work in batched manner, as similar to that of 'cleanup_net()'.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
---
 include/net/inet_frag.h  |  2 +-
 net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index bac79e817776..558893d8810c 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -20,7 +20,7 @@ struct fqdir {
 
 	/* Keep atomic mem on separate cachelines in structs that include it */
 	atomic_long_t		mem ____cacheline_aligned_in_smp;
-	struct work_struct	destroy_work;
+	struct llist_node	destroy_list;
 };
 
 /**
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 10d31733297d..d5c40386a764 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -145,12 +145,19 @@ static void inet_frags_free_cb(void *ptr, void *arg)
 		inet_frag_destroy(fq);
 }
 
+static LLIST_HEAD(destroy_list);
+
 static void fqdir_work_fn(struct work_struct *work)
 {
-	struct fqdir *fqdir = container_of(work, struct fqdir, destroy_work);
-	struct inet_frags *f = fqdir->f;
+	struct llist_node *kill_list;
+	struct fqdir *fqdir, *tmp;
+	struct inet_frags *f;
 
-	rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
+	/* Atomically snapshot the list of fqdirs to destroy */
+	kill_list = llist_del_all(&destroy_list);
+
+	llist_for_each_entry(fqdir, kill_list, destroy_list)
+		rhashtable_free_and_destroy(&fqdir->rhashtable, inet_frags_free_cb, NULL);
 
 	/* We need to make sure all ongoing call_rcu(..., inet_frag_destroy_rcu)
 	 * have completed, since they need to dereference fqdir.
@@ -158,10 +165,13 @@ static void fqdir_work_fn(struct work_struct *work)
 	 */
 	rcu_barrier();
 
-	if (refcount_dec_and_test(&f->refcnt))
-		complete(&f->completion);
+	llist_for_each_entry_safe(fqdir, tmp, kill_list, destroy_list) {
+		f = fqdir->f;
+		if (refcount_dec_and_test(&f->refcnt))
+			complete(&f->completion);
 
-	kfree(fqdir);
+		kfree(fqdir);
+	}
 }
 
 int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
@@ -184,10 +194,12 @@ int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net)
 }
 EXPORT_SYMBOL(fqdir_init);
 
+static DECLARE_WORK(fqdir_destroy_work, fqdir_work_fn);
+
 void fqdir_exit(struct fqdir *fqdir)
 {
-	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
-	queue_work(system_wq, &fqdir->destroy_work);
+	if (llist_add(&fqdir->destroy_list, &destroy_list))
+		queue_work(system_wq, &fqdir_destroy_work);
 }
 EXPORT_SYMBOL(fqdir_exit);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
  2020-12-10  8:08 [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' SeongJae Park
  2020-12-10  8:08 ` [PATCH v2 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
@ 2020-12-10 14:09 ` Eric Dumazet
  2020-12-10 22:16   ` SeongJae Park
  1 sibling, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2020-12-10 14:09 UTC (permalink / raw)
  To: SeongJae Park
  Cc: David Miller, SeongJae Park, Jakub Kicinski, Alexey Kuznetsov,
	Florian Westphal, Paul E. McKenney, netdev, rcu, LKML

On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
> make the number of active slab objects including 'sock_inode_cache' type
> rapidly and continuously increase.  As a result, memory pressure occurs.
>
> In more detail, I made an artificial reproducer that resembles the
> workload that we found the problem and reproduce the problem faster.  It
> merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop.  It takes
> about 2 minutes.  On 40 CPU cores, 70GB DRAM machine, it reduced about
> 15GB of available memory in total.  Note that the issue don't reproduce
> on every machine.  On my 6 CPU cores machine, the problem didn't
> reproduce.

OK, that is the number before the patch, but what is the number after
the patch ?

I think the idea is very nice, but this will serialize fqdir hash
tables destruction on one single cpu,
this might become a real issue _if_ these hash tables are populated.

(Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET);  all
these tables are empty...)

As you may now, frags are often used as vectors for DDOS attacks.

I would suggest maybe to not (ab)use system_wq, but a dedicated work queue
with a limit (@max_active argument set to 1 in alloc_workqueue()) , to
make sure that the number of
threads is optimal/bounded.

Only the phase after hash table removal could benefit from your
deferral to a single context,
so that a single rcu_barrier() is active, since the part after
rcu_barrier() is damn cheap and _can_ be serialized

  if (refcount_dec_and_test(&f->refcnt))
                complete(&f->completion);

Thanks !

>
> 'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
> relevant memory objects.  They are asynchronously invoked by the work
> queues and internally use 'rcu_barrier()' to ensure safe destructions.
> 'cleanup_net()' works in a batched maneer in a single thread worker,
> while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
> 'system_wq'.
>
> Therefore, 'fqdir_work_fn()' called frequently under the workload and
> made the contention for 'rcu_barrier()' high.  In more detail, the
> global mutex, 'rcu_state.barrier_mutex' became the bottleneck.
>
> I tried making 'fqdir_work_fn()' batched and confirmed it works.  The
> following patch is for the change.  I think this is the right solution
> for point fix of this issue, but someone might blame different parts.
>
> 1. User: Frequent 'unshare()' calls
> From some point of view, such frequent 'unshare()' calls might seem only
> insane.
>
> 2. Global mutex in 'rcu_barrier()'
> Because of the global mutex, 'rcu_barrier()' callers could wait long
> even after the callbacks started before the call finished.  Therefore,
> similar issues could happen in another 'rcu_barrier()' usages.  Maybe we
> can use some wait queue like mechanism to notify the waiters when the
> desired time came.
>
> I personally believe applying the point fix for now and making
> 'rcu_barrier()' improvement in longterm make sense.  If I'm missing
> something or you have different opinion, please feel free to let me
> know.
>
>
> Patch History
> -------------
>
> Changes from v1
> (https://lore.kernel.org/netdev/20201208094529.23266-1-sjpark@amazon.com/)
> - Keep xmas tree variable ordering (Jakub Kicinski)
> - Add more numbers (Eric Dumazet)
> - Use 'llist_for_each_entry_safe()' (Eric Dumazet)
>
> SeongJae Park (1):
>   net/ipv4/inet_fragment: Batch fqdir destroy works
>
>  include/net/inet_frag.h  |  2 +-
>  net/ipv4/inet_fragment.c | 28 ++++++++++++++++++++--------
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)'
  2020-12-10 14:09 ` [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' Eric Dumazet
@ 2020-12-10 22:16   ` SeongJae Park
  0 siblings, 0 replies; 4+ messages in thread
From: SeongJae Park @ 2020-12-10 22:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: SeongJae Park, David Miller, SeongJae Park, Jakub Kicinski,
	Alexey Kuznetsov, Florian Westphal, Paul E. McKenney, netdev, rcu,
	LKML

On Thu, 10 Dec 2020 15:09:10 +0100 Eric Dumazet <edumazet@google.com> wrote:

> On Thu, Dec 10, 2020 at 9:09 AM SeongJae Park <sjpark@amazon.com> wrote:
> >
> > From: SeongJae Park <sjpark@amazon.de>
> >
> > On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
> > make the number of active slab objects including 'sock_inode_cache' type
> > rapidly and continuously increase.  As a result, memory pressure occurs.
> >
> > In more detail, I made an artificial reproducer that resembles the
> > workload that we found the problem and reproduce the problem faster.  It
> > merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop.  It takes
> > about 2 minutes.  On 40 CPU cores, 70GB DRAM machine, it reduced about
> > 15GB of available memory in total.  Note that the issue don't reproduce
> > on every machine.  On my 6 CPU cores machine, the problem didn't
> > reproduce.
> 
> OK, that is the number before the patch, but what is the number after
> the patch ?

No continuous memory reduction but some fluctuation observed.  Nevertheless,
the available memory reduction was only up to about 400MB.

> 
> I think the idea is very nice, but this will serialize fqdir hash
> tables destruction on one single cpu,
> this might become a real issue _if_ these hash tables are populated.
> 
> (Obviously in your for (i=1;i<50000;i++) unshare(CLONE_NEWNET);  all
> these tables are empty...)
> 
> As you may now, frags are often used as vectors for DDOS attacks.
> 
> I would suggest maybe to not (ab)use system_wq, but a dedicated work queue
> with a limit (@max_active argument set to 1 in alloc_workqueue()) , to
> make sure that the number of
> threads is optimal/bounded.
> 
> Only the phase after hash table removal could benefit from your
> deferral to a single context,
> so that a single rcu_barrier() is active, since the part after
> rcu_barrier() is damn cheap and _can_ be serialized
> 
>   if (refcount_dec_and_test(&f->refcnt))
>                 complete(&f->completion);

Good point, thanks for this kind suggestion.  I will do so in next version.


Thanks,
SeongJae Park

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-12-10 23:10 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-12-10  8:08 [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' SeongJae Park
2020-12-10  8:08 ` [PATCH v2 1/1] net/ipv4/inet_fragment: Batch fqdir destroy works SeongJae Park
2020-12-10 14:09 ` [PATCH v2 0/1] net: Reduce rcu_barrier() contentions from 'unshare(CLONE_NEWNET)' Eric Dumazet
2020-12-10 22:16   ` SeongJae Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).