* [PATCH v2 0/3] xfs: Reduce spinlock contention in log space slowpath code
From: Waiman Long @ 2018-08-26 20:53 UTC
To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra
Cc: linux-xfs, linux-kernel, Dave Chinner, Waiman Long
v1->v2:
- For patch 1, remove wake_q_empty() & add task_in_wake_q().
- Rewrite patch 2 after comments from Dave Chinner and break it down
  into 2 separate patches. The original xfs logic is now kept; the
  patches just move the task wakeup calls outside the spinlock.
While running the AIM7 microbenchmark on a small xfs filesystem, it
was found that there was severe spinlock contention in the current
XFS log space reservation code. To alleviate the problem, the patches
try to move as much of the task wakeup code as possible outside the
spinlock using the wake_q mechanism, so as to reduce the lock hold
time.
Patch 1 exports the wake_up_q() and wake_q_add() functions and adds
the task_in_wake_q() inline function.
Patch 2 adds a new flag XLOG_TIC_WAKING to mark a task that is being
woken up, and skips the wake_up_process() call if a previous wakeup
has already been issued.
Patch 3 modifies the xlog_grant_head_wait() and xlog_grant_head_wake()
functions to use wake_q for waking up tasks outside the lock critical
section instead of calling wake_up_process() directly.
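
For illustration, the net effect of patches 2 and 3 can be condensed
into the sketch below. This is simplified: the identifiers are taken
from the patches further down the thread, but the reservation checks
and error handling are elided, so treat it as a sketch rather than the
actual code.

	DEFINE_WAKE_Q(wakeq);

	spin_lock(&head->lock);
	list_for_each_entry(tic, &head->waiters, t_queue) {
		if (tic->t_flags & XLOG_TIC_WAKING)	/* patch 2: skip waiters already being woken */
			continue;
		wake_q_add(&wakeq, tic->t_task);	/* patch 3: queue the wakeup instead of issuing it */
		tic->t_flags |= XLOG_TIC_WAKING;
	}
	spin_unlock(&head->lock);

	wake_up_q(&wakeq);	/* wakeups now happen outside the spinlock */
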
The following table shows the performance improvement in the AIM7
fserver workload after applying patches 2 and 3:
  Patches    Jobs/min    % Change
  -------    --------    --------
     -         91,486        -
     2        192,666      +111%
    2+3       285,221      +212%
So the final patched kernel performed more than 3X better than the
unpatched one.
Waiman Long (3):
sched/core: Export wake_q functions to kernel modules
xfs: Prevent multiple wakeups of the same log space waiter
xfs: Use wake_q for waking up log space waiters
 fs/xfs/xfs_linux.h           |  1 +
 fs/xfs/xfs_log.c             | 57 ++++++++++++++++++++++++++++++------
 fs/xfs/xfs_log_priv.h        |  1 +
 include/linux/sched/wake_q.h |  5 ++++
 kernel/sched/core.c          |  2 ++
 5 files changed, 57 insertions(+), 9 deletions(-)
--
2.18.0

* [PATCH v2 1/3] sched/core: Export wake_q functions to kernel modules
From: Waiman Long @ 2018-08-26 20:53 UTC
To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra
Cc: linux-xfs, linux-kernel, Dave Chinner, Waiman Long

The wake_q_add() and wake_up_q() functions allow task wakeups to be
done without holding a lock, which can help to reduce lock hold time.
They should be available to kernel modules as well.

A new task_in_wake_q() inline function is also added to check if the
given task is in a wake_q.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sched/wake_q.h | 5 +++++
 kernel/sched/core.c          | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 10b19a192b2d..902bf1228d32 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -47,6 +47,11 @@ static inline void wake_q_init(struct wake_q_head *head)
 	head->lastp = &head->first;
 }
 
+static inline bool task_in_wake_q(struct task_struct *task)
+{
+	return READ_ONCE(task->wake_q.next) != NULL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_up_q(struct wake_q_head *head);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625bc9897f62..d90a2930b8ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -420,6 +420,7 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	*head->lastp = node;
 	head->lastp = &node->next;
 }
+EXPORT_SYMBOL_GPL(wake_q_add);
 
 void wake_up_q(struct wake_q_head *head)
 {
@@ -442,6 +443,7 @@ void wake_up_q(struct wake_q_head *head)
 		put_task_struct(task);
 	}
 }
+EXPORT_SYMBOL_GPL(wake_up_q);
 
 /*
  * resched_curr - mark rq's current task 'to be rescheduled now'.
-- 
2.18.0
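
As a rough, illustrative aside (not part of the patch): the new
task_in_wake_q() helper lets a just-woken task check whether the waker
has already removed it from the wake_q it was queued on, e.g.:

	/* illustrative only - "current" is the task that was just woken */
	while (task_in_wake_q(current))
		cpu_relax();	/* wait until the waker has dequeued us */

Patch 3 below uses this construct in xlog_grant_head_wait() to avoid
putting a task on a second wake_q while it is still on the first one.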

* [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Waiman Long @ 2018-08-26 20:53 UTC
To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra
Cc: linux-xfs, linux-kernel, Dave Chinner, Waiman Long

The current log space reservation code allows multiple wakeups of the
same sleeping waiter to happen. This is just a waste of cpu time as
well as increasing spinlock hold time. So a new XLOG_TIC_WAKING flag
is added to track if a task is being woken up and skip the
wake_up_process() call if the flag is set.

Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
Broadwell system with a small xfs filesystem on ramfs, the performance
increased from 91,486 jobs/min to 192,666 jobs/min with this change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/xfs_log.c      | 9 +++++++++
 fs/xfs/xfs_log_priv.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c3b610b687d1..ac1dc8db7112 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -232,8 +232,16 @@ xlog_grant_head_wake(
 			return false;
 
 		*free_bytes -= need_bytes;
+
+		/*
+		 * Skip task that is being woken up already.
+		 */
+		if (tic->t_flags & XLOG_TIC_WAKING)
+			continue;
+
 		trace_xfs_log_grant_wake_up(log, tic);
 		wake_up_process(tic->t_task);
+		tic->t_flags |= XLOG_TIC_WAKING;
 	}
 
 	return true;
@@ -264,6 +272,7 @@ xlog_grant_head_wait(
 		trace_xfs_log_grant_wake(log, tic);
 
 		spin_lock(&head->lock);
+		tic->t_flags &= ~XLOG_TIC_WAKING;
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto shutdown;
 	} while (xlog_space_left(log, &head->grant) < need_bytes);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b5f82cb36202..738df09bf352 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -59,6 +59,7 @@ static inline uint xlog_get_client_id(__be32 i)
  */
 #define XLOG_TIC_INITED		0x1	/* has been initialized */
 #define XLOG_TIC_PERM_RESERV	0x2	/* permanent reservation */
+#define XLOG_TIC_WAKING		0x4	/* task is being woken up */
 
 #define XLOG_TIC_FLAGS \
 	{ XLOG_TIC_INITED,	"XLOG_TIC_INITED" }, \
-- 
2.18.0

* Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Dave Chinner @ 2018-08-27 0:21 UTC
To: Waiman Long
Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
> The current log space reservation code allows multiple wakeups of the
> same sleeping waiter to happen. This is just a waste of cpu time as
> well as increasing spinlock hold time. So a new XLOG_TIC_WAKING flag
> is added to track if a task is being woken up and skip the
> wake_up_process() call if the flag is set.
>
> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
> Broadwell system with a small xfs filesystem on ramfs, the performance
> increased from 91,486 jobs/min to 192,666 jobs/min with this change.

Oh, I just noticed you are using a ramfs for this benchmark,

tl;dr: Once you pass a certain point, ramdisks can be *much* slower
than SSDs on journal intensive workloads like AIM7. Hence it would be
useful to see if you have the same problems on, say, high performance
nvme SSDs.

-----

Ramdisks have substantially different log IO completion and wakeup
behaviour compared to real storage on real production systems.

Basically, ramdisks are synchronous and real storage is asynchronous.
That is, on a ramdisk the IO completion is run synchronously in the
same task as the IO submission because the IO is just a memcpy().
Hence a single dispatch thread can only drive an IO queue depth of
1 IO - there is no concurrency possible. This serialises large parts
of the XFS journal - the journal is really an asynchronous IO engine
that gets its performance from driving deep IO queues and batching
commits while IO is in flight.

Ramdisks also have very low IO latency, which means there's only a
very small window for "IO in flight" batching optimisations to be
made effectively. It effectively stops such algorithms from working
completely.

This means the XFS journal behaves very differently on ramdisks when
compared to normal storage. The submission batching technique reduces
log IOs by a factor of 10-20 under heavy synchronous transaction
loads when there is any noticeable journal IO delay - a few tens of
microseconds is enough for it to function effectively, but a ramdisk
doesn't even have this delay on journal IO. The submission batching
also has the effect of reducing log space wakeups by the same factor,
as there are fewer IO completions signalling that space has been made
available.

Further, when we get async IO completions from real hardware, they
get processed in batches by a completion workqueue - this leads to
there typically only being a single reservation space update from all
batched IO completions. This tends to reduce log space wakeups due to
log IO completion by a factor of 6-8 as the log can have up to 8
concurrent IOs in flight at a time.

And when we throw in the lack of batching, merging and IO completion
aggregation of metadata writeback because ramdisks are synchronous
and don't queue or merge adjacent IOs, we end up with lots more
contention on the AIL lock and much more frequent log space wakeups
(i.e. from log tail movement updates). This further exacerbates the
problems the log already has with synchronous IO.

IOWs, log space wakeups on real storage are likely to be 50-100x
lower than on a ramdisk for the same metadata and journal intensive
workload, and as such those workloads often run faster on real
storage than they do on ramdisks.

This can be trivially seen with dbench, a simple IO benchmark that
hammers the journal. On a ramdisk, I can only get 2-2.5GB/s
throughput from the benchmark before the log bottlenecks at about
20,000 tiny log IOs per second. In comparison, on an old, badly
abused Samsung 850EVO SSD, I see 5-6GB/s at 2,000 log IOs per second
because of the pipelining and IO batching in the XFS journal async IO
engine and the massive reduction in metadata IO due to merging of
adjacent IOs in the block layer.

i.e. the journal and metadata writeback design allows the filesystem
to operate at a much higher synchronous transaction rate than would
otherwise be possible, by taking advantage of the IO concurrency that
storage provides us with.

So if you use proper storage hardware (e.g. nvme SSD) and/or an
appropriately sized log, does the slowpath wakeup contention go away?

Can you please test both of these things and report the results so we
can properly evaluate the impact of these changes?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Christoph Hellwig @ 2018-08-27 7:39 UTC
To: Dave Chinner
Cc: Waiman Long, Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On Mon, Aug 27, 2018 at 10:21:34AM +1000, Dave Chinner wrote:
> tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> than SSDs on journal intensive workloads like AIM7. Hence it would be
> useful to see if you have the same problems on, say, high performance
> nvme SSDs.

Note that all these ramdisk issues you mentioned below will also apply
to using the pmem driver on nvdimms, which might be a more realistic
version. Even worse, at least for cases where the nvdimms aren't
actually powerfail dram of some sort with write-through caching and
ADR, the latency is going to be much higher than the ramdisk as well.

* Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Dave Chinner @ 2018-08-27 21:42 UTC
To: Christoph Hellwig
Cc: Waiman Long, Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On Mon, Aug 27, 2018 at 12:39:06AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 27, 2018 at 10:21:34AM +1000, Dave Chinner wrote:
> > tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> > than SSDs on journal intensive workloads like AIM7. Hence it would
> > be useful to see if you have the same problems on, say, high
> > performance nvme SSDs.
>
> Note that all these ramdisk issues you mentioned below will also apply
> to using the pmem driver on nvdimms, which might be a more realistic
> version. Even worse, at least for cases where the nvdimms aren't
> actually powerfail dram of some sort with write-through caching and
> ADR, the latency is going to be much higher than the ramdisk as well.

Yes, I realise that. I am expecting that when it comes to optimising
for pmem, we'll actually rewrite the journal to map pmem and memcpy()
directly rather than go through the buffering and IO layers we
currently do, so we can minimise write latency and control concurrency
ourselves. Hence I'm not really concerned by performance issues with
pmem at this point - most of our users still have traditional storage
and will for a long time to come....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Waiman Long @ 2018-08-27 15:34 UTC
To: Dave Chinner
Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On 08/26/2018 08:21 PM, Dave Chinner wrote:
> On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
>> The current log space reservation code allows multiple wakeups of the
>> same sleeping waiter to happen. This is just a waste of cpu time as
>> well as increasing spinlock hold time. So a new XLOG_TIC_WAKING flag
>> is added to track if a task is being woken up and skip the
>> wake_up_process() call if the flag is set.
>>
>> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
>> Broadwell system with a small xfs filesystem on ramfs, the performance
>> increased from 91,486 jobs/min to 192,666 jobs/min with this change.
> Oh, I just noticed you are using a ramfs for this benchmark,
>
> tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> than SSDs on journal intensive workloads like AIM7. Hence it would be
> useful to see if you have the same problems on, say, high performance
> nvme SSDs.

Oh sorry, I made a mistake.

There were some problems with my test configuration. I was actually
running the test on a regular enterprise-class disk device mounted
on /.

Filesystem                              1K-blocks     Used Available Use% Mounted on
/dev/mapper/rhel_hp--xl420gen9--01-root  52403200 11284408  41118792  22% /

It was not an SSD, nor a ramdisk. I reran the test on a ramdisk; the
performance of the patched kernel was 679,880 jobs/min, which was a
bit more than double the 285,221 score that I got on a regular disk.

So the filesystem used wasn't tiny, though it is still not very large.
The test was supposed to create 16 ramdisks and distribute the test
tasks to the ramdisks. Instead, they were all pounding on the same
filesystem, worsening the spinlock contention problem.

Cheers,
Longman

* Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
From: Dave Chinner @ 2018-08-28 1:26 UTC
To: Waiman Long
Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On Mon, Aug 27, 2018 at 11:34:13AM -0400, Waiman Long wrote:
> On 08/26/2018 08:21 PM, Dave Chinner wrote:
> > On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
> >> The current log space reservation code allows multiple wakeups of
> >> the same sleeping waiter to happen. This is just a waste of cpu
> >> time as well as increasing spinlock hold time. So a new
> >> XLOG_TIC_WAKING flag is added to track if a task is being woken up
> >> and skip the wake_up_process() call if the flag is set.
> >>
> >> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
> >> Broadwell system with a small xfs filesystem on ramfs, the
> >> performance increased from 91,486 jobs/min to 192,666 jobs/min
> >> with this change.
> > Oh, I just noticed you are using a ramfs for this benchmark,
> >
> > tl;dr: Once you pass a certain point, ramdisks can be *much* slower
> > than SSDs on journal intensive workloads like AIM7. Hence it would
> > be useful to see if you have the same problems on, say, high
> > performance nvme SSDs.
>
> Oh sorry, I made a mistake.
>
> There were some problems with my test configuration. I was actually
> running the test on a regular enterprise-class disk device mounted
> on /.
>
> Filesystem                              1K-blocks     Used Available Use% Mounted on
> /dev/mapper/rhel_hp--xl420gen9--01-root  52403200 11284408  41118792  22% /
>
> It was not an SSD, nor a ramdisk. I reran the test on a ramdisk; the
> performance of the patched kernel was 679,880 jobs/min, which was a
> bit more than double the 285,221 score that I got on a regular disk.

Can you please re-run and report the results for each patch on the
ramdisk setup? And, please, include the mkfs.xfs or xfs_info output
for the ramdisk filesystem so I can see /exactly/ how much concurrency
the filesystems are providing to the benchmark you are running.

> So the filesystem used wasn't tiny, though it is still not very large.

50GB is tiny for XFS. Personally, I've been using ~1PB filesystems(*)
for the performance testing I've been doing recently...

Cheers,

Dave.

(*) Yes, petabytes. Sparse image files on really fast SSDs are a
wonderful thing.
-- 
Dave Chinner
david@fromorbit.com

* [PATCH v2 3/3] xfs: Use wake_q for waking up log space waiters
From: Waiman Long @ 2018-08-26 20:53 UTC
To: Darrick J. Wong, Ingo Molnar, Peter Zijlstra
Cc: linux-xfs, linux-kernel, Dave Chinner, Waiman Long

In the current log space reservation slowpath code, the log space
waiters are woken up by an incoming waiter while holding the lock. As
the process of waking up a task can be time consuming, doing it while
holding the lock can make spinlock contention, if present, more
severe.

This patch changes the slowpath code to use a wake_q for waking up
tasks without holding the lock, thus improving performance and
reducing the spinlock contention level.

Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
Broadwell system with a small xfs filesystem on ramfs, the performance
increased from 192,666 jobs/min to 285,221 jobs/min with this change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/xfs/xfs_linux.h |  1 +
 fs/xfs/xfs_log.c   | 50 ++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index edbd5a210df2..1548a353da1e 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -60,6 +60,7 @@ typedef __u32 xfs_nlink_t;
 #include <linux/list_sort.h>
 #include <linux/ratelimit.h>
 #include <linux/rhashtable.h>
+#include <linux/sched/wake_q.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index ac1dc8db7112..70d5f85ff059 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -221,7 +221,8 @@ STATIC bool
 xlog_grant_head_wake(
 	struct xlog		*log,
 	struct xlog_grant_head	*head,
-	int			*free_bytes)
+	int			*free_bytes,
+	struct wake_q_head	*wakeq)
 {
 	struct xlog_ticket	*tic;
 	int			need_bytes;
@@ -240,7 +241,7 @@ xlog_grant_head_wake(
 			continue;
 
 		trace_xfs_log_grant_wake_up(log, tic);
-		wake_up_process(tic->t_task);
+		wake_q_add(wakeq, tic->t_task);
 		tic->t_flags |= XLOG_TIC_WAKING;
 	}
 
@@ -252,8 +253,9 @@ xlog_grant_head_wait(
 	struct xlog		*log,
 	struct xlog_grant_head	*head,
 	struct xlog_ticket	*tic,
-	int			need_bytes) __releases(&head->lock)
-					    __acquires(&head->lock)
+	int			need_bytes,
+	struct wake_q_head	*wakeq) __releases(&head->lock)
+					__acquires(&head->lock)
 {
 	list_add_tail(&tic->t_queue, &head->waiters);
 
@@ -265,6 +267,11 @@ xlog_grant_head_wait(
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_unlock(&head->lock);
 
+		if (wakeq) {
+			wake_up_q(wakeq);
+			wakeq = NULL;
+		}
+
 		XFS_STATS_INC(log->l_mp, xs_sleep_logspace);
 
 		trace_xfs_log_grant_sleep(log, tic);
@@ -272,7 +279,21 @@ xlog_grant_head_wait(
 		trace_xfs_log_grant_wake(log, tic);
 
 		spin_lock(&head->lock);
-		tic->t_flags &= ~XLOG_TIC_WAKING;
+		/*
+		 * The XLOG_TIC_WAKING flag should be set. However, it is
+		 * very unlikely that the current task is still in the
+		 * wake_q. If that happens (maybe anonymous wakeup), we
+		 * have to wait until the task is dequeued before proceeding
+		 * to avoid the possibility of having the task put into
+		 * another wake_q simultaneously.
+		 */
+		if (tic->t_flags & XLOG_TIC_WAKING) {
+			while (task_in_wake_q(current))
+				cpu_relax();
+
+			tic->t_flags &= ~XLOG_TIC_WAKING;
+		}
+
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto shutdown;
 	} while (xlog_space_left(log, &head->grant) < need_bytes);
@@ -310,6 +331,7 @@ xlog_grant_head_check(
 {
 	int			free_bytes;
 	int			error = 0;
+	DEFINE_WAKE_Q(wakeq);
 
 	ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY));
 
@@ -323,15 +345,17 @@ xlog_grant_head_check(
 	free_bytes = xlog_space_left(log, &head->grant);
 	if (!list_empty_careful(&head->waiters)) {
 		spin_lock(&head->lock);
-		if (!xlog_grant_head_wake(log, head, &free_bytes) ||
+		if (!xlog_grant_head_wake(log, head, &free_bytes, &wakeq) ||
 		    free_bytes < *need_bytes) {
 			error = xlog_grant_head_wait(log, head, tic,
-						     *need_bytes);
+						     *need_bytes, &wakeq);
+			wake_q_init(&wakeq);	/* Set wake_q to empty */
 		}
 		spin_unlock(&head->lock);
+		wake_up_q(&wakeq);
 	} else if (free_bytes < *need_bytes) {
 		spin_lock(&head->lock);
-		error = xlog_grant_head_wait(log, head, tic, *need_bytes);
+		error = xlog_grant_head_wait(log, head, tic, *need_bytes, NULL);
 		spin_unlock(&head->lock);
 	}
 
@@ -1077,6 +1101,7 @@ xfs_log_space_wake(
 {
 	struct xlog		*log = mp->m_log;
 	int			free_bytes;
+	DEFINE_WAKE_Q(wakeq);
 
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return;
@@ -1086,8 +1111,11 @@ xfs_log_space_wake(
 
 		spin_lock(&log->l_write_head.lock);
 		free_bytes = xlog_space_left(log, &log->l_write_head.grant);
-		xlog_grant_head_wake(log, &log->l_write_head, &free_bytes);
+		xlog_grant_head_wake(log, &log->l_write_head, &free_bytes,
+				     &wakeq);
 		spin_unlock(&log->l_write_head.lock);
+		wake_up_q(&wakeq);
+		wake_q_init(&wakeq);	/* Re-init wake_q to be reused again */
 	}
 
 	if (!list_empty_careful(&log->l_reserve_head.waiters)) {
@@ -1095,8 +1123,10 @@ xfs_log_space_wake(
 
 		spin_lock(&log->l_reserve_head.lock);
 		free_bytes = xlog_space_left(log, &log->l_reserve_head.grant);
-		xlog_grant_head_wake(log, &log->l_reserve_head, &free_bytes);
+		xlog_grant_head_wake(log, &log->l_reserve_head, &free_bytes,
+				     &wakeq);
 		spin_unlock(&log->l_reserve_head.lock);
+		wake_up_q(&wakeq);
 	}
 }
 
-- 
2.18.0

* Re: [PATCH v2 0/3] xfs: Reduce spinlock contention in log space slowpath code
From: Dave Chinner @ 2018-08-26 23:08 UTC
To: Waiman Long
Cc: Darrick J. Wong, Ingo Molnar, Peter Zijlstra, linux-xfs, linux-kernel

On Sun, Aug 26, 2018 at 04:53:12PM -0400, Waiman Long wrote:
> v1->v2:
> - For patch 1, remove wake_q_empty() & add task_in_wake_q().
> - Rewrite patch 2 after comments from Dave Chinner and break it down
>   into 2 separate patches. The original xfs logic is now kept; the
>   patches just move the task wakeup calls outside the spinlock.
>
> While running the AIM7 microbenchmark on a small xfs filesystem, it
> was found that there was severe spinlock contention in the current
> XFS log space reservation code. To alleviate the problem, the

Again I'll ask: what is the performance when the log is made large
enough that your benchmark is *not hammering the slow path*?

i.e. does running "mkfs.xfs -l size=2000m ..." instead of using the
default tiny log on your tiny test filesystem make the problem go
away?

Without that information, we have no idea what the slow path impact
on performance actually is, and whether it is worth pursuing
optimising slow path behaviour that very, very few production
environments see lock contention in....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com