From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
Subject: Re: [PATCH] [RFC] xfs: wire up aio_fsync method
Date: Sun, 15 Jun 2014 20:58:46 -0600
Message-ID: <539E5D66.8040605@kernel.dk>
References: <1402562047-31276-1-git-send-email-david@fromorbit.com> <20140612141329.GA11676@infradead.org> <20140612234441.GT9508@dastard> <20140613162352.GB23394@infradead.org> <20140615223323.GB9508@dastard> <20140616020030.GC9508@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org
To: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
	Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Return-path: <linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
In-Reply-To: <20140616020030.GC9508@dastard>
Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-fsdevel.vger.kernel.org

On 2014-06-15 20:00, Dave Chinner wrote:
> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
>>> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
>>>> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
>>>>> There doesn't really seem anything XFS specific here, so instead
>>>>> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
>>>>> based on the workqueue and ->fsync.
>>>>
>>>> I really don't know whether the other ->fsync methods in other
>>>> filesystems can stand alone like that. I also don't have the
>>>> time to test that it works properly on all filesystems right now.
>>>
>>> Of course they can, as shown by various calls to vfs_fsync_range that
>>> is nothing but a small wrapper around ->fsync.
>>
>> Sure, but that's not getting 10,000 concurrent callers, is it? And
>> some fsync methods require journal credits, and others serialise
>> completely, and so on.
>>
>> Besides, putting an *unbound, highly concurrent* aio queue into the
>> kernel for an operation that can serialise the entire filesystem
>> seems like a pretty nasty user-level DOS vector to me.
>
> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> running isn't anything related to XFS.  The async fsync workqueue
> results in several thousand worker threads dispatching IO
> concurrently across 16 CPUs:
>
> $ ps -ef |grep kworker |wc -l
> 4693
> $
>
> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>
> -  51.33%  [kernel]            [k] percpu_ida_alloc
>     - percpu_ida_alloc
>        + 85.73% blk_mq_wait_for_tags
>        + 14.23% blk_mq_get_tag
> -  14.25%  [kernel]            [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 66.26% virtio_queue_rq
>           - __blk_mq_run_hw_queue
>              - 99.65% blk_mq_run_hw_queue
>                 + 99.47% blk_mq_insert_requests
>                 + 0.53% blk_mq_insert_request
> .....
> -   7.91%  [kernel]            [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 69.59% __schedule
>           - 86.49% schedule
>              + 47.72% percpu_ida_alloc
>              + 21.75% worker_thread
>              + 19.12% schedule_timeout
> ....
>        + 18.06% blk_mq_make_request
>
> Runtime:
>
> real    4m1.243s
> user    0m47.724s
> sys     11m56.724s
>
> Most of the excessive CPU usage is coming from the blk-mq layer, and
> XFS is barely showing up in the profiles at all - the IDA tag
> allocator is burning 8 CPUs at about 60,000 write IOPS....
>
> I know that the tag allocator has been rewritten, so I tested
> against a current a current Linus kernel with the XFS aio-fsync
> patch. The results are all over the place - from several sequential
> runs of the same test (removing the files in between so each tests
> starts from an empty fs):
>
> Wall time	sys time	IOPS	 files/s
> 4m58.151s	11m12.648s	30,000	 13,500
> 4m35.075s	12m45.900s	45,000	 15,000
> 3m10.665s	11m15.804s	65,000	 21,000
> 3m27.384s	11m54.723s	85,000	 20,000
> 3m59.574s	11m12.012s	50,000	 16,500
> 4m12.704s	12m15.720s	50,000	 17,000
>
> The 3.15 based kernel was pretty consistent around the 4m10 mark,
> generally only +/-10s in runtime and not much change in system time.
> The files/s rate reported by fs_mark doesn't vary that much, either.
> So the new tag allocator seems to be no better in terms of IO
> dispatch scalability, yet adds significant variability to IO
> performance.
>
> What I noticed is a massive jump in context switch overhead: from
> around 250,000/s to over 800,000/s and the CPU profiles show that
> this comes from the new tag allocator:
>
> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 58.22% prepare_to_wait
>             100.00% bt_get
>                blk_mq_get_tag
>                __blk_mq_alloc_request
>                blk_mq_map_request
>                blk_sq_make_request
>                generic_make_request
>        - 22.51% virtio_queue_rq
>             __blk_mq_run_hw_queue
> ....
> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>     - _raw_spin_unlock_irq
>        - 58.73% __schedule
>           - 53.42% io_schedule
>                99.88% bt_get
>                   blk_mq_get_tag
>                   __blk_mq_alloc_request
>                   blk_mq_map_request
>                   blk_sq_make_request
>                   generic_make_request
>           - 35.58% schedule
>              + 49.31% worker_thread
>              + 32.45% schedule_timeout
>              + 10.35% _xfs_log_force_lsn
>              + 3.10% xlog_cil_force_lsn
> ....
>
> The new block-mq tag allocator is hammering the waitqueues and
> that's generating a large amount of lock contention. It looks like
> the new allocator replaces CPU burn doing work in the IDA allocator
> with the same amount of CPU burn from extra context switch
> overhead....
>
> Oh, OH. Now I understand!
>
> # echo 4 > /sys/block/vdc/queue/nr_requests
>
> <run workload>
>
> 80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     - _raw_spin_unlock_irqrestore
>        - 98.49% prepare_to_wait
>             bt_get
>             blk_mq_get_tag
>             __blk_mq_alloc_request
>             blk_mq_map_request
>             blk_sq_make_request
>             generic_make_request
>           + submit_bio
>        + 1.07% finish_wait
> +  13.63%  [kernel]  [k] _raw_spin_unlock_irq
> ...
>
> It's context switch bound at 800,000 context switches/s, burning all
> 16 CPUs waking up and going to sleep and doing very little real
> work. How little real work? About 3000 IOPS for 2MB/s of IO. That
> amount of IO should only take a single digit CPU percentage of one
> CPU.

With thousands of threads? I think not. Sanely submitted 3000 IOPS, 
correct, I would agree with you.

> This seems like bad behaviour to have on a congested block device,
> even a high performance one....

That is pretty much the suck. How do I reproduce this (eg what are you 
running, and what are the xfs aio fsync patches)? Even if dispatching 
thousands of threads to do IO is a bad idea (it very much is), 
gracefully handling is a must. I haven't seen any bad behavior from the 
new allocator, it seems to be well behaved (for most normal cases, 
anyway). I'd like to take a stab at ensuring this works, too.

If you tell me exactly what you are running, I'll reproduce and get this 
fixed up tomorrow.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html