From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jens Axboe
Subject: Re: [PATCH] [RFC] xfs: wire up aio_fsync method
Date: Sun, 15 Jun 2014 20:58:46 -0600
Message-ID: <539E5D66.8040605@kernel.dk>
References: <1402562047-31276-1-git-send-email-david@fromorbit.com>
 <20140612141329.GA11676@infradead.org>
 <20140612234441.GT9508@dastard>
 <20140613162352.GB23394@infradead.org>
 <20140615223323.GB9508@dastard>
 <20140616020030.GC9508@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org
To: Dave Chinner, Christoph Hellwig
Return-path:
In-Reply-To: <20140616020030.GC9508@dastard>
Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-fsdevel.vger.kernel.org

On 2014-06-15 20:00, Dave Chinner wrote:
> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>> On Fri, Jun 13, 2014 at 09:23:52AM -0700, Christoph Hellwig wrote:
>>> On Fri, Jun 13, 2014 at 09:44:41AM +1000, Dave Chinner wrote:
>>>> On Thu, Jun 12, 2014 at 07:13:29AM -0700, Christoph Hellwig wrote:
>>>>> There doesn't really seem anything XFS specific here, so instead
>>>>> of wiring up ->aio_fsync I'd implement IOCB_CMD_FSYNC in fs/aio.c
>>>>> based on the workqueue and ->fsync.
>>>>
>>>> I really don't know whether the other ->fsync methods in other
>>>> filesystems can stand alone like that. I also don't have the
>>>> time to test that it works properly on all filesystems right now.
>>>
>>> Of course they can, as shown by various calls to vfs_fsync_range that
>>> is nothing but a small wrapper around ->fsync.
>>
>> Sure, but that's not getting 10,000 concurrent callers, is it? And
>> some fsync methods require journal credits, and others serialise
>> completely, and so on.
>>
>> Besides, putting an *unbound, highly concurrent* aio queue into the
>> kernel for an operation that can serialise the entire filesystem
>> seems like a pretty nasty user-level DOS vector to me.
>
> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
> running isn't anything related to XFS. The async fsync workqueue
> results in several thousand worker threads dispatching IO
> concurrently across 16 CPUs:
>
> $ ps -ef |grep kworker |wc -l
> 4693
> $
>
> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>
> -  51.33%  [kernel]  [k] percpu_ida_alloc
>    - percpu_ida_alloc
>       + 85.73% blk_mq_wait_for_tags
>       + 14.23% blk_mq_get_tag
> -  14.25%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 66.26% virtio_queue_rq
>          - __blk_mq_run_hw_queue
>             - 99.65% blk_mq_run_hw_queue
>                + 99.47% blk_mq_insert_requests
>                + 0.53% blk_mq_insert_request
> .....
> -   7.91%  [kernel]  [k] _raw_spin_unlock_irq
>    - _raw_spin_unlock_irq
>       - 69.59% __schedule
>          - 86.49% schedule
>             + 47.72% percpu_ida_alloc
>             + 21.75% worker_thread
>             + 19.12% schedule_timeout
> ....
>       + 18.06% blk_mq_make_request
>
> Runtime:
>
> real    4m1.243s
> user    0m47.724s
> sys     11m56.724s
>
> Most of the excessive CPU usage is coming from the blk-mq layer, and
> XFS is barely showing up in the profiles at all - the IDA tag
> allocator is burning 8 CPUs at about 60,000 write IOPS....
>
> I know that the tag allocator has been rewritten, so I tested
> against a current Linus kernel with the XFS aio-fsync
> patch.
> The results are all over the place - from several sequential
> runs of the same test (removing the files in between so each test
> starts from an empty fs):
>
> Wall time     sys time      IOPS      files/s
> 4m58.151s     11m12.648s    30,000    13,500
> 4m35.075s     12m45.900s    45,000    15,000
> 3m10.665s     11m15.804s    65,000    21,000
> 3m27.384s     11m54.723s    85,000    20,000
> 3m59.574s     11m12.012s    50,000    16,500
> 4m12.704s     12m15.720s    50,000    17,000
>
> The 3.15 based kernel was pretty consistent around the 4m10 mark,
> generally only +/-10s in runtime and not much change in system time.
> The files/s rate reported by fs_mark doesn't vary that much, either.
> So the new tag allocator seems to be no better in terms of IO
> dispatch scalability, yet adds significant variability to IO
> performance.
>
> What I noticed is a massive jump in context switch overhead: from
> around 250,000/s to over 800,000/s and the CPU profiles show that
> this comes from the new tag allocator:
>
> -  34.62%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 58.22% prepare_to_wait
>            100.00% bt_get
>               blk_mq_get_tag
>               __blk_mq_alloc_request
>               blk_mq_map_request
>               blk_sq_make_request
>               generic_make_request
>       - 22.51% virtio_queue_rq
>            __blk_mq_run_hw_queue
> ....
> -  21.56%  [kernel]  [k] _raw_spin_unlock_irq
>    - _raw_spin_unlock_irq
>       - 58.73% __schedule
>          - 53.42% io_schedule
>               99.88% bt_get
>                  blk_mq_get_tag
>                  __blk_mq_alloc_request
>                  blk_mq_map_request
>                  blk_sq_make_request
>                  generic_make_request
>          - 35.58% schedule
>             + 49.31% worker_thread
>             + 32.45% schedule_timeout
>             + 10.35% _xfs_log_force_lsn
>             + 3.10% xlog_cil_force_lsn
> ....
>
> The new block-mq tag allocator is hammering the waitqueues and
> that's generating a large amount of lock contention. It looks like
> the new allocator replaces CPU burn doing work in the IDA allocator
> with the same amount of CPU burn from extra context switch
> overhead....
>
> Oh, OH. Now I understand!
>
> # echo 4 > /sys/block/vdc/queue/nr_requests
>
>    80.56%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    - _raw_spin_unlock_irqrestore
>       - 98.49% prepare_to_wait
>            bt_get
>            blk_mq_get_tag
>            __blk_mq_alloc_request
>            blk_mq_map_request
>            blk_sq_make_request
>            generic_make_request
>          + submit_bio
>       + 1.07% finish_wait
> +  13.63%  [kernel]  [k] _raw_spin_unlock_irq
> ...
>
> It's context switch bound at 800,000 context switches/s, burning all
> 16 CPUs waking up and going to sleep and doing very little real
> work. How little real work? About 3000 IOPS for 2MB/s of IO. That
> amount of IO should only take a single digit CPU percentage of one
> CPU.

With thousands of threads? I think not. Sanely submitted 3000 IOPS,
correct, I would agree with you.

> This seems like bad behaviour to have on a congested block device,
> even a high performance one....

That is pretty much the suck. How do I reproduce this (e.g. what are you
running, and what are the xfs aio fsync patches)? Even if dispatching
thousands of threads to do IO is a bad idea (it very much is), handling
it gracefully is a must. I haven't seen any bad behavior from the new
allocator; it seems to be well behaved (for most normal cases, anyway).
I'd like to take a stab at ensuring this works, too.

If you tell me exactly what you are running, I'll reproduce and get this
fixed up tomorrow.

-- 
Jens Axboe
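
For a sense of scale, the figures quoted above for the nr_requests=4 run can be put side by side. This is only a back-of-the-envelope sketch in Python: the variable names are made up for illustration, and the inputs are the rounded numbers from the quoted text, not new measurements.

```python
# Rounded figures as quoted in the message above (nr_requests = 4 run).
# Variable names are illustrative only.
context_switches_per_sec = 800_000
iops = 3_000
cpus = 16

# How many context switches are spent for each IO that actually completes.
switches_per_io = context_switches_per_sec / iops

# How many context switches each CPU handles per second, assuming an even spread.
switches_per_cpu_per_sec = context_switches_per_sec / cpus

print(f"context switches per IO:             {switches_per_io:.0f}")
print(f"context switches per CPU per second: {switches_per_cpu_per_sec:.0f}")
```

On those rounded inputs, each IO costs a few hundred context switches, which is the sense in which the workload is context switch bound rather than IO bound.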