Subject: Re: Question about non asynchronous aio calls.
References: <20151007141833.GB11716@scylladb.com> <56152B0F.2040809@sandeen.net> <20151007150833.GB30191@bfoster.bfoster> <56153685.3040401@sandeen.net>
From: Avi Kivity
Message-ID: <561560B2.1080902@scylladb.com>
Date: Wed, 7 Oct 2015 21:13:06 +0300
In-Reply-To: <56153685.3040401@sandeen.net>
List-Id: XFS Filesystem from SGI
To: Eric Sandeen, Brian Foster
Cc: xfs@oss.sgi.com

On 07/10/15 18:13, Eric Sandeen wrote:
>
> On 10/7/15 10:08 AM, Brian Foster wrote:
>> On Wed, Oct 07, 2015 at 09:24:15AM -0500, Eric Sandeen wrote:
>>>
>>> On 10/7/15 9:18 AM, Gleb Natapov wrote:
>>>> Hello XFS developers,
>>>>
>>>> We are working on scylladb[1], a database which is written using
>>>> seastar[2], a highly asynchronous C++ framework. The code uses aio
>>>> heavily: no synchronous operation is allowed at all by the framework,
>>>> otherwise performance drops drastically. We noticed that the only
>>>> mainstream FS in Linux that takes aio seriously is XFS.
>>>> So let me start by thanking you guys for the great work! But
>>>> unfortunately we also noticed that sometimes io_submit() is executed
>>>> synchronously even on XFS.
>>>>
>>>> Looking at the code I see two cases when this happens: unaligned IO
>>>> and writes past EOF. It looks like we hit both. For the first one we
>>>> make a special effort to never issue unaligned IO, and we use
>>>> XFS_IOC_DIOINFO to figure out what the alignment should be, but it
>>>> does not help. Looking at the code, though, xfs_file_dio_aio_write()
>>>> checks alignment against m_blockmask, which is set to
>>>> sbp->sb_blocksize - 1, so aio expects the buffer to be aligned to the
>>>> filesystem block size, not the values that DIOINFO returns. Is this
>>>> intentional? How should our code know what to align buffers to?
>>>         /* "unaligned" here means not aligned to a filesystem block */
>>>         if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
>>>                 unaligned_io = 1;
>>>
>>> It should be aligned to the filesystem block size.
>>>
>> I'm not sure exactly what kinds of races are opened if the above locking
>> were absent, but I'd guess it's related to the buffer/block state
>> management, block zeroing and whatnot that is buried in the depths of
>> the generic dio code.
> Yep:
>
> commit eda77982729b7170bdc9e8855f0682edf322d277
> Author: Dave Chinner
> Date:   Tue Jan 11 10:22:40 2011 +1100
>
>     xfs: serialise unaligned direct IOs
>
>     When two concurrent unaligned, non-overlapping direct IOs are issued
>     to the same block, the direct IO layer will race to zero the block.
>     The result is that one of the concurrent IOs will overwrite data
>     written by the other IO with zeros. This is demonstrated by the
>     xfsqa test 240.
>
>     To avoid this problem, serialise all unaligned direct IOs to an
>     inode with a big hammer. We need a big hammer approach as we need to
>     serialise AIO as well, so we can't just block writes on locks.
>     Hence, the big hammer is calling xfs_ioend_wait() while holding out
>     other unaligned direct IOs from starting.
>
>     We don't bother trying to serialise aligned vs unaligned IOs as
>     they are overlapping IO and the result of concurrent overlapping IOs
>     is undefined - the result of either IO is a valid result so we let
>     them race. Hence we only penalise unaligned IO, which already has a
>     major overhead compared to aligned IO so this isn't a major problem.
>
>     Signed-off-by: Dave Chinner
>     Reviewed-by: Alex Elder
>     Reviewed-by: Christoph Hellwig
>
> I fixed something similar in ext4 at the time, FWIW.

Makes sense. Is there a way to relax this for reads? It's pretty easy to
saturate the disk read bandwidth with 4K reads, and there shouldn't be a
race there, at least for reads targeting already-written blocks. For us,
at least, small reads would be sufficient.