Subject: Re: sleeps and waits during io_submit
From: Avi Kivity
Date: Tue, 1 Dec 2015 23:38:29 +0200
To: Dave Chinner
Cc: Glauber Costa, xfs@oss.sgi.com

On 12/01/2015 11:19 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
>>> Hi Avi,
>>>
>>>>> else is going to execute in our place until this thread can make
>>>>> progress.
>>>> For us, nothing else can execute in our place; we usually have exactly
>>>> one thread per logical core, so we are heavily dependent on io_submit
>>>> not sleeping.
>>>>
>>>> The case of a contended lock is, to me, less worrying. It can be
>>>> reduced by using more allocation groups, which is apparently the
>>>> shared resource under contention.
>>>>
>>> I apologize if I misread your previous comments, but IIRC you said you
>>> can't change the directory structure your application is using, and
>>> IIRC your application does not spread files across several directories.
>> I miswrote somewhat: the application writes data files and commitlog
>> files. The data file directory structure is fixed due to compatibility
>> concerns (it is not a single directory, but some workloads will see
>> most access on files in a single directory). The commitlog directory
>> structure is more relaxed, and we can split it into a directory per
>> shard (=cpu) or something else.
>>
>> If worst comes to worst, we'll hack around this and distribute the
>> data files into more directories, and provide some hack for
>> compatibility.
>>
>>> XFS spreads files across the allocation groups, based on the directory
>>> the files are created in,
>> Idea: create the files in some subdirectory, and immediately move them
>> to their required location.
> See xfs_fsr.

Can you elaborate? I don't see how it is applicable.

My hack involves creating the file in a random directory and, while it is
still zero sized, moving it to its final directory. This is simply to defeat
the AG selection heuristic. No data is copied.
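A minimal sketch of what I mean, assuming the scratch directories sit on the
same XFS filesystem as the final location (all paths and names here are made
up for illustration):

/*
 * Sketch of the workaround: create the file in a randomly chosen scratch
 * directory so XFS picks an allocation group based on that directory, then
 * rename() it into its final location while it is still zero sized, so no
 * data is copied.  The scratch directory layout is invented; both paths
 * must live on the same XFS filesystem for rename() to work, and a real
 * implementation would need a collision-free temporary name.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int create_spread(const char *final_path)
{
    char scratch[128];

    /* Pick one of 16 hypothetical scratch directories to defeat the
     * directory-based AG selection heuristic. */
    snprintf(scratch, sizeof(scratch), "/data/.scratch.%d/tmp.%d",
             rand() % 16, getpid());

    int fd = open(scratch, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    /* Still zero sized: only the directory entry moves, no data blocks. */
    if (rename(scratch, final_path) < 0) {
        close(fd);
        unlink(scratch);
        return -1;
    }

    return fd;  /* caller writes through this fd as usual */
}

The expectation (hedged) is that later writes through the returned fd are
allocated near the inode, i.e. in the AG chosen for the scratch directory,
rather than in the destination directory's AG.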
>>> trying to keep files as close as possible to their
>>> metadata.
>> This is pointless for an SSD. Perhaps XFS should randomize the AG on
>> nonrotational media instead.
> Actually, no, it is not pointless. SSDs do not require optimisation
> for minimal seek time, but data locality is still just as important
> as on spinning disks, if not more so. Why? Because the garbage
> collection routines in the SSDs are all about locality, and we can't
> drive garbage collection effectively via discard operations if the
> filesystem is not keeping temporally related files close together in
> its block address space.

In my case, files in the same directory are not temporally related. But I
understand where the heuristic comes from. Maybe an ioctl to set a directory
attribute, "the files in this directory are not temporally related"? I
imagine this would be useful for many server applications.

> e.g. If the files in a directory are all close together, and the
> directory is removed, we then leave a big empty contiguous region in
> the filesystem free space map, and when we send discards over that
> we end up with a single big trim and the drive handles that far more

Would this not be defeated if a directory that happens to share the
allocation group gets populated simultaneously?

> effectively than lots of little trims (i.e. one per file) that the
> drive cannot do anything useful with because they are all smaller
> than the internal SSD page/block sizes and so get ignored. This is
> one of the reasons fstrim is so much more efficient and effective
> than using the discard mount option.

In my use case, the files are fairly large, and there is constant rewriting
(not in place: files are read, merged, and written back). So I'm worried
that an fstrim can happen too late.

> And, well, XFS is designed to operate on storage devices made up of
> more than one drive, so the way AGs are selected is designed to
> give long term load balancing (both for space usage and
> instantaneous performance). With the existing algorithms we've not
> had any issues with SSD lifetimes, long term performance
> degradation, etc., so there's no evidence that we actually need to
> change the fundamental allocation algorithms specially for SSDs.

Ok. Maybe the SSDs can deal with untrimmed overwrites efficiently, provided
the I/O sizes are large enough.
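For reference, the heavy lifting in fstrim amounts to a single FITRIM ioctl
against the mounted filesystem, so an application worried that a periodic
trim comes too late could, in principle, issue one itself after a large
batch of deletes. A rough sketch, with the mount point and minimum extent
length invented for illustration:

/*
 * Sketch of what fstrim(8) does under the hood: one FITRIM ioctl asking
 * the filesystem to discard every free extent larger than minlen.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */

int main(void)
{
    int fd = open("/data", O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range;
    memset(&range, 0, sizeof(range));
    range.len = ULLONG_MAX;      /* cover free space across the whole filesystem */
    range.minlen = 1024 * 1024;  /* skip free extents smaller than 1 MiB */

    if (ioctl(fd, FITRIM, &range) < 0)
        perror("FITRIM");
    else
        /* on return the kernel updates range.len with the bytes trimmed */
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}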