Date: Mon, 29 Jan 2024 11:20:24 +0800
From: Ming Lei
To: Matthew Wilcox
Cc: Mike Snitzer, Andrew Morton, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Don Dutile,
 Raghavendra K T, Alexander Viro, Christian Brauner, ming.lei@redhat.com
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops
 in willneed range
References: <20240128142522.1524741-1-ming.lei@redhat.com>

On Mon, Jan 29, 2024 at 12:21:16AM +0000, Matthew Wilcox wrote:
> On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28 2024 at 5:02P -0500,
> > Matthew Wilcox wrote:
> >
> > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > > only tries to readahead 512 pages, and the remaining part of the
> > > > advised range falls back to normal readahead.
> > >
> > > Does the MAINTAINERS file mean nothing any more?
> >
> > "Ming, please use scripts/get_maintainer.pl when submitting patches."
>
> That's an appropriate response to a new contributor, sure. Ming has
> been submitting patches since, what, 2008? Surely they know how to
> submit patches by now.
>
> > I agree this patch's header could've worked harder to establish the
> > problem that it fixes. But I'll now take a crack at backfilling the
> > regression report that motivated developing this patch:
>
> Thank you.
>
> > Linux 3.14 was the last kernel in which madvise(MADV_WILLNEED)
> > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb.
> >
> > This regressed with commit 6d2be915e589 (so Linux 3.15) such that
> > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> > and running this reproducer against a 2G file will take ~5x longer
> > (depending on the system's capabilities), mmap_load_test.java follows:
> >
> > import java.nio.ByteBuffer;
> > import java.nio.ByteOrder;
> > import java.io.RandomAccessFile;
> > import java.nio.MappedByteBuffer;
> > import java.nio.channels.FileChannel;
> > import java.io.File;
> > import java.io.FileNotFoundException;
> > import java.io.IOException;
> >
> > public class mmap_load_test {
> >
> >         public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
> >                 if (args.length == 0) {
> >                         System.out.println("Please provide a file");
> >                         System.exit(0);
> >                 }
> >                 FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> >                 MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
> >
> >                 System.out.println("Loading the file");
> >
> >                 long startTime = System.currentTimeMillis();
> >                 mem.load();
> >                 long endTime = System.currentTimeMillis();
> >                 System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
> >
> >         }
> > }
>
> It's good to have the original reproducer. The unfortunate part is
> that being at such a high level, it doesn't really show what syscalls
> the library makes on behalf of the application. I'll take your word
> for it that it calls madvise(MADV_WILLNEED). An strace might not go
> amiss.

Yeah, it can be the fadvise(WILLNEED) or readahead syscall too.

>
> > reproduce with:
> >
> > javac mmap_load_test.java
> > echo 64 > /sys/block/sda/queue/read_ahead_kb
> > echo 1024 > /sys/block/sda/queue/max_sectors_kb
> > mkfs.xfs /dev/sda
> > mount /dev/sda /mnt/test
> > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
> >
> > echo 3 > /proc/sys/vm/drop_caches
>
> (I prefer to unmount/mount /mnt/test; it drops the cache for
> /mnt/test/2G_file without affecting the rest of the system)
>
> > java mmap_load_test /mnt/test/2G_file
> >
> > Without a fix, like the patch Ming provided, iostat will show rareq-sz
> > is 64 rather than ~1024.
>
> Understood. But ... the application is asking for as much readahead as
> possible, and the sysadmin has said "Don't readahead more than 64kB at
> a time". So why will we not get a bug report in 1-15 years time saying
> "I put a limit on readahead and the kernel is ignoring it"? I think
> typically we allow the sysadmin to override application requests,
> don't we?

ra_pages is just one hint for readahead; in reality, a sysadmin can't
know how many bytes are ideal for readahead. But the application often
knows how many bytes it will need, so here I think the application's
requirement should have higher priority, especially when the application
doesn't want the kernel to read ahead blindly.

And the madvise/fadvise(WILLNEED) syscall already reads bdi->io_pages
first, which is bigger than ra_pages.

Thanks,
Ming
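
For a rough sense of scale, the rareq-sz figures quoted above can be turned
into request counts for the test file. This is only back-of-the-envelope
arithmetic, not something taken from the patch: the class and method names
are made up for illustration, the file size is taken from the dd command
(bs=1024k count=2000), and the 4 KiB page size used to convert the 512-page
limit into bytes is an assumption that does not hold on every architecture.

```java
public class ReadaheadRequestCount {

    // Number of read requests needed to cover fileBytes when each request
    // carries at most requestBytes (ceiling division).
    public static long requestsNeeded(long fileBytes, long requestBytes) {
        return (fileBytes + requestBytes - 1) / requestBytes;
    }

    public static void main(String[] args) {
        final long KIB = 1024L;

        // File created by: dd if=/dev/zero of=... bs=1024k count=2000
        long fileBytes = 2000L * 1024 * KIB;

        // rareq-sz of 64 (read_ahead_kb=64) vs ~1024 (max_sectors_kb=1024)
        long small = requestsNeeded(fileBytes, 64 * KIB);
        long large = requestsNeeded(fileBytes, 1024 * KIB);

        System.out.println("requests at 64 KiB:   " + small);
        System.out.println("requests at 1024 KiB: " + large);
        System.out.println("ratio:                " + (small / large));

        // Assumption: 4 KiB pages. Size of the 512-page WILLNEED limit.
        long pageBytes = 4 * KIB;
        System.out.println("512 pages in KiB:     " + (512 * pageBytes / KIB));
    }
}
```

Under these assumptions it prints 32000 requests at 64 KiB against 2000 at
1024 KiB (a factor of 16), and 2048 KiB for the 512-page limit. The request
count ratio is not the same thing as the ~5x slowdown reported above, which
depends on the device and the system.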