Date: Sat, 20 Nov 2021 20:45:09 +0900
From: Dominique Martinet
To: Nikolay Kichukov
Cc: Christian Schoenebeck, v9fs-developer@lists.sourceforge.net,
    netdev@vger.kernel.org, Eric Van Hensbergen, Latchesar Ionkov,
    Greg Kurz, Vivek Goyal
Subject: Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values

(Thanks for restarting this thread, this had totally slipped out of my
mind...)

Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> When the client mounts the share via virtio, requested msize is:
> 10485760 or 104857600
>
> however the mount succeeds with:
> msize=507904 in the end as per the /proc filesystem. This is less than
> the previous maximum value.

(Not sure about this, I'll test these patches tomorrow, but since
something failed I'm not surprised you have less than what you could
have here: what do you get with a more reasonable value like 1M first?)
> In addition to the above, when the kernel on the guest boots and loads
> 9pfs support, the attached memory allocation failure trace is generated.
>
> Is anyone else seeing similar and was anybody able to get msize set to
> 10MB via virtio protocol with these patches?

I don't think the kernel would ever allow this with the given code, as
the "common" part of 9p is not smart enough to use scatter-gather and
tries to do a big allocation (almost) the size of the msize:

---
clnt->fcall_cache =
	kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
				   0, 0, P9_HDRSZ + 4,
				   clnt->msize - (P9_HDRSZ + 4),
				   NULL);
...
fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
---

So in practice, you will be capped at 2MB, as that is the biggest chunk
the slab will be able to hand over in a single allocation.

It will also make allocation failures quite likely as soon as the system
has had some uptime (depending on your workload; look at /proc/buddyinfo
to check whether your machines normally have 2MB chunks available), so I
would really not recommend running with buffers bigger than e.g. 512k on
real workloads -- it looks great on benchmarks, especially as it's on
its own slab, so as long as you're doing a lot of requests it will dish
out buffers fast, but it will likely be unreliable over time.
(oh, and we allocate two of these per request...)

The next step to support large buffers really would be splitting that
buffer so that we:
1/ do not allocate huge buffers for small metadata ops, e.g. an open
   call certainly doesn't need to allocate 10MB of memory
2/ support splitting the buffer into some scatter-gather array

Ideally we'd only allocate on an as-needed basis: most of the protocol
calls bound how much data is supposed to come back, and we know how much
we want to send (it's a format string actually, but we can compute an
upper bound for it quite easily), so one would need to adjust all
protocol calls to pass this info to p9_client_rpc/p9_client_zc_rpc so it
only allocates buffers as required, if necessary in multiple
reasonably-sized segments (I'd love 2MB hugepage-backed folios...), and
have all transports use these buffers.

I have a rough idea of how to do all this but honestly less than zero
time for it, so I'm happy to give advice or review any patch, but it's
going to be a lot of work, and it stands in the way of really big IOs.

> [ 1.527981] 9p: Installing v9fs 9p2000 file system support
> [ 1.528173] ------------[ cut here ]------------
> [ 1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356 __alloc_pages+0x1ed/0x290

This warning is exactly what I was saying about the allocation cap:
you've requested an allocation bigger than the maximum __alloc_pages is
willing to provide (MAX_ORDER, 11, so 2MB as I was saying).

-- 
Dominique
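
For what it's worth, the arithmetic behind that warning can be checked
with a few lines of standalone C (userspace, assuming 4 KiB pages; in
the kernel the equivalent helper is get_order()): the 507904-byte buffer
the mount fell back to fits in an order-7 chunk, while a 10485760-byte
one would need an order-12 contiguous allocation, well past the
MAX_ORDER limit of 11 quoted above.

---
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Smallest buddy order whose block (2^order pages) can hold 'bytes'. */
static unsigned int order_for(unsigned long bytes)
{
	unsigned long pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

int main(void)
{
	unsigned long sizes[] = { 507904, 1048576, 10485760, 104857600 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("msize %9lu -> order %2u (%lu KiB contiguous)\n",
		       sizes[i], order_for(sizes[i]),
		       (PAGE_SIZE << order_for(sizes[i])) / 1024);
	return 0;
}
---

That prints orders 7, 8, 12 and 15 for the four sizes respectively.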
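
And to make the "allocate on an as-needed basis" direction a little more
concrete, here is a rough sketch of the shape such a change could take.
To be clear, p9_msg_size_hint() and p9_fcall_alloc_sized() below are
made up for illustration and do not exist in the current client, and the
8192 bound is an arbitrary placeholder (the real bound would be derived
from the protocol format string, as described above); only the types,
constants and allocator calls are the existing kernel ones.

---
#include <linux/errno.h>
#include <linux/minmax.h>
#include <linux/slab.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

/* Hypothetical per-call upper bound, derived from the request type:
 * only read/write need buffers anywhere near msize. */
static size_t p9_msg_size_hint(u8 type, size_t payload)
{
	switch (type) {
	case P9_TREAD:
	case P9_TWRITE:
		return payload + P9_IOHDRSZ;
	default:
		return 8192;	/* placeholder bound for metadata ops */
	}
}

/* Hypothetical allocation path: a small kmalloc for metadata requests,
 * falling back to the existing msize-sized fcall_cache for large I/O. */
static int p9_fcall_alloc_sized(struct p9_client *c, struct p9_fcall *fc,
				size_t hint)
{
	size_t size = min_t(size_t, hint, c->msize);

	if (size <= 8192) {
		fc->sdata = kmalloc(size, GFP_NOFS);
		fc->capacity = size;
		fc->cache = NULL;		/* freed with kfree() */
	} else {
		fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
		fc->capacity = c->msize;
		fc->cache = c->fcall_cache;	/* freed back to the cache */
	}
	return fc->sdata ? 0 : -ENOMEM;
}
---

The second half of what is described above -- actually splitting one
large buffer into multiple segments and teaching the transports to
consume a scatter-gather list -- is the harder part and is not shown
here.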