From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8d781f45b6a6fb434aa386dccb7f8f424ec1ffbe.camel@oldum.net>
Subject: Re: [PATCH v3 6/7] 9p/trans_virtio: support larger msize values
From: Nikolay Kichukov
To: Christian Schoenebeck, Dominique Martinet
Cc: v9fs-developer@lists.sourceforge.net, netdev@vger.kernel.org,
	Eric Van Hensbergen, Latchesar Ionkov, Greg Kurz, Vivek Goyal
Date: Sat, 20 Nov 2021 22:28:35 +0100
In-Reply-To: <2717208.9V0g2NVZY4@silver>
References: <2717208.9V0g2NVZY4@silver>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
User-Agent: Evolution 3.42.1
X-Mailing-List: netdev@vger.kernel.org

On Sat, 2021-11-20 at 15:46 +0100, Christian Schoenebeck wrote:
> On Saturday, 20 November 2021 12:45:09 CET Dominique Martinet wrote:
> > (Thanks for restarting this thread, this had totally slipped out of
> > my mind...)
> 
> Hi guys,
> 
> I have (more or less) silently been working on these patches on all
> ends in the meantime. If desired, I'll try to find some time next week
> to summarize the current status of this overall effort and what still
> needs to be done.

Great, I would be more than happy to test the next version of these
patches.

> > Nikolay Kichukov wrote on Sat, Nov 20, 2021 at 12:20:35PM +0100:
> > > When the client mounts the share via virtio, requested msize is:
> > > 10485760 or 104857600
> > > 
> > > however the mount succeeds with:
> > > msize=507904 in the end as per the /proc filesystem. This is less
> > > than the previous maximum value.
> > 
> > (Not sure about this, I'll test these patches tomorrow, but since
> > something failed I'm not surprised you have less than what you could
> > have here: what do you get with a more reasonable value like 1M
> > first?)

It worked with 1MB; I can stick to this for the time being. Are the
kernel patches supposed to be included in the KVM host kernel, or would
the guest kernel suffice?

> The highest 'msize' value possible for me with this particular version
> of the kernel patches is 4186212 (so slightly below 4MB).
> 
> Some benchmarks, linearly reading a 12 GB file:
> 
> msize    average      notes
> 
> 8 kB     52.0 MB/s    default msize of Linux kernel <v5.15
> 128 kB   624.8 MB/s   default msize of Linux kernel >=v5.15
> 512 kB   1961 MB/s    current max. msize with any Linux kernel <=v5.15
> 1 MB     2551 MB/s    this msize would violate current virtio specs
> 2 MB     2521 MB/s    this msize would violate current virtio specs
> 4 MB     2628 MB/s    planned milestone
> 
> That does not mean it already makes sense to use these patches as-is in
> this version to increase overall performance, but the numbers already
> suggest that increasing msize can improve performance on large
> sequential I/O. There are still some things to do though to keep other
> usage patterns from slowing down with rising msize, which I will
> describe in a separate email.
> 
> I also have an experimental version of the kernel patches and QEMU that
> goes as high as msize=128MB. But that's a highly hacked version that
> copies buffers back and forth between the 9p client level and the
> virtio level, with intentionally large buffer sizes on every 9p message
> type. This was solely meant as a stress test, i.e. whether it was
> possible to go as high as virtio's theoretical protocol limit in the
> first place, and whether it was stable. This stress test was not about
> performance at all. And yes, I had it running with 128MB for weeks
> without issues (except for being very slow, of course, due to the hacks
> used).
> 
> > > In addition to the above, when the kernel on the guest boots and
> > > loads 9pfs support, the attached memory allocation failure trace is
> > > generated.
> > > 
> > > Is anyone else seeing similar, and was anybody able to get msize
> > > set to 10MB via the virtio protocol with these patches?
> > 
> > I don't think the kernel would ever allow this with the given code,
> > as the "common" part of 9p is not smart enough to use scatter-gather
> > and tries to do a big allocation (almost) the size of the msize:
> > 
> > ---
> >         clnt->fcall_cache =
> >                 kmem_cache_create_usercopy("9p-fcall-cache", clnt->msize,
> >                                            0, 0, P9_HDRSZ + 4,
> >                                            clnt->msize - (P9_HDRSZ + 4),
> >                                            NULL);
> > 
> > ...
> >                 fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > ---
> > So in practice, you will be capped at 2MB as that is the biggest the
> > slab will be able to hand over in a single chunk.
> 
> I did not encounter a 2MB limit here. But kmalloc() clearly has a 4MB
> limit, so trying an msize larger than 4MB inevitably causes a memory
> allocation error. In my tests this allocation error would always happen
> immediately at mount time, causing an instant kernel oops.
> 
> > It'll also make allocation failures quite likely as soon as the
> > system has had some uptime (depending on your workload; look at
> > /proc/buddyinfo to see whether your machines normally have 2MB chunks
> > available), so I would really not recommend running with buffers
> > bigger than e.g. 512k on real workloads -- it looks great on
> > benchmarks, especially as it's on its own slab, so as long as you're
> > doing a lot of requests it will dish out buffers fast, but it'll
> > likely be unreliable over time.
> > (oh, and we allocate two of these per request...)
> > 
> > The next step to support large buffers really would be splitting that
> > buffer to:
> >  1/ not allocate huge buffers for small metadata ops, e.g. an open
> >     call certainly doesn't need to allocate 10MB of memory
> >  2/ support splitting the buffer into some scatter-gather array
> > 
> > Ideally we'd only allocate on an as-needed basis: most of the
> > protocol calls bound how much data is supposed to come back, and we
> > know how much we want to send (it's a format string actually, but we
> > can put an upper bound on it quite easily), so one would need to
> > adjust all protocol calls to pass this info to
> > p9_client_rpc/p9_client_zc_rpc so it only allocates buffers as
> > required, if necessary in multiple reasonably-sized segments (I'd
> > love 2MB hugepages-backed folios...), and have all transports use
> > these buffers.
> 
> It is not that bad in terms of pending work. One major thing that needs
> to be done is to cap the majority of 9p message types to allocate only
> as much as they need, which for most message types is <8k. Right now
> they always simply kmalloc(msize), which hurts with increasing msize
> values. That task does not need many changes though.
> 
> > I have a rough idea on how to do all this but honestly less than zero
> > time for that, so happy to give advice or review any patch, but it's
> > going to be a lot of work that stands in the way of really big IOs.
> 
> Reviews of the patches would actually help a lot to bring this overall
> effort forward, but probably rather starting with the upcoming next
> version of the kernel patches, not this old v3.
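
That all makes sense. Just to check my understanding of the
per-message-type capping idea, here is a rough sketch of the kind of
helper I would imagine (the function name, the payload argument and the
8k cutoff are purely my own illustration, not taken from the actual
patches):

---
#include <linux/minmax.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

/*
 * Hypothetical sketch only: pick a per-request buffer size instead of
 * always allocating a full msize buffer. Small metadata ops (Topen,
 * Twalk, ...) comfortably fit in a few kB; only read/write payloads
 * need anything close to msize.
 */
static size_t p9_req_buf_size(struct p9_client *c, int8_t type,
			      size_t payload)
{
	switch (type) {
	case P9_TREAD:
	case P9_TWRITE:
		/* I/O header plus payload, capped at the negotiated msize */
		return min_t(size_t, payload + P9_IOHDRSZ, c->msize);
	default:
		/* metadata ops: a small fixed buffer should be plenty */
		return min_t(size_t, 8 * 1024, c->msize);
	}
}
---

If that is roughly the direction, I'm happy to test whatever capping
approach the next version of the patches ends up taking.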
> 
> > > [    1.527981] 9p: Installing v9fs 9p2000 file system support
> > > [    1.528173] ------------[ cut here ]------------
> > > [    1.528174] WARNING: CPU: 1 PID: 791 at mm/page_alloc.c:5356
> > >                __alloc_pages+0x1ed/0x290
> > 
> > This warning is exactly what I was saying about the allocation cap:
> > you've requested an allocation that was bigger than the max
> > __alloc_pages is willing to provide (MAX_ORDER, 11, so 2MB as I was
> > saying).

Yes, this is no longer happening now that I have lowered the value of
the msize parameter.

> Best regards,
> Christian Schoenebeck
> 