From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id BE5DCC48260 for ; Fri, 16 Feb 2024 08:12:29 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ratK4-000874-1R; Fri, 16 Feb 2024 03:11:52 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ratK2-00086v-Uw for qemu-devel@nongnu.org; Fri, 16 Feb 2024 03:11:50 -0500 Received: from mail-pl1-x62c.google.com ([2607:f8b0:4864:20::62c]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1ratK0-0001n8-T1 for qemu-devel@nongnu.org; Fri, 16 Feb 2024 03:11:50 -0500 Received: by mail-pl1-x62c.google.com with SMTP id d9443c01a7336-1d94b222a3aso16820665ad.2 for ; Fri, 16 Feb 2024 00:11:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1708071107; x=1708675907; darn=nongnu.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=H0nvs+DlMktNMhiA9OqUf4bQ5k7B48R9mI5195GPHO4=; b=N4gRJ1vhbb4rCEyUVYzv+KjLy9SzymrxIhL4yHnEiSs1TIxs45gn4a4dkW7LKgtidW YKZelS/FKcoFd0Sl6zghOCCaklRMiT18UmDvfwguyO2Moq9Ncv7lhp6viWWBCYO9HRmB fnFjqxPST+XG8di+cnln9FKB2qq1eXP/Xy95QkeurIYO7Loj6W34nYYk9y11LhSuRWoo fB1mKFXoNA/RjLl++V8u07iMWdOmgLDWAxeV03EVM+7VPsK+DWcwrgYHpucbyLkYRzqy K20QPtq/GsLUyV7PG2SCFwnOwGuEck5iIC870jdO6vZOHLDfkrvNzSnCAyYkCJPDxIz6 VArQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708071107; x=1708675907; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=H0nvs+DlMktNMhiA9OqUf4bQ5k7B48R9mI5195GPHO4=; b=SaEMPGqDOkgRFCxL1PyrfcaFa8JbzI2lyKgXuZWR9ZRUUuOs1tKk23JE9dmIu9eOfU oDWTgfk5YcLTBpJEaD+Ev8q2cMHI9VMCOJWvYIvSN+/1FJtjGjw6q33Q1rfaxYapoNR/ KouWr28iXCLN7ZJ0KSwkEsrDu0n1gXMSGiWLFvzHHGNcaDNSjri6NkWjcc9ooudJc0/q mb44sLhRHivmsKiMJbcATkxauy6wJQCQACxD7O5p9rJqG2/+5hPZX/2jP3Laf101RG3J IrtRVTvgysbbXbrZs1D17/DCsqjnMp4bprKq83dgd0d8XjgUMZRr4BtOzugFIaQJiC+4 foIA== X-Gm-Message-State: AOJu0YwFzBBiDqroUfcb2fdDnjLYjVxrNVrM1LJvWWG3kztkCsl8REhn aFW4goN5wlchXMNTdVUp5BAoWaWU51fUqiRM38oQBPhaSyAr45WaERjDWSOMuDA= X-Google-Smtp-Source: AGHT+IGy3bG48CrY1b12v0eZaPXVzmOGRhVEImI4csgEpZkJ1MSZ6VgCdnz+HpyPaAV3YFuino83yA== X-Received: by 2002:a17:902:e748:b0:1db:ab30:38f5 with SMTP id p8-20020a170902e74800b001dbab3038f5mr893915plf.44.1708071106657; Fri, 16 Feb 2024 00:11:46 -0800 (PST) Received: from [172.20.1.19] (173-197-098-125.biz.spectrum.com. [173.197.98.125]) by smtp.gmail.com with ESMTPSA id i19-20020a17090320d300b001db3be9eeadsm2461501plb.16.2024.02.16.00.11.45 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 16 Feb 2024 00:11:46 -0800 (PST) Message-ID: <08ce9292-75a0-4092-a12e-9a12826847b0@linaro.org> Date: Thu, 15 Feb 2024 22:11:43 -1000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v4 00/10] Optimize buffer_is_zero Content-Language: en-US To: Alexander Monakov Cc: qemu-devel@nongnu.org, mmromanov@ispras.ru References: <20240215081449.848220-1-richard.henderson@linaro.org> <72d0e01e-ca54-4fc1-805c-b8de7175d64f@linaro.org> <2d211317-873e-43c6-a246-bd8bf84f048b@linaro.org> From: Richard Henderson In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::62c; envelope-from=richard.henderson@linaro.org; helo=mail-pl1-x62c.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org On 2/15/24 13:37, Alexander Monakov wrote: > Ah, I guess you might be running at low perf_event_paranoid setting that > allows unprivileged sampling of kernel events? In our submissions the > percentage was for perf_event_paranoid=2, i.e. relative to Qemu only, > excluding kernel time under syscalls. Ok. Eliminating kernel samples makes things easier to see. But I still do not see a 40% reduction in runtime. Just so we're on the same page: > Retrieve IE11.Win7.VirtualBox.zip from > https://archive.org/details/ie11.win7.virtualbox > and use > > unzip -p IE11.Win7.VirtualBox.zip | tar xv > > to extract 'IE11 - Win7-disk001.vmdk'. > > (Mikhail used a different image when preparing the patch) > > On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running > > qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2 With this, I see virtually all of the runtime in libz.so. Therefore I converted this to raw first, to focus on the issue. For avoidance of doubt: $ ls -lsh test.raw && sha256sum test.raw 12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80 test.raw Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04) master: 57.48% qemu-img-m [.] buffer_zero_avx2 3.60% qemu-img-m [.] is_allocated_sectors.part.0 2.61% qemu-img-m [.] buffer_is_zero 63.69% -- total v3: 48.86% qemu-img-v3 [.] is_allocated_sectors.part.0 3.79% qemu-img-v3 [.] buffer_zero_avx2 52.65% -- total -17% -- reduction from master v4: 54.60% qemu-img-v4 [.] buffer_is_zero_ge256 3.30% qemu-img-v4 [.] buffer_zero_avx2 3.17% qemu-img-v4 [.] is_allocated_sectors.part.0 61.07% -- total -4% -- reduction from master v4+: 46.65% qemu-img [.] is_allocated_sectors.part.0 3.49% qemu-img [.] buffer_zero_avx2 0.05% qemu-img [.] buffer_is_zero_ge256 50.19% -- total -21% -- reduction from master The v4+ puts the 3 byte test back inline, like in your v3. Importantly, it must be as 3 short-circuting tests, where my v4 "simplified" this to (s | m | e) != 0, on the assumption that the reduced number of branches would help. Diving into perf, it becomes clear why: 57.36 │ cmpb $0x0,(%rbx) 4.02 │ ↓ jne 89 21.84 │ cmpb $0x0,0x1ff(%rbx) 0.64 │ ↓ jne 89 8.45 │ cmpb $0x0,0x100(%rbx) 0.26 │ ↓ jne 89 0.06 │ mov $0x200,%esi │ mov %rbx,%rdi 0.07 │ → call buffer_is_zero_ge256 The three bytes are on 3 different cachelines. Judging by the relative percentages, it would seem that the first byte alone eliminates slightly more than half of all blocks; the last byte eliminates more than half again; the middle byte eliminates a fair fraction of the rest. With the short-circuit, the extra cachelines are not touched. This is so important that it should be spelled out in a comment. With that settled, I guess we need to talk about how much the out-of-line implementation matters at all. I'm thinking about writing a test/bench/bufferiszero, with all-zero buffers of various sizes and alignments. With that it would be easier to talk about whether any given implementation is is an improvement for that final 4% not eliminated by the three bytes. > (which does tell us that qemu-img is doing I/O inefficiently, it shouldn't > need two seconds to read a fully cached 5 Gigabyte file) Indeed! r~