From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jie Meng <jmeng@fb.com>
Subject: [PATCH v3 bpf-next] bpf, x86: Granlund-Montgomery optimization for 64-bit div/mod by immediate
Date: Tue, 21 Apr 2026 10:55:07 -0700
Message-ID: <20260421175507.2396332-1-jmeng@fb.com>
X-Mailer: git-send-email 2.52.0
X-Mailing-List: bpf@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain

Replace slow 64-bit div/idiv instructions (10-20 cycles on recent x86,
even slower on older models) with fast multiply-and-shift sequences
(~4-5 cycles) for BPF_K div/mod operations, using the
Granlund-Montgomery algorithm.

For each 64-bit BPF_ALU64_{DIV,MOD}_K instruction with a non-zero
immediate divisor, the JIT computes a magic multiplier and shift at JIT
compile time via choose_multiplier_64(), then emits:

  Unsigned: floor(n/d) = mulhi(n, m) >> s
  Signed:   trunc(n/d) = mulhi_signed(n, m) >> s + sign_correction

Both the unsigned (mul + wide correction) and signed (imul + add-n
correction) variants are implemented. The remainder is computed as
r = n - q * d.

The optimization targets 64-bit operations only, as LLVM already
optimizes the 32-bit cases at the compiler level. The change does not
spend much effort on cases expected to be handled by LLVM, such as
32-bit division by an immediate or power-of-2 divisors; these produce
correct but suboptimal code sequences (still faster than div/idiv).
BPF LLVM cannot apply the optimization for 64-bit division because it
would need an instruction that can do 128-bit by 64-bit division (like
x86's DIV and IDIV).

The implementation adds the following helpers:

- div128_u64(): inline asm helper for 128-bit division using two divq
  steps, avoiding a libgcc dependency
- choose_multiplier_64(): compute magic multiplier m, shift s, and
  is_wide flag for a given divisor and precision
- emit_udivmod_const(): unsigned 64-bit div/mod code generation with
  mul/shr and an optional wide-correction ((n-t)/2+t) path
- emit_sdivmod_const(): signed 64-bit div/mod code generation with
  imul/sar, add-n correction for an overflowing m, sign correction
  (add the sign bit to round toward zero), and negation for negative
  divisors
- emit_divmod_const(): top-level dispatch; falls back to div/idiv for
  32-bit operations, divisor=0, or negative unsigned immediates

Test Plan:

test_bpf: Summary: 1061 PASSED, 0 FAILED, [1049/1049 JIT'ed]
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]
test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jie Meng <jmeng@fb.com>
---
 arch/x86/net/bpf_jit_comp.c | 286 ++++++++++++++++++++++++++++++++++++
 1 file changed, 286 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ea9e707e8abf..39182355d61d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1649,6 +1649,286 @@ static int emit_spectre_bhb_barrier(u8 **pprog, u8 *ip,
 	return 0;
 }
 
+/*
+ * ---------- Division by compile-time constant optimization ----------
+ *
+ * For 64-bit BPF_K (immediate divisor) div/mod operations, replace slow
+ * div/idiv instructions (15+ cycles) with fast multiply+shift sequences
+ * (~4 cycles).
+ *
+ * LLVM already optimizes power-of-2 divisors and 32-bit operations at the
+ * compiler level, so this JIT optimization only focuses on 64-bit division by
+ * general constants, both signed and unsigned. For power-of-2 divisors, the
+ * algorithm generates instructions yielding correct results, but isn't the
+ * most optimal for performance (still faster than div/idiv though).
+ */
+
+struct bpf_jit_div_const {
+	u64 m;		/* magic multiplier */
+	u8 shift;	/* post-shift amount */
+	bool is_wide;	/* unsigned: m > 2^64, needs (n-t)/2+t correction
+			 * signed: m overflows, needs add-n correction
+			 */
+};
+
+/*
+ * Divide a 128-bit value (n_hi:n_lo) by a u64 divisor using x86-64 divq.
+ * Do not rely on libgcc for 128-bit division. Two divq steps give a full
+ * 128-bit quotient.
+ */
+static inline unsigned __int128 div128_u64(u64 n_hi, u64 n_lo, u64 d)
+{
+	u64 q_hi, q_lo, rem;
+
+	asm("divq %[d]" : "=a"(q_hi), "=d"(rem)
+	    : "a"(n_hi), "d"(0ULL), [d] "rm"(d));
+	asm("divq %[d]" : "=a"(q_lo), "=d"(rem)
+	    : "a"(n_lo), "d"(rem), [d] "rm"(d));
+
+	return ((unsigned __int128)q_hi << 64) | q_lo;
+}
+
+/*
+ * Granlund-Montgomery choose_multiplier for 64-bit division by constant.
+ * Computes magic multiplier m and shift s such that:
+ *   unsigned: floor(n/d) = mulhi(n, m) >> s (with possible wide correction)
+ *   signed:   trunc(n/d) = mulhi_signed(n, m) >> s + sign_correction
+ *
+ * prec = 64 for unsigned, 63 for signed (sign bit reduces usable range).
+ */
+static bool choose_multiplier_64(u32 d, u8 prec, struct bpf_jit_div_const *c)
+{
+	unsigned __int128 mlow, mhigh;
+	u32 l, post_shift;
+
+	if (d <= 1)
+		return false;
+
+	l = fls(d - 1);
+
+	mlow = div128_u64(1ULL << l, 0, d);
+	mhigh = div128_u64(1ULL << l, 1ULL << (l + 64 - prec), d);
+
+	post_shift = l;
+	while (post_shift > 0) {
+		unsigned __int128 lo = mlow >> 1, hi = mhigh >> 1;
+
+		if (lo >= hi)
+			break;
+		mlow = lo;
+		mhigh = hi;
+		post_shift--;
+	}
+
+	c->m = (u64)mhigh;
+	c->shift = post_shift;
+	c->is_wide = (mhigh >> 64) != 0;
+	return true;
+}
+
+/* Unsigned 64-bit div/mod by constant via multiply+shift */
+static void emit_udivmod_const(u8 **pprog, u32 dst_reg, u32 d,
+			       const struct bpf_jit_div_const *c, bool is_mod)
+{
+	u8 *prog = *pprog;
+
+	/* For MOD, save original n for remainder = n - q * d */
+	if (is_mod) {
+		maybe_emit_1mod(&prog, dst_reg, false);	/* push dst */
+		EMIT1(add_1reg(0x50, dst_reg));
+	}
+
+	if (dst_reg != BPF_REG_0)
+		EMIT1(0x50); /* push rax */
+	if (dst_reg != BPF_REG_3)
+		EMIT1(0x52); /* push rdx */
+	if (dst_reg != BPF_REG_0)
+		emit_mov_reg(&prog, true, BPF_REG_0, dst_reg); /* mov rax, dst */
+	if (c->is_wide)
+		EMIT1(0x50); /* push rax (= n, for wide correction) */
+
+	/* movabs r11, m */
+	emit_mov_imm64(&prog, AUX_REG, (u32)(c->m >> 32), (u32)c->m);
+
+	/* mul r11 -> rdx:rax = rax * r11 (unsigned) */
+	EMIT1(add_1mod(0x48, AUX_REG));
+	EMIT2(0xF7, add_1reg(0xE0, AUX_REG));
+
+	if (c->is_wide) {
+		/*
+		 * Wide correction: quotient = ((n - t) >> 1 + t) >> (shift-1)
+		 * where t = mulhi(n, m) in rdx
+		 */
+		EMIT2(0x41, 0x5B); /* pop r11 (= n) */
+		/* sub r11, rdx */
+		EMIT1(add_2mod(0x48, AUX_REG, BPF_REG_3));
+		EMIT2(0x29, add_2reg(0xC0, AUX_REG, BPF_REG_3));
+		/* shr r11, 1 */
+		EMIT1(add_1mod(0x48, AUX_REG));
+		EMIT2(0xD1, add_1reg(0xE8, AUX_REG));
+		/* add r11, rdx */
+		EMIT1(add_2mod(0x48, AUX_REG, BPF_REG_3));
+		EMIT2(0x01, add_2reg(0xC0, AUX_REG, BPF_REG_3));
+		/* shr r11, shift-1 */
+		if (c->shift > 1) {
+			EMIT1(add_1mod(0x48, AUX_REG));
+			EMIT3(0xC1, add_1reg(0xE8, AUX_REG),
+			      c->shift - 1);
+		}
+		/* mov dst, r11 */
+		emit_mov_reg(&prog, true, dst_reg, AUX_REG);
+	} else {
+		/* shr rdx, shift */
+		if (c->shift) {
+			EMIT1(0x48);
+			EMIT3(0xC1, add_1reg(0xE8, BPF_REG_3), c->shift);
+		}
+		/* mov dst, rdx */
+		if (dst_reg != BPF_REG_3)
+			emit_mov_reg(&prog, true, dst_reg, BPF_REG_3);
+	}
+
+	if (dst_reg != BPF_REG_3)
+		EMIT1(0x5A); /* pop rdx */
+	if (dst_reg != BPF_REG_0)
+		EMIT1(0x58); /* pop rax */
+
+	/* dst = quotient; compute remainder if needed */
+	if (is_mod) {
+		/* imul r11, dst, d -> r11 = quotient * d */
+		maybe_emit_mod(&prog, dst_reg, AUX_REG, true);
+		if (is_imm8(d))
+			EMIT3(0x6B, add_2reg(0xC0, dst_reg, AUX_REG), d);
+		else
+			EMIT2_off32(0x69,
+				    add_2reg(0xC0, dst_reg, AUX_REG), d);
+		/* pop dst (= original n) */
+		maybe_emit_1mod(&prog, dst_reg, false);
+		EMIT1(add_1reg(0x58, dst_reg));
+		/* sub dst, r11 -> remainder = n - q*d */
+		maybe_emit_mod(&prog, dst_reg, AUX_REG, true);
+		EMIT2(0x29, add_2reg(0xC0, dst_reg, AUX_REG));
+	}
+
+	*pprog = prog;
+}
+
+/* Signed 64-bit div/mod by constant via multiply+shift */
+static void emit_sdivmod_const(u8 **pprog, u32 dst_reg, s32 imm,
+			       const struct bpf_jit_div_const *c, bool is_mod)
+{
+	u8 *prog = *pprog;
+	bool neg_divisor = (imm < 0);
+
+	/* For MOD, save original n for remainder = n - q * d */
+	if (is_mod) {
+		maybe_emit_1mod(&prog, dst_reg, false);	/* push dst */
+		EMIT1(add_1reg(0x50, dst_reg));
+	}
+
+	if (dst_reg != BPF_REG_0)
+		EMIT1(0x50); /* push rax */
+	if (dst_reg != BPF_REG_3)
+		EMIT1(0x52); /* push rdx */
+	if (dst_reg != BPF_REG_0)
+		emit_mov_reg(&prog, true, BPF_REG_0, dst_reg); /* mov rax, dst */
+	if (c->is_wide)
+		EMIT1(0x50); /* push rax (= n, for wide correction) */
+
+	/* movabs r11, m */
+	emit_mov_imm64(&prog, AUX_REG, (u32)(c->m >> 32), (u32)c->m);
+
+	/* imul r11 -> rdx:rax = rax * r11 (signed) */
+	EMIT1(add_1mod(0x48, AUX_REG));
+	EMIT2(0xF7, add_1reg(0xE8, AUX_REG));
+
+	if (c->is_wide) {
+		/* Wide correction: rdx += n (bit 63 set in m) */
+		EMIT2(0x41, 0x5B); /* pop r11 (= n) */
+		/* add rdx, r11 */
+		EMIT1(add_2mod(0x48, BPF_REG_3, AUX_REG));
+		EMIT2(0x01, add_2reg(0xC0, BPF_REG_3, AUX_REG));
+	}
+
+	/* sar rdx, shift */
+	if (c->shift) {
+		EMIT1(0x48);
+		EMIT3(0xC1, add_1reg(0xF8, BPF_REG_3), c->shift);
+	}
+
+	/* Sign correction: round toward zero by adding sign bit */
+	emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);	/* mov r11, rdx */
+	EMIT1(add_1mod(0x48, AUX_REG));
+	EMIT3(0xC1, add_1reg(0xE8, AUX_REG), 63);	/* shr r11, 63 */
+	EMIT1(add_2mod(0x48, BPF_REG_3, AUX_REG));
+	EMIT2(0x01, add_2reg(0xC0, BPF_REG_3, AUX_REG)); /* add rdx, r11 */
+
+	/* mov dst, rdx */
+	if (dst_reg != BPF_REG_3)
+		emit_mov_reg(&prog, true, dst_reg, BPF_REG_3);
+	if (dst_reg != BPF_REG_3)
+		EMIT1(0x5A); /* pop rdx */
+	if (dst_reg != BPF_REG_0)
+		EMIT1(0x58); /* pop rax */
+
+	/* neg dst (if divisor was negative) */
+	if (neg_divisor) {
+		EMIT1(add_1mod(0x48, dst_reg));
+		EMIT2(0xF7, add_1reg(0xD8, dst_reg));
+	}
+
+	/* dst = quotient; compute remainder if needed */
+	if (is_mod) {
+		/* imul r11, dst, d -> r11 = quotient * d */
+		maybe_emit_mod(&prog, dst_reg, AUX_REG, true);
+		if (is_imm8(imm))
+			EMIT3(0x6B, add_2reg(0xC0, dst_reg, AUX_REG), imm);
+		else
+			EMIT2_off32(0x69,
+				    add_2reg(0xC0, dst_reg, AUX_REG), imm);
+		/* pop dst (= original n) */
+		maybe_emit_1mod(&prog, dst_reg, false);
+		EMIT1(add_1reg(0x58, dst_reg));
+		/* sub dst, r11 -> remainder = n - q*d */
+		maybe_emit_mod(&prog, dst_reg, AUX_REG, true);
+		EMIT2(0x29, add_2reg(0xC0, dst_reg, AUX_REG));
+	}
+
+	*pprog = prog;
+}
+
+static bool emit_divmod_const(u8 **pprog, u32 dst_reg, s32 imm, bool is64,
+			      bool is_mod, bool is_signed)
+{
+	struct bpf_jit_div_const c;
+
+	/* Assuming that LLVM already optimizes 32-bit */
+	if (!is64 || imm == 0)
+		return false;
+
+	if (!is_signed) {
+		if (imm < 0)
+			return false;
+		if (!choose_multiplier_64((u32)imm, 64, &c))
+			return false;
+		emit_udivmod_const(pprog, dst_reg, (u32)imm, &c, is_mod);
+	} else {
+		u32 ad = (imm > 0) ? (u32)imm : (u32)(-imm);
+
+		if (ad <= 1)
+			return false;
+		if (!choose_multiplier_64(ad, 63, &c))
+			return false;
+		/* For signed, is_wide means bit 63 is set (m appears
+		 * negative to imul), not that it exceeds 2^64.
+		 */
+		c.is_wide = (c.m >> 63) != 0;
+		emit_sdivmod_const(pprog, dst_reg, imm, &c, is_mod);
+	}
+
+	return true;
+}
+
 static int do_jit(struct bpf_verifier_env *env, struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
 		  int oldproglen, struct jit_context *ctx, bool jmp_padding)
 {
@@ -1891,6 +2171,12 @@ static int do_jit(struct bpf_verifier_env *env, struct bpf_prog *bpf_prog, int *
 	case BPF_ALU64 | BPF_DIV | BPF_K: {
 		bool is64 = BPF_CLASS(insn->code) == BPF_ALU64;
 
+		if (BPF_SRC(insn->code) == BPF_K &&
+		    emit_divmod_const(&prog, dst_reg, imm32, is64,
+				      BPF_OP(insn->code) == BPF_MOD,
+				      insn->off != 0))
+			break;
+
 		if (dst_reg != BPF_REG_0)
 			EMIT1(0x50); /* push rax */
 		if (dst_reg != BPF_REG_3)
-- 
2.52.0