From: Sasha Levin <sashal@kernel.org>
To: tools@kernel.org
Cc: linux-kernel@vger.kernel.org, torvalds@linux-foundation.org,
broonie@kernel.org, Sasha Levin <sashal@kernel.org>
Subject: [RFC v2 2/7] LLMinus: Add vectorize command with fastembed
Date: Sun, 11 Jan 2026 16:29:10 -0500
Message-ID: <20260111212915.195056-3-sashal@kernel.org>
In-Reply-To: <20260111212915.195056-1-sashal@kernel.org>
Add the vectorize command, which generates embeddings for stored conflict
resolutions using the BGE-small-en-v1.5 model via fastembed. The model
produces 384-dimensional vectors. Processing is batched, and the store is
saved after each batch so an interrupted run loses at most one batch of
work. Resolutions that already have embeddings are skipped.
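The batched, resumable loop can be sketched in isolation. This is a stand-in,
not the patch itself: embed_batch() is a hypothetical stub replacing fastembed,
and the Resolution struct is reduced to the two fields the loop touches:

```rust
// Sketch of the batched flow described above (stub embedder, not fastembed).
struct Resolution {
    text: String,
    embedding: Option<Vec<f32>>,
}

// Stand-in for fastembed's TextEmbedding::embed(): one fixed-size vector per text.
fn embed_batch(texts: &[&str]) -> Vec<Vec<f32>> {
    texts.iter().map(|t| vec![t.len() as f32; 4]).collect()
}

fn main() {
    let mut store = vec![
        Resolution { text: "hunk A".into(), embedding: Some(vec![0.0; 4]) }, // skipped
        Resolution { text: "hunk B".into(), embedding: None },
        Resolution { text: "hunk C".into(), embedding: None },
    ];

    // Index only the resolutions that still need embeddings.
    let pending: Vec<usize> = store
        .iter()
        .enumerate()
        .filter(|(_, r)| r.embedding.is_none())
        .map(|(i, _)| i)
        .collect();

    for chunk in pending.chunks(2) {
        let texts: Vec<&str> = chunk.iter().map(|&i| store[i].text.as_str()).collect();
        let vectors = embed_batch(&texts);
        for (&i, v) in chunk.iter().zip(vectors) {
            store[i].embedding = Some(v);
        }
        // The real command saves the store to disk here, so a crash
        // loses at most one batch of work.
    }

    let done = store.iter().filter(|r| r.embedding.is_some()).count();
    println!("{done}"); // prints 3
}
```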
This enables RAG-based similarity search for finding historical conflict
resolutions similar to current merge conflicts. Also adds cosine_similarity()
and init_embedding_model() helpers with corresponding tests.
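For reference, cosine_similarity() is the textbook definition,
dot(a, b) / (|a| * |b|); a minimal standalone copy is below. Since BGE
embeddings are L2-normalized, this reduces to a plain dot product for them:

```rust
/// Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 for mismatched or zero-norm input.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Two unit vectors: dot = 0.6*0.8 + 0.8*0.6 = 0.96, both norms are 1.0.
    let a = [0.6f32, 0.8];
    let b = [0.8f32, 0.6];
    println!("{:.2}", cosine_similarity(&a, &b)); // prints 0.96
}
```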
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
tools/llminus/Cargo.toml | 1 +
tools/llminus/src/main.rs | 157 ++++++++++++++++++++++++++++++++++++++
2 files changed, 158 insertions(+)
diff --git a/tools/llminus/Cargo.toml b/tools/llminus/Cargo.toml
index bdb42561a056..86740174de59 100644
--- a/tools/llminus/Cargo.toml
+++ b/tools/llminus/Cargo.toml
@@ -10,6 +10,7 @@ repository = "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[dependencies]
anyhow = "1"
clap = { version = "4", features = ["derive"] }
+fastembed = "5"
rayon = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
diff --git a/tools/llminus/src/main.rs b/tools/llminus/src/main.rs
index 508bdc085173..b97505d0cd99 100644
--- a/tools/llminus/src/main.rs
+++ b/tools/llminus/src/main.rs
@@ -2,6 +2,7 @@
use anyhow::{bail, Context, Result};
use clap::{Parser, Subcommand};
+use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};
use rayon::prelude::*;
use serde::{Deserialize, Serialize};
use std::collections::HashSet;
@@ -28,6 +29,12 @@ enum Commands {
/// Git revision range (e.g., "v6.0..v6.1"). If not specified, learns from entire history.
range: Option<String>,
},
+ /// Generate embeddings for stored resolutions (for RAG similarity search)
+ Vectorize {
+ /// Batch size for embedding generation (default: 64)
+ #[arg(short, long, default_value = "64")]
+ batch_size: usize,
+ },
}
/// A single diff hunk representing a change region
@@ -588,11 +595,118 @@ fn learn(range: Option<&str>) -> Result<()> {
Ok(())
}
+/// Compute cosine similarity between two vectors
+fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
+ if a.len() != b.len() || a.is_empty() {
+ return 0.0;
+ }
+
+ let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
+ let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
+ let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
+
+ if norm_a == 0.0 || norm_b == 0.0 {
+ return 0.0;
+ }
+
+ dot / (norm_a * norm_b)
+}
+
+/// Initialize the BGE-small embedding model
+fn init_embedding_model() -> Result<TextEmbedding> {
+ TextEmbedding::try_new(
+ InitOptions::new(EmbeddingModel::BGESmallENV15)
+ .with_show_download_progress(true),
+ ).context("Failed to initialize embedding model")
+}
+
+fn vectorize(batch_size: usize) -> Result<()> {
+ let store_path = Path::new(STORE_PATH);
+
+ if !store_path.exists() {
+ bail!("No resolutions found. Run 'llminus learn' first.");
+ }
+
+ let mut store = ResolutionStore::load(store_path)?;
+
+ // Count how many need embeddings
+ let need_embedding: Vec<usize> = store
+ .resolutions
+ .iter()
+ .enumerate()
+ .filter(|(_, r)| r.embedding.is_none())
+ .map(|(i, _)| i)
+ .collect();
+
+ if need_embedding.is_empty() {
+ println!("All {} resolutions already have embeddings.", store.resolutions.len());
+ return Ok(());
+ }
+
+ println!("Found {} resolutions needing embeddings", need_embedding.len());
+ println!("Initializing embedding model (BGE-small-en, ~33MB download on first run)...");
+
+ // Initialize the embedding model
+ let mut model = init_embedding_model()?;
+
+ println!("Model loaded. Generating embeddings...\n");
+
+ // Process in batches
+ let total_batches = need_embedding.len().div_ceil(batch_size);
+
+ for (batch_num, chunk) in need_embedding.chunks(batch_size).enumerate() {
+ // Collect texts for this batch
+ let texts: Vec<String> = chunk
+ .iter()
+ .map(|&i| store.resolutions[i].to_embedding_text())
+ .collect();
+
+ // Generate embeddings
+ let embeddings = model
+ .embed(texts, None)
+ .context("Failed to generate embeddings")?;
+
+ // Assign embeddings back to resolutions
+ for (j, &idx) in chunk.iter().enumerate() {
+ store.resolutions[idx].embedding = Some(embeddings[j].clone());
+ }
+
+ // Progress report
+ let done = batch_num * batch_size + chunk.len();
+ let pct = done as f64 / need_embedding.len() as f64 * 100.0;
+ println!(
+ " Batch {}/{}: {:.1}% ({}/{})",
+ batch_num + 1,
+ total_batches,
+ pct,
+ done,
+ need_embedding.len()
+ );
+
+ // Save after each batch (incremental progress)
+ store.save(store_path)?;
+ }
+
+ // Final stats
+ let json_size = std::fs::metadata(store_path).map(|m| m.len()).unwrap_or(0);
+ let with_embeddings = store.resolutions.iter().filter(|r| r.embedding.is_some()).count();
+
+ println!("\nResults:");
+ println!(" Total resolutions: {}", store.resolutions.len());
+ println!(" With embeddings: {}", with_embeddings);
+ println!(" Embedding dimensions: 384");
+ println!(" Output size: {:.2} MB", json_size as f64 / 1024.0 / 1024.0);
+ println!("\nEmbeddings saved to: {}", store_path.display());
+
+ Ok(())
+}
+
fn main() -> Result<()> {
let cli = Cli::parse();
match cli.command {
Commands::Learn { range } => learn(range.as_deref()),
+ Commands::Vectorize { batch_size } => vectorize(batch_size),
}
}
@@ -613,6 +727,7 @@ fn test_learn_command_parses() {
let cli = Cli::try_parse_from(["llminus", "learn"]).unwrap();
match cli.command {
Commands::Learn { range } => assert!(range.is_none()),
+ _ => panic!("Expected Learn command"),
}
}
@@ -621,9 +736,51 @@ fn test_learn_command_with_range() {
let cli = Cli::try_parse_from(["llminus", "learn", "v6.0..v6.1"]).unwrap();
match cli.command {
Commands::Learn { range } => assert_eq!(range, Some("v6.0..v6.1".to_string())),
+ _ => panic!("Expected Learn command"),
}
}
+ #[test]
+ fn test_vectorize_command_parses() {
+ let cli = Cli::try_parse_from(["llminus", "vectorize"]).unwrap();
+ match cli.command {
+ Commands::Vectorize { batch_size } => assert_eq!(batch_size, 64),
+ _ => panic!("Expected Vectorize command"),
+ }
+ }
+
+ #[test]
+ fn test_vectorize_command_with_batch_size() {
+ let cli = Cli::try_parse_from(["llminus", "vectorize", "-b", "128"]).unwrap();
+ match cli.command {
+ Commands::Vectorize { batch_size } => assert_eq!(batch_size, 128),
+ _ => panic!("Expected Vectorize command"),
+ }
+ }
+
+ #[test]
+ fn test_cosine_similarity() {
+ // Identical vectors should have similarity 1.0
+ let a = vec![1.0, 0.0, 0.0];
+ let b = vec![1.0, 0.0, 0.0];
+ assert!((cosine_similarity(&a, &b) - 1.0).abs() < 0.0001);
+
+ // Orthogonal vectors should have similarity 0.0
+ let a = vec![1.0, 0.0, 0.0];
+ let b = vec![0.0, 1.0, 0.0];
+ assert!((cosine_similarity(&a, &b) - 0.0).abs() < 0.0001);
+
+ // Opposite vectors should have similarity -1.0
+ let a = vec![1.0, 0.0, 0.0];
+ let b = vec![-1.0, 0.0, 0.0];
+ assert!((cosine_similarity(&a, &b) - (-1.0)).abs() < 0.0001);
+
+ // Different length vectors return 0
+ let a = vec![1.0, 0.0];
+ let b = vec![1.0, 0.0, 0.0];
+ assert_eq!(cosine_similarity(&a, &b), 0.0);
+ }
+
#[test]
fn test_get_file_type() {
assert_eq!(get_file_type("foo/bar.c"), "c");
--
2.51.0