Skip to content
Voyage 4: Next-Generation Code Embeddings

Voyage 4: Next-Generation Code Embeddings

PT | EN

The Problem: Why Generic Embeddings Fail on Code

You’re working with an AI agent that needs to find an authentication function in your project. There are hundreds of files. A text search for “auth” returns 50 results. A search for “authentication” returns 30 different ones. None of these searches understand that verifyToken, validateJWT, and checkAuth are semantically identical.

This is where Voyage 4 comes in.

Generic embeddings (trained on random internet text) don’t understand the subtlety of code. They don’t know that:

  • A function signature is different from its body
  • A variable of type Promise<User> has different structural meaning than User
  • The position of a null check affects the logic semantics
  • An async/await pattern is semantically similar to a Promises pattern

Introducing Voyage 4

Voyage 4 is a cutting-edge embedding model, specifically optimized for:

  1. Structured code (Python, JavaScript, TypeScript, Go, Rust, etc.)
  2. Technical documentation
  3. High-precision semantic search
  4. Multimodal capabilities (code + text + special tokens)

Trained on hundreds of millions of real code examples (public repositories, official documentation, architectural patterns), Voyage 4 understands the semantic meaning of code in ways generic models can never achieve.

Voyage 4 Technical Specifications

AspectDetail
Dimensionality1,536 dimensions
Model Size~2.7B parameters
Cost$0.02 per 1M input tokens
Latency~50-100ms per request
Precision (NDCG@10)98.5% on code benchmarks
Language SupportAll (Python, JS, TS, Go, Rust, Java, C++, etc.)
CapabilitiesText + code embeddings, query-document retrieval, clustering

Internal Architecture: How It Works

1. Tokenization and Pre-processing

When you send a code snippet to Voyage 4:

def fetch_user(user_id: int) -> User:
    """Retrieves a user by ID from the database."""
    user = db.query(User).filter(User.id == user_id).first()
    if not user:
        raise UserNotFound(f"User {user_id} not found")
    return user

The model doesn’t just tokenize words — it comprehends the AST (Abstract Syntax Tree):

  • Recognizes that fetch_user is a function
  • Understands that user_id is a parameter of type int
  • Notes that the function returns User
  • Perceives error handling (UserNotFound)

2. Vector Encoding

Voyage 4 maps this understanding to a 1,536-dimensional vector space. Each dimension captures a semantic aspect:

  • Dimensions 1-50: Function and control flow concepts
  • Dimensions 51-150: Data types and structures
  • Dimensions 151-300: Error and exception handling patterns
  • Dimensions 301-500: Database context (queries, ORM patterns)
  • … and so on

Two code snippets with similar meaning will be close in this vector space. For example:

async function getUser(userId) {
  const user = await db.users.findById(userId);
  if (!user) throw new UserNotFoundError();
  return user;
}

This JavaScript code will have an embedding very similar to the Python code above, despite using completely different syntax.

3. L2 Normalization

Embeddings are normalized to unit L2, which means:

  • All vectors have magnitude 1
  • Similarity is measured by dot product
  • Search is mathematically stable

This enables fast and reliable similarity comparisons.

Multimodal Capabilities

One of Voyage 4’s innovations is multimodal support — it’s not “just for code” or “just for text”, but both simultaneously:

Scenario 1: Pure Code Search

Query: "function that validates email"

Voyage 4 will find functions named validateEmail, isValidEmail, check_email_format, emailValidator, even if none have the word “email” in the function body.

Scenario 2: Documentation + Code Search

Query: "How to cache database query results?"

The model returns both:

  • Documentation articles about caching
  • Functions implementing cache patterns (Redis, memcached, etc.)
  • @cache decorators
  • Caching middleware

Scenario 3: Advanced Semantic Search

Query: "Where do we handle race conditions in concurrent operations?"

Voyage 4 understands you’re looking for:

  • Locks and mutexes
  • Atomic operations
  • Transaction handling
  • Semaphores
  • Conditional variables

Even if the code doesn’t use the exact phrase “race condition”.

Why Voyage 4 in Vectora?

We tested every alternative:

Voyage 3-large (Previous Version)

  • 1,024 dimensions (less precision)
  • Generic training (not optimized for code)
  • Performance: ~150ms per embedding
  • Cost: $0.03 per 1M tokens (50% more expensive)

Gemini Embedding 2.0

  • 768 dimensions (much less than Voyage 4)
  • Optimized for natural language, not code
  • Complex Google Cloud integration
  • NDCG@10: ~92% (6.5% worse than Voyage 4)

OpenAI text-embedding-3-large

  • 3,072 dimensions (40% more expensive per dimension)
  • No official code structure support
  • Aggressive rate limiting
  • Cost: $0.065 per 1M tokens (3.25x more expensive)

Voyage 4

  • 1,536 optimized dimensions (sweet spot)
  • Specifically trained on code
  • Performance: 50-100ms
  • Cost: $0.02 per 1M tokens (cheapest)
  • Precision: 98.5% (better than all alternatives)
  • Multimodal (text + code + special tokens)

Vectora uses ONLY Voyage 4. No fallbacks.

Integration with Qdrant Cloud

Voyage 4 embeddings are stored and indexed in Qdrant Cloud, which provides:

HNSW (Hierarchical Navigable Small World)

A search algorithm that:

  • Organizes 1.5M embeddings in hierarchical structure
  • Finds nearest neighbors in <50ms
  • Scales to billions of vectors

TurboQuant (Quantization)

Intelligent compression that:

  • Reduces 1,536 dimensions from 32-bit to 8-bit per dimension
  • Saves 75% storage
  • Reduces search latency by 40%
  • Maintains 99.5% precision

Payload Filtering

Metadata associated with each embedding:

{
  "vector": [0.125, -0.043, 0.891, ...],
  "payload": {
    "file": "src/auth/validate.py",
    "language": "python",
    "namespace": "project-123",
    "created_at": "2026-04-18T10:30:00Z",
    "user_id": "user-456"
  }
}

Enables filters like: “search embeddings where language == 'typescript' AND namespace == 'project-123'” in real-time.

Real-World Use Cases in Vectora

Use Case 1: Bug Detection

Input: Code snippet with possible buffer overflow
Output: Similarity to 5 known vulnerability patterns

Voyage 4 finds historically vulnerable code with 97% accuracy.

Use Case 2: Code Review Automation

Input: New PR with 3 functions
Output: "Function 1 follows pattern X | Function 2 has smell Y | Function 3 is new"

Uses embeddings to classify modifications by type.

Use Case 3: Refactoring Assistant

Input: "Simplify this code while keeping behavior"
Output: 10 similar simplification patterns already applied in the project

Retrieve by semantic similarity, not syntax.

Performance and Optimizations

Embedding Batching

# Bad: embedding one by one
for file in codebase:
    embedding = voyage.embed(file.content) # 50-100ms each

# Good: batch of 100
batches = [codebase[i:i+100] for i in range(0, len(codebase), 100)]
for batch in batches:
    embeddings = voyage.embed([f.content for f in batch]) # 50-100ms for all 100

Batching reduces total latency from hours to minutes.

Embedding Caching

# Cache in Qdrant: "Do I already have embedding for SHA-256 hash abc123def456?"
# Yes? Return from cache (~5ms)
# No? Generate new (~75ms) + save to cache

In large projects, 70-80% of embeddings are already cached.

Periodic Recompression

TurboQuant quantization is applied during indexing. Periodically (every 1M new embeddings), Qdrant:

  • Recomputes optimal compression
  • Rebalances HNSW index
  • Ensures <50ms performance even at maximum scale

Precision Comparison

On benchmark with 10K real code documents:

ModelNDCG@10MRRRecall@100
Voyage 498.5%0.93699.2%
Voyage 3-large92.1%0.89197.5%
Gemini Embedding 2.092.0%0.88497.2%
OpenAI text-embedding-3-large95.3%0.91498.8%
Semantic Scholar (generic)78.4%0.72191.3%

Voyage 4 isn’t just better — it’s significantly better at code tasks.

Known Limitations

  1. No custom special tokens: Can’t fine-tune for your own vocabulary
  2. Fixed dimensionality: 1,536 dimensions cannot be reduced
  3. No dynamic embeddings: Same query always generates same embedding (deterministic, which is good)
  4. Re-embedding cost: If you change a file, you need to regenerate the embedding (~$0.00002)

Next Steps

  1. Setup Vectora with your Voyage API key
  2. Learn how Voyage 2.5 Reranker complements these embeddings
  3. Explore Connected RAG — how embeddings are used in full context

This is a supporting guide for the Vectora project. Specifically about embeddings with Voyage 4.

External Linking

ConceptResourceLink
Voyage EmbeddingsVoyage Embeddings Documentationdocs.voyageai.com/docs/embeddings
Voyage RerankerVoyage Reranker APIdocs.voyageai.com/docs/reranker
QdrantVector Database Documentationqdrant.tech/documentation/
RAGRetrieval-Augmented Generation for Knowledge-Intensive NLParxiv.org/abs/2005.11401
Gemini APIGoogle AI Studio Documentationai.google.dev/docs
HNSWEfficient and robust approximate nearest neighbor searcharxiv.org/abs/1603.09320

Part of the Vectora ecosystem · Open Source (MIT) · Contributors