Building a Fair and Scalable File Distribution System with CRUSH

Introduction

When building a decentralized storage network with thousands of miners, one of the biggest challenges is this: how do you distribute files fairly across the network while ensuring reliability and performance, and preventing any single miner from dominating the system?

This is the problem we solved by implementing a CRUSH-inspired algorithm (Controlled Replication Under Scalable Hashing) for Hippius. In this article, we’ll walk you through how it works, why we built it, and the challenges we overcame.


The Problem: Fair Distribution at Scale

Imagine you have:

  • 📦 96,000+ files that need to be stored
  • 🖥️ 4,000+ miner families ready to store them
  • 🔄 5 replicas per file for redundancy
  • ⚖️ A requirement for fair distribution so no miner gets overwhelmed

Traditional approaches like random selection or round-robin quickly break down:

  • Random selection can lead to hot spots where some miners get overloaded
  • Round-robin doesn’t account for miner capacity, performance, or reliability
  • Simple hashing creates imbalance when miners join or leave

We needed something smarter.


What is CRUSH?

CRUSH (Controlled Replication Under Scalable Hashing) is a pseudo-random data distribution algorithm originally developed for Ceph, a distributed storage system. The key innovation is that CRUSH is:

  1. Deterministic: Given the same inputs, it always produces the same output
  2. Pseudo-random: The distribution appears random but is mathematically controlled
  3. Decentralized: No central coordinator needed
  4. Scalable: Works efficiently with thousands of nodes
  5. Adaptive: Handles node failures and additions gracefully
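
To make properties 1 and 2 concrete, here is a minimal rendezvous-hashing sketch in Python. It is an illustration of the idea only, not the Hippius implementation: every candidate is ranked by a deterministic hash of (file ID, candidate ID), so the same file always lands on the same nodes, yet placement across many files looks uniformly random.

```python
import hashlib

def select_families(file_id: str, families: list[str], replicas: int = 5) -> list[str]:
    """Rank every family by a deterministic hash of (file_id, family_id)
    and keep the top `replicas` scorers (rendezvous hashing)."""
    def score(family_id: str) -> int:
        digest = hashlib.sha256(f"{file_id}:{family_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(families, key=score, reverse=True)[:replicas]

families = [f"family-{i}" for i in range(4000)]
# Deterministic: the same file always maps to the same 5 families,
# with no coordinator and no lookup table.
assert select_families("QmExampleCID", families) == select_families("QmExampleCID", families)
```

When a node joins or leaves, only the files for which that node ranks highest move, which is why this family of algorithms rebalances gracefully.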

Our Implementation: CRUSH for Hippius

We adapted CRUSH for our decentralized storage network with several key enhancements:

Core Components

```mermaid
graph TB
    A[File Upload] --> B[CRUSH Algorithm]
    B --> C{Select Miner Families}
    C --> D[Capacity Check]
    C --> E[Performance Score]
    C --> F[Fairness Penalty]
    D --> G[Final Selection]
    E --> G
    F --> G
    G --> H[5 Miner Families]
    H --> I[File Replicated]

    style B fill:#4A90E2
    style G fill:#50C878
    style I fill:#FFB84D
```

The Algorithm Flow

```mermaid
sequenceDiagram
    participant User
    participant Scheduler
    participant CRUSH
    participant Database
    participant Miners

    User->>Scheduler: Upload File
    Scheduler->>Database: Query Active Families
    Database-->>Scheduler: Return 4000+ Families
    Scheduler->>CRUSH: Select Families(file_id, replicas=5)

    loop For Each Candidate Family
        CRUSH->>CRUSH: Calculate Hash Score
        CRUSH->>CRUSH: Apply Capacity Bias
        CRUSH->>CRUSH: Apply Performance Bias
        CRUSH->>CRUSH: Apply Fairness Penalty
    end

    CRUSH-->>Scheduler: Return 5 Best Families
    Scheduler->>Miners: Assign Replicas
    Miners-->>Scheduler: Confirm Storage
    Scheduler->>User: Upload Complete
```

Key Features

1. Multi-Factor Scoring

Each miner family is scored based on multiple factors:

```mermaid
graph LR
    A[Miner Family] --> B[Base Hash Score]
    B --> C[× Capacity Headroom]
    B --> D[× Performance Metrics]
    B --> E[× Fairness Penalty]
    C --> F[Final Score]
    D --> F
    E --> F

    style F fill:#50C878,stroke:#2E8B57,stroke-width:3px
```

Base Hash Score: A deterministic hash combining file ID and family ID ensures consistent replica placement.

Capacity Headroom: Miners with more available space get a boost:

```python
capacity_boost = available_space / total_capacity
```

Performance Metrics: We factor in:

  • Provider success ratio (file retrieval reliability)
  • Ping reliability (network stability)
  • Uptime percentage
  • Historical fault count

Fairness Penalty: Crucially, we penalize families that already hold too many files:

```python
if current_files > fair_share * 5:
    exclude_from_selection()          # hard exclusion
elif current_files > fair_share * 3:
    penalty = 0.7                     # 30% score reduction
```
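
Putting the four factors together, the selection step can be sketched roughly as follows. The multiplicative combination mirrors the diagram above; the `Family` type and its field names are illustrative assumptions, not our production schema:

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Family:                 # field names are assumptions for illustration
    id: str
    available_space: float
    total_capacity: float
    success_ratio: float      # 0..1, file retrieval reliability
    ping_reliability: float   # 0..1, network stability
    uptime: float             # 0..1
    current_files: int

def family_score(file_id: str, fam: Family, fair_share: float) -> float:
    # Deterministic base score in [0, 1): the same (file, family)
    # pair always hashes to the same value.
    digest = hashlib.sha256(f"{file_id}:{fam.id}".encode()).digest()
    base = int.from_bytes(digest[:8], "big") / 2**64

    # Hard exclusion above 5x fair share, 30% penalty above 3x.
    if fam.current_files > fair_share * 5:
        return 0.0
    penalty = 0.7 if fam.current_files > fair_share * 3 else 1.0

    capacity_boost = fam.available_space / fam.total_capacity
    performance = fam.success_ratio * fam.ping_reliability * fam.uptime
    return base * capacity_boost * performance * penalty
```

The scheduler evaluates a score like this for every active family and keeps the five highest.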

2. Hard Capacity Limits

We enforce two types of capacity limits:

  • Soft Cap: Progressive capacity trust based on performance history
  • Hard Cap: Respects the ipfs_storage_max declared by each miner

```mermaid
graph TD
    A[New Miner Joins] --> B[Initial Soft Cap: 100 GB]
    B --> C{Proves Reliability?}
    C -->|Yes| D[Increase Soft Cap]
    C -->|No| E[Keep Current Cap]
    D --> F{Reaches Hard Cap?}
    F -->|Yes| G[Stop Assignments]
    F -->|No| H[Continue Growth]
    E --> I[Periodic Re-evaluation]
    I --> C

    style G fill:#FF6B6B
    style H fill:#50C878
```
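
As a rough sketch of the progressive-trust idea (the doubling schedule is an assumption here; only the 100 GB starting point and the declared hard cap come from the design above):

```python
def effective_cap(soft_cap_gb: float, hard_cap_gb: float,
                  reliable_cycles: int) -> float:
    """Each proven-reliable evaluation cycle grows the soft cap
    (doubling is an assumed growth rule) until it reaches the
    miner-declared hard cap (ipfs_storage_max)."""
    cap = soft_cap_gb * (2 ** reliable_cycles)
    return min(cap, hard_cap_gb)

# A new miner starts at 100 GB and grows only as it proves itself:
assert effective_cap(100, 2000, 0) == 100
assert effective_cap(100, 2000, 3) == 800
assert effective_cap(100, 2000, 10) == 2000   # clamped at the hard cap
```

The key invariant is the final min(): no matter how trusted a miner becomes, assignments never exceed the capacity it declared.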

3. Bootstrap Allocation

New miners face a “chicken-and-egg” problem: they need files to prove their reliability, but they can’t get files without a good reliability score.

Our solution: Bootstrap allocation allows new miners with fewer than 100 files to receive assignments even with a 0% success ratio.

```python
if miner.total_files < 100:
    allow_bootstrap_allocation()
else:
    require_minimum_success_ratio(0.60)   # 60% minimum after bootstrap
```

4. Dynamic Rebalancing

The network self-heals and rebalances automatically:

```mermaid
graph TD
    A[Monitor Loop] --> B{Check Distribution}
    B -->|Balanced| C[Continue]
    B -->|Imbalanced| D[Identify Dominant Families]
    D --> E[Apply Stronger Fairness Penalty]
    E --> F[Scheduler Selects Alternatives]
    F --> G{Within Fair Share?}
    G -->|No| H[Remove Excess Replicas]
    G -->|Yes| I[Normal Operation]
    H --> F
    I --> A
    C --> A

    style D fill:#FFB84D
    style E fill:#4A90E2
    style I fill:#50C878
```
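
A single pass of the monitor loop reduces to a simple over-cap check. This sketch only reports the excess; the real scheduler then re-places those replicas through normal CRUSH selection:

```python
def find_excess(family_counts: dict[str, int], fair_share: float,
                hard_multiplier: int = 5) -> dict[str, int]:
    """Report how many replicas each family holds beyond the hard cap
    (illustrative; names are assumptions)."""
    max_allowed = fair_share * hard_multiplier
    return {fam: int(count - max_allowed)
            for fam, count in family_counts.items()
            if count > max_allowed}

# With fair_share = 120, a family holding 950 replicas must shed 350:
assert find_excess({"family-1": 950, "family-2": 80}, 120) == {"family-1": 350}
```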

The Fair Share Formula

The concept of “fair share” is central to our system. It ensures that no family dominates the network:

```
total_replicas = total_files × 5           # 5 replicas per file
total_families = number of active miner families
fair_share     = total_replicas / total_families
max_allowed    = fair_share × 5            # the 5x fair-share multiplier
```

For example, with 96,000 files and 4,000 families:

  • Total replicas: 96,000 × 5 = 480,000
  • Fair share per family: 480,000 / 4,000 = 120 replicas
  • Maximum allowed: 120 × 5 = 600 replicas per family

Families exceeding this cap are hard-excluded from receiving new assignments until they are back within limits.
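
In code, the whole formula is a few lines (a direct transcription of the arithmetic above):

```python
REPLICAS_PER_FILE = 5
FAIR_SHARE_MULTIPLIER = 5

def max_allowed_replicas(total_files: int, total_families: int) -> float:
    """Per-family cap derived from the fair-share formula."""
    total_replicas = total_files * REPLICAS_PER_FILE
    fair_share = total_replicas / total_families
    return fair_share * FAIR_SHARE_MULTIPLIER

# The worked example: 96,000 files across 4,000 families -> 600 replicas max.
assert max_allowed_replicas(96_000, 4_000) == 600.0
```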


Ranking and Performance Integration

We integrate external ranking data to further refine selection:

```mermaid
graph LR
    A[External Ranking API] --> B[Fetch Rankings]
    B --> C[Cache for 1 Minute]
    C --> D[Apply to CRUSH Scores]
    D --> E{Is Top Performer?}
    E -->|Yes, but overloaded| F[Apply Fairness Penalty]
    E -->|Yes, balanced| G[Boost Selection Chance]
    E -->|No| H[Standard Selection]
    F --> I[Select Alternative]
    G --> J[Assign File]
    H --> J

    style F fill:#FF6B6B
    style G fill:#50C878
```

Rankings are refreshed frequently (every 1 minute) to ensure the system reacts quickly to changes in miner dominance.
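
A minimal sketch of that refresh logic, where `fetch_from_api` stands in for the external ranking endpoint (not part of this sketch):

```python
import time
from typing import Callable, Optional

RANKING_TTL_SECONDS = 60.0   # matches the 1-minute refresh

_last_fetch: float = 0.0
_rankings: Optional[dict] = None

def get_rankings(fetch_from_api: Callable[[], dict]) -> dict:
    """Return rankings, hitting the external API at most once per minute."""
    global _last_fetch, _rankings
    now = time.monotonic()
    if _rankings is None or now - _last_fetch >= RANKING_TTL_SECONDS:
        _rankings = fetch_from_api()
        _last_fetch = now
    return _rankings
```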


Real-World Results

After implementing CRUSH with fairness controls:

Before CRUSH

  • ❌ Top 5 families: 54,000+ replicas (11% of total)
  • ❌ Small families: < 10 replicas (nearly invisible)
  • ❌ Imbalance ratio: ~5000:1

After CRUSH + Rebalancing

  • ✅ Top 5 families: ~14,000 replicas (2.9% of total), a 73% reduction
  • ✅ Fair share cap: ~600 replicas per family
  • ✅ Imbalance ratio: ~23:1 (continuously improving)
  • ✅ Network growth: new miners can compete fairly

```mermaid
graph TD
    A[Network State] --> B{Distribution}
    B --> C[Top 5 Families: 2.9%]
    B --> D[Mid-tier: 40%]
    B --> E[Small miners: 57%]

    C --> F[Within Fair Share]
    D --> G[Healthy Growth]
    E --> H[Bootstrap Allocation]

    style F fill:#50C878
    style G fill:#4A90E2
    style H fill:#FFB84D
```

Automatic Rebalancing in Action

When we first deployed CRUSH, some families had already accumulated a disproportionate number of replicas from the previous system. CRUSH’s automatic rebalancing mechanisms kicked in immediately:

  1. Detection: The fairness penalty system identified families exceeding their fair share
  2. Hard Exclusion: Families above the 5x fair share cap were automatically excluded from new assignments
  3. Gradual Redistribution: As files needed replication or rebalancing, CRUSH naturally selected underutilized families
  4. Continuous Monitoring: The 1-minute ranking refresh ensures rapid response to any emerging imbalances

Over time, the system naturally redistributed 42,000+ replicas from dominant families to smaller ones, all without manual intervention. The CRUSH algorithm continuously maintains balance through proactive fairness penalties and intelligent selection.


Technical Challenges We Solved

Challenge 1: Database Deadlocks

When removing thousands of replicas, we encountered deadlocks due to concurrent trigger executions.

Solution: We batched deletions at a smaller size (5,000 instead of 10,000) and introduced pauses between batches.
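
A sketch of the batched-deletion pattern, assuming a psycopg2-style connection and an illustrative `replicas` table; the 0.5 s pause is a placeholder value:

```python
import time

BATCH_SIZE = 5_000       # smaller batches shorten each transaction
PAUSE_SECONDS = 0.5      # breathing room between batches (value assumed)

def delete_replicas_in_batches(conn, replica_ids: list) -> None:
    """Delete replicas in short transactions with pauses so concurrent
    trigger executions stop deadlocking."""
    cur = conn.cursor()
    for start in range(0, len(replica_ids), BATCH_SIZE):
        batch = replica_ids[start:start + BATCH_SIZE]
        cur.execute("DELETE FROM replicas WHERE id = ANY(%s)", (batch,))
        conn.commit()                     # one short transaction per batch
        time.sleep(PAUSE_SECONDS)
```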

Challenge 2: Stale Ranking Data

Rankings cached for 5 minutes meant the system was slow to react to dominant families.

Solution: Reduced cache duration to 1 minute for faster reaction times.

Challenge 3: Bootstrap Paradox

New miners couldn’t get files because they had no performance history.

Solution: Implemented bootstrap allocation allowing up to 100 files without performance requirements.

Challenge 4: Inactive Miner Cleanup

Inactive miners were keeping replicas, inflating metrics.

Solution: Automatically delete replicas when miners go inactive, triggering immediate re-scheduling.
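
Sketched below with an assumed `replicas` table and an assumed `scheduler.ensure_replication` hook; PostgreSQL's RETURNING clause lets one statement both delete and report which files need re-scheduling:

```python
def on_miner_inactive(conn, scheduler, miner_id: str) -> None:
    """Drop an inactive miner's replicas and re-schedule the affected
    files (table and scheduler API are illustrative assumptions)."""
    cur = conn.cursor()
    cur.execute(
        "DELETE FROM replicas WHERE miner_id = %s RETURNING file_id",
        (miner_id,),
    )
    affected_files = {row[0] for row in cur.fetchall()}
    conn.commit()
    for file_id in affected_files:
        scheduler.ensure_replication(file_id)   # triggers CRUSH re-selection
```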


Why This Matters

Building a truly decentralized storage network isn't just about storing files; it's about creating a fair, sustainable ecosystem where:

  1. Small miners can compete with large operators
  2. Performance is rewarded but not at the expense of fairness
  3. The network self-heals automatically when miners join or leave
  4. Capacity grows organically based on proven reliability

CRUSH gives us all of this while maintaining the deterministic guarantees needed for a production storage system.


What’s Next?

We’re continuously improving the system:

  • 🌍 Geo-diversity: Preferring miners in different geographic regions
  • 🎯 Performance tiers: Different rules for hot vs. cold storage
  • 📊 Predictive scaling: Using ML to predict capacity needs
  • 🔐 Privacy-aware placement: Keeping replicas in specific jurisdictions

Conclusion

CRUSH isn't just an algorithm; it's a philosophy of fairness through mathematics. By combining deterministic hashing with adaptive penalties and performance metrics, we've built a system that:

  • Scales to thousands of nodes
  • Remains fair under load
  • Adapts to changing network conditions
  • Requires no central coordination

If you’re building a decentralized system that needs fair resource distribution, CRUSH might be the answer you’re looking for.


Technical Specifications Summary

| Metric | Value |
|---|---|
| Total Files | 96,000+ |
| Miner Families | 4,000+ |
| Replicas per File | 5 |
| Fair Share Multiplier | 5x |
| Ranking Cache Duration | 60 seconds |
| Bootstrap File Limit | 100 files |
| Minimum Success Ratio | 60% (after bootstrap) |
| Rebalancing Frequency | Continuous (1-minute refresh) |

Built with ❤️ by the Hippius team
