Building a Fair and Scalable File Distribution System with CRUSH
Introduction
When building a decentralized storage network with thousands of miners, one of the biggest challenges is this: how do you distribute files fairly across the network while maintaining reliability and performance, and preventing any single miner from dominating the system?
This is the problem we solved by implementing a CRUSH-inspired algorithm (Controlled Replication Under Scalable Hashing) for Hippius. In this article, we’ll walk you through how it works, why we built it, and the challenges we overcame.
The Problem: Fair Distribution at Scale
Imagine you have:
- 96,000+ files that need to be stored
- 4,000+ miner families ready to store them
- 5 replicas per file for redundancy
- A requirement for fair distribution so no miner gets overwhelmed
Traditional approaches like random selection or round-robin quickly break down:
- Random selection can lead to hot spots where some miners get overloaded
- Round-robin doesn’t account for miner capacity, performance, or reliability
- Simple hashing creates imbalance when miners join or leave
We needed something smarter.
What is CRUSH?
CRUSH (Controlled Replication Under Scalable Hashing) is a pseudo-random data distribution algorithm originally developed for Ceph, a distributed storage system. The key innovation is that CRUSH is:
- Deterministic: Given the same inputs, it always produces the same output
- Pseudo-random: The distribution appears random but is mathematically controlled
- Decentralized: No central coordinator needed
- Scalable: Works efficiently with thousands of nodes
- Adaptive: Handles node failures and additions gracefully
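To make the determinism concrete, here is a minimal sketch (not Hippius code) of deterministic, pseudo-random placement in the style of CRUSH's straw selection, also known as highest-random-weight hashing: the same file and node list always yield the same replica set, with no coordinator involved.

```python
import hashlib

def placement(file_id: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` nodes deterministically but pseudo-randomly for a file."""
    def straw(node: str) -> int:
        # Deterministic per (file, node) draw: hash the pair and read 8 bytes.
        digest = hashlib.sha256(f"{file_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The nodes with the longest "straws" win the placement.
    return sorted(nodes, key=straw, reverse=True)[:replicas]

nodes = [f"node-{i}" for i in range(10)]
print(placement("file-abc", nodes))  # identical output on every run
print(placement("file-xyz", nodes))  # a different, but equally stable, set
```

Because each file's draw is independent, adding or removing a node only moves the replicas that node would have won, which is what makes this style of hashing adaptive.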
Our Implementation: CRUSH for Hippius
We adapted CRUSH for our decentralized storage network with several key enhancements:
Core Components
```mermaid
graph TB
    A[File Upload] --> B[CRUSH Algorithm]
    B --> C{Select Miner Families}
    C --> D[Capacity Check]
    C --> E[Performance Score]
    C --> F[Fairness Penalty]
    D --> G[Final Selection]
    E --> G
    F --> G
    G --> H[5 Miner Families]
    H --> I[File Replicated]
    style B fill:#4A90E2
    style G fill:#50C878
    style I fill:#FFB84D
```
The Algorithm Flow
```mermaid
sequenceDiagram
    participant User
    participant Scheduler
    participant CRUSH
    participant Database
    participant Miners
    User->>Scheduler: Upload File
    Scheduler->>Database: Query Active Families
    Database-->>Scheduler: Return 4000+ Families
    Scheduler->>CRUSH: Select Families(file_id, replicas=5)
    loop For Each Candidate Family
        CRUSH->>CRUSH: Calculate Hash Score
        CRUSH->>CRUSH: Apply Capacity Bias
        CRUSH->>CRUSH: Apply Performance Bias
        CRUSH->>CRUSH: Apply Fairness Penalty
    end
    CRUSH-->>Scheduler: Return 5 Best Families
    Scheduler->>Miners: Assign Replicas
    Miners-->>Scheduler: Confirm Storage
    Scheduler->>User: Upload Complete
```
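In code, the loop in the diagram looks roughly like the sketch below. The data model and field names (`Family`, `capacity_bias`, `performance_bias`, `fairness_penalty`) are illustrative assumptions, not the actual Hippius scheduler API; only the structure mirrors the steps above.

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Family:
    id: str
    capacity_bias: float     # e.g. available_space / total_capacity
    performance_bias: float  # derived from success ratio, ping, uptime, faults
    fairness_penalty: float  # 1.0 normally, <1.0 when overloaded, 0.0 when excluded

def hash_score(file_id: str, family_id: str) -> float:
    # Deterministic base score in [0, 1) for this (file, family) pair.
    digest = hashlib.sha256(f"{file_id}:{family_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def select_families(file_id: str, families: list[Family], replicas: int = 5) -> list[Family]:
    scored = []
    for family in families:
        score = (hash_score(file_id, family.id)
                 * family.capacity_bias
                 * family.performance_bias
                 * family.fairness_penalty)
        if score > 0:                      # hard-excluded families score 0 and drop out
            scored.append((score, family))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [family for _, family in scored[:replicas]]
```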
Key Features
1. Multi-Factor Scoring
Each miner family is scored based on multiple factors:
```mermaid
graph LR
    A[Miner Family] --> B[Base Hash Score]
    B --> C[× Capacity Headroom]
    B --> D[× Performance Metrics]
    B --> E[× Fairness Penalty]
    C --> F[Final Score]
    D --> F
    E --> F
    style F fill:#50C878,stroke:#2E8B57,stroke-width:3px
```
Base Hash Score: A deterministic hash combining file ID and family ID ensures consistent replica placement.
Capacity Headroom: Miners with more available space get a boost:
```
capacity_boost = available_space / total_capacity
```
Performance Metrics: We factor in:
- Provider success ratio (file retrieval reliability)
- Ping reliability (network stability)
- Uptime percentage
- Historical fault count
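One plausible way to fold these factors into a single multiplier is sketched below; the weights and the fault damping are assumptions for illustration, not the tuned values used in production.

```python
def performance_bias(success_ratio: float, ping_reliability: float,
                     uptime: float, fault_count: int) -> float:
    """Combine the performance metrics into one multiplier in (0, 1]."""
    fault_damping = 1.0 / (1.0 + fault_count)   # more historical faults -> lower bias
    weighted = 0.4 * success_ratio + 0.3 * ping_reliability + 0.3 * uptime
    return weighted * fault_damping
```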
Fairness Penalty: Crucially, we penalize families that already hold too many files:
```
if current_files > fair_share * 5:
    exclude_from_selection()
elif current_files > fair_share * 3:
    penalty = 0.7  # 30% reduction
```
2. Hard Capacity Limits
We enforce two types of capacity limits:
- Soft Cap: Progressive capacity trust based on performance history
- Hard Cap: Respects the `ipfs_storage_max` declared by each miner
```mermaid
graph TD
    A[New Miner Joins] --> B[Initial Soft Cap: 100 GB]
    B --> C{Proves Reliability?}
    C -->|Yes| D[Increase Soft Cap]
    C -->|No| E[Keep Current Cap]
    D --> F{Reaches Hard Cap?}
    F -->|Yes| G[Stop Assignments]
    F -->|No| H[Continue Growth]
    E --> I[Periodic Re-evaluation]
    I --> C
    style G fill:#FF6B6B
    style H fill:#50C878
```
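A sketch of how the two caps interact is below. The 100 GB starting point and the `ipfs_storage_max` hard cap come from the article; the doubling rule for growing the soft cap is an assumption for illustration.

```python
GB = 1024 ** 3

def effective_cap(soft_cap: int, ipfs_storage_max: int) -> int:
    # The hard cap declared by the miner always wins, however much trust was earned.
    return min(soft_cap, ipfs_storage_max)

def grow_soft_cap(soft_cap: int, proved_reliable: bool, ipfs_storage_max: int) -> int:
    if not proved_reliable:
        return soft_cap                            # re-evaluated on the next cycle
    return min(soft_cap * 2, ipfs_storage_max)     # illustrative growth rule

soft_cap = 100 * GB                                # initial soft cap for a new miner
soft_cap = grow_soft_cap(soft_cap, proved_reliable=True, ipfs_storage_max=2_000 * GB)
print(effective_cap(soft_cap, ipfs_storage_max=2_000 * GB) // GB)  # 200
```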
3. Bootstrap Allocation
New miners face a “chicken-and-egg” problem: they need files to prove their reliability, but they can’t get files without a good reliability score.
Our solution: Bootstrap allocation allows new miners with fewer than 100 files to receive assignments even with a 0% success ratio.
```
if miner.total_files < 100:
    allow_bootstrap_allocation()
else:
    require_minimum_success_ratio(0.60)  # 60%
```
4. Dynamic Rebalancing
The network self-heals and rebalances automatically:
```mermaid
graph TD
    A[Monitor Loop] --> B{Check Distribution}
    B -->|Balanced| C[Continue]
    B -->|Imbalanced| D[Identify Dominant Families]
    D --> E[Apply Stronger Fairness Penalty]
    E --> F[Scheduler Selects Alternatives]
    F --> G{Within Fair Share?}
    G -->|No| H[Remove Excess Replicas]
    G -->|Yes| I[Normal Operation]
    H --> F
    I --> A
    C --> A
    style D fill:#FFB84D
    style E fill:#4A90E2
    style I fill:#50C878
```
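The detection step of that loop can be expressed compactly; the data model here (a dict of family id to replica count) is an illustrative placeholder for the scheduler's internal state.

```python
def find_overloaded(replica_counts: dict[str, int], fair_share: float) -> dict[str, int]:
    """Return, for each dominant family, how many replicas exceed the 5x fair-share cap."""
    max_allowed = fair_share * 5
    return {
        family_id: count - int(max_allowed)
        for family_id, count in replica_counts.items()
        if count > max_allowed
    }

# With a fair share of 120, a family holding 900 replicas is 300 over the cap.
print(find_overloaded({"fam-a": 900, "fam-b": 150}, fair_share=120))  # {'fam-a': 300}
```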
The Fair Share Formula
The concept of “fair share” is central to our system. It ensures that no family dominates the network:
```
total_replicas = total_files × 5              # 5 replicas per file
total_families = count of active miner families
fair_share     = total_replicas / total_families
max_allowed    = fair_share × 5               # 5x multiplier, matching the 5 replicas per file
```
For example, with 96,000 files and 4,000 families:
- Total replicas: 96,000 × 5 = 480,000
- Fair share per family: 480,000 / 4,000 = 120 replicas
- Maximum allowed: 120 × 5 = 600 replicas per family
Families exceeding this cap are hard-excluded from receiving new assignments until they are back within limits.
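The arithmetic fits in a few lines; this hypothetical helper simply reproduces the worked example above.

```python
def fair_share_cap(total_files: int, total_families: int,
                   replicas_per_file: int = 5, cap_multiplier: int = 5) -> tuple[float, float]:
    total_replicas = total_files * replicas_per_file
    fair_share = total_replicas / total_families
    return fair_share, fair_share * cap_multiplier   # (fair share, hard-exclusion cap)

fair_share, max_allowed = fair_share_cap(96_000, 4_000)
print(fair_share, max_allowed)  # 120.0 600.0
```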
Ranking and Performance Integration
We integrate external ranking data to further refine selection:
```mermaid
graph LR
    A[External Ranking API] --> B[Fetch Rankings]
    B --> C[Cache for 1 Minute]
    C --> D[Apply to CRUSH Scores]
    D --> E{Is Top Performer?}
    E -->|Yes, but overloaded| F[Apply Fairness Penalty]
    E -->|Yes, balanced| G[Boost Selection Chance]
    E -->|No| H[Standard Selection]
    F --> I[Select Alternative]
    G --> J[Assign File]
    H --> J
    style F fill:#FF6B6B
    style G fill:#50C878
```
Rankings are refreshed every minute, so the system reacts quickly to changes in miner dominance.
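A minimal sketch of such a cache is below; the endpoint URL, the response shape, and the helper names are placeholders rather than the real ranking API.

```python
import json
import time
import urllib.request

CACHE_TTL_SECONDS = 60
_cache = {"fetched_at": 0.0, "rankings": None}

def get_rankings(url: str = "https://example.invalid/rankings") -> dict:
    now = time.monotonic()
    if _cache["rankings"] is not None and now - _cache["fetched_at"] < CACHE_TTL_SECONDS:
        return _cache["rankings"]              # serve the cached copy within the TTL
    with urllib.request.urlopen(url, timeout=5) as response:
        _cache["rankings"] = json.load(response)
    _cache["fetched_at"] = now
    return _cache["rankings"]
```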
Real-World Results
After implementing CRUSH with fairness controls:
Before CRUSH
- Top 5 families: 54,000+ replicas (11% of total)
- Small families: < 10 replicas (nearly invisible)
- Imbalance ratio: ~5000:1
After CRUSH + Rebalancing
- Top families: ~14,000 replicas (2.9% of total), a 73% reduction
- Fair share cap: ~600 replicas per family
- Imbalance ratio: ~23:1 (continuously improving)
- Network growth: New miners can compete fairly
```mermaid
graph TD
    A[Network State] --> B{Distribution}
    B --> C[Top 5 Families: 2.9%]
    B --> D[Mid-tier: 40%]
    B --> E[Small miners: 57%]
    C --> F[Within Fair Share]
    D --> G[Healthy Growth]
    E --> H[Bootstrap Allocation]
    style F fill:#50C878
    style G fill:#4A90E2
    style H fill:#FFB84D
```
Automatic Rebalancing in Action
When we first deployed CRUSH, some families had already accumulated a disproportionate number of replicas from the previous system. CRUSH’s automatic rebalancing mechanisms kicked in immediately:
- Detection: The fairness penalty system identified families exceeding their fair share
- Hard Exclusion: Families above the 5x fair share cap were automatically excluded from new assignments
- Gradual Redistribution: As files needed replication or rebalancing, CRUSH naturally selected underutilized families
- Continuous Monitoring: The 1-minute ranking refresh ensures rapid response to any emerging imbalances
Over time, the system naturally redistributed 42,000+ replicas from dominant families to smaller ones, all without manual intervention. The CRUSH algorithm continuously maintains balance through proactive fairness penalties and intelligent selection.
Technical Challenges We Solved
Challenge 1: Database Deadlocks
When removing thousands of replicas, we encountered deadlocks due to concurrent trigger executions.
Solution: Use smaller deletion batches (5,000 instead of 10,000) and introduce pauses between batches.
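A sketch of the batched-delete pattern is below; SQLite stands in for the real database, the table and column names are placeholders, and the pause length is an assumption (only the 5,000-row batch size comes from the text).

```python
import sqlite3
import time

BATCH_SIZE = 5_000        # reduced from 10,000 to shorten each transaction
PAUSE_SECONDS = 0.5       # assumed pause; gives concurrent triggers room to finish

def delete_replicas_in_batches(conn: sqlite3.Connection, family_id: str) -> None:
    while True:
        cursor = conn.execute(
            "DELETE FROM replicas WHERE id IN ("
            " SELECT id FROM replicas WHERE family_id = ? LIMIT ?)",
            (family_id, BATCH_SIZE),
        )
        conn.commit()
        if cursor.rowcount < BATCH_SIZE:
            break                              # last (partial) batch handled
        time.sleep(PAUSE_SECONDS)              # breathing room between batches
```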
Challenge 2: Stale Ranking Data
Rankings cached for 5 minutes meant the system was slow to react to dominant families.
Solution: Reduced cache duration to 1 minute for faster reaction times.
Challenge 3: Bootstrap Paradox
New miners couldn’t get files because they had no performance history.
Solution: Implemented bootstrap allocation allowing up to 100 files without performance requirements.
Challenge 4: Inactive Miner Cleanup
Inactive miners were keeping replicas, inflating metrics.
Solution: Automatically delete replicas when miners go inactive, triggering immediate re-scheduling.
Why This Matters
Building a truly decentralized storage network isn’t just about storing files - it’s about creating a fair, sustainable ecosystem where:
- Small miners can compete with large operators
- Performance is rewarded but not at the expense of fairness
- The network self-heals automatically when miners join or leave
- Capacity grows organically based on proven reliability
CRUSH gives us all of this while maintaining the deterministic guarantees needed for a production storage system.
What’s Next?
We’re continuously improving the system:
- Geo-diversity: Preferring miners in different geographic regions
- Performance tiers: Different rules for hot vs. cold storage
- Predictive scaling: Using ML to predict capacity needs
- Privacy-aware placement: Keeping replicas in specific jurisdictions
Conclusion
CRUSH isn’t just an algorithm - it’s a philosophy of fairness through mathematics. By combining deterministic hashing with adaptive penalties and performance metrics, we’ve built a system that:
- Scales to thousands of nodes
- Remains fair under load
- Adapts to changing network conditions
- Requires no central coordination
If you’re building a decentralized system that needs fair resource distribution, CRUSH might be the answer you’re looking for.
Technical Specifications Summary
| Metric | Value |
|---|---|
| Total Files | 96,000+ |
| Miner Families | 4,000+ |
| Replicas per File | 5 |
| Fair Share Multiplier | 5x |
| Ranking Cache Duration | 60 seconds |
| Bootstrap File Limit | 100 files |
| Minimum Success Ratio | 60% (after bootstrap) |
| Rebalancing Frequency | Continuous (1-minute refresh) |
Built with ❤️ by the Hippius team