Prevalence Settings
Prevalence tracking helps identify rare and potentially suspicious artifacts in your environment by monitoring how frequently file hashes, domains, and IP addresses appear across different hosts. This page explains how to configure prevalence tracking settings to optimize detection accuracy and system performance.
Overview
Prevalence tracking works by:
- Collecting artifacts from ingested logs (file hashes, domains, IP addresses)
- Counting occurrences across unique hosts over time
- Flagging rare artifacts that appear on fewer hosts than your configured threshold
- Providing context for security analysts during investigations
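The steps above can be sketched in a few lines of Python. This is an illustrative model, not the product's implementation; the event tuples and the threshold value are hypothetical.

```python
from collections import defaultdict

# Hypothetical log records: (host, artifact_type, artifact_value)
events = [
    ("host-a", "hash", "abc123"),
    ("host-b", "hash", "abc123"),
    ("host-a", "domain", "example.com"),
    ("host-b", "domain", "example.com"),
    ("host-c", "domain", "example.com"),
]

RARITY_THRESHOLD = 3  # flag artifacts seen on fewer than 3 unique hosts

# Count unique hosts per artifact
hosts_seen = defaultdict(set)
for host, kind, value in events:
    hosts_seen[(kind, value)].add(host)

# Flag rare artifacts: the hash appears on 2 hosts, the domain on 3
rare = {a for a, hosts in hosts_seen.items() if len(hosts) < RARITY_THRESHOLD}
print(rare)  # {('hash', 'abc123')}
```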
Configuration Options
Rarity Threshold
The rarity threshold determines when an artifact is flagged as "rare" based on the number of unique hosts where it has been observed.
Setting: Host count threshold (1-1000)
Default: 3 hosts
Recommendation:
- Small environments (< 100 hosts): 2-3 hosts
- Medium environments (100-1000 hosts): 3-5 hosts
- Large environments (> 1000 hosts): 5-10 hosts
Example: With a threshold of 3, a file hash seen on only 2 hosts is flagged as rare, while one seen on 3 or more hosts is considered common.
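The boundary behavior follows from "fewer hosts than your configured threshold": counts below the threshold are rare, counts at or above it are common. A minimal check, assuming that semantics:

```python
def is_rare(host_count: int, threshold: int = 3) -> bool:
    """An artifact is rare when observed on fewer hosts than the threshold."""
    return host_count < threshold

print(is_rare(2))               # True: below the threshold, flagged as rare
print(is_rare(3))               # False: meets the threshold, considered common
print(is_rare(4, threshold=5))  # True: thresholds scale with environment size
```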
Artifact Tracking Options
Control which types of artifacts are tracked for prevalence analysis.
Hash Tracking
Purpose: Track MD5 and SHA256 file hashes
Default: Enabled
Use cases:
- Identify rare executables and scripts
- Detect unique malware samples
- Monitor file distribution patterns
When to disable: If you have high-volume file activity and want to reduce storage overhead.
Domain Tracking
Purpose: Track domains and subdomains from network logs
Default: Enabled
Use cases:
- Identify suspicious or newly registered domains
- Monitor DNS tunneling attempts
- Track command and control infrastructure
When to disable: If you have extensive web browsing logs and want to focus on other artifact types.
IP Address Tracking
Purpose: Track source and destination IP addresses
Default: Enabled
Use cases:
- Identify connections to rare external IPs
- Monitor internal lateral movement patterns
- Detect communication with suspicious infrastructure
When to disable: If you have high-volume network logs or want to focus on application-layer artifacts.
Data Retention
Controls how long prevalence data is stored in the system.
Setting: Retention period (1-365 days)
Default: 90 days
Considerations:
- Longer retention: Better baseline for identifying truly rare artifacts
- Shorter retention: Reduced storage requirements, faster queries
- Recommended: 30-90 days for most environments
Storage impact: Each day of retention stores approximately:
- Hash data: ~1MB per 10,000 unique hashes
- Domain data: ~500KB per 10,000 unique domains
- IP data: ~2MB per 10,000 unique IPs
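A quick sizing calculation using the per-day figures above. The artifact counts here are hypothetical; substitute your own to estimate disk usage over a retention window.

```python
# Illustrative storage estimate; counts are hypothetical.
unique_hashes = 50_000   # ~1 MB per 10,000 hashes
unique_domains = 20_000  # ~0.5 MB per 10,000 domains
unique_ips = 10_000      # ~2 MB per 10,000 IPs
retention_days = 90

daily_mb = (unique_hashes / 10_000) * 1.0 \
         + (unique_domains / 10_000) * 0.5 \
         + (unique_ips / 10_000) * 2.0
total_mb = daily_mb * retention_days

print(daily_mb)  # 8.0 MB per day
print(total_mb)  # 720.0 MB over the retention window
```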
Cache Settings
Controls how long prevalence query results are cached to improve performance.
Setting: Cache TTL (0-3600 seconds)
Default: 60 seconds
Options:
- 0 seconds: Disable caching (always query fresh data)
- 60 seconds: Good balance of performance and freshness
- 300+ seconds: Better performance for high-query environments
When to adjust:
- Increase TTL: High query volume, acceptable data staleness
- Decrease TTL: Need real-time prevalence data, low query volume
- Disable caching: Critical real-time analysis requirements
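The TTL semantics described above can be sketched as a minimal in-memory cache. This is an assumption-laden illustration (class name, structure, and the TTL=0 behavior are inferred from the settings description), not the product's internals.

```python
import time

class PrevalenceCache:
    """Minimal TTL cache sketch; a TTL of 0 disables caching entirely."""

    def __init__(self, ttl_seconds: float = 60):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            return None  # missing or expired: caller queries fresh data
        return entry[0]

    def put(self, key, value):
        if self.ttl <= 0:
            return  # TTL of 0: caching disabled, nothing is stored
        self._entries[key] = (value, time.monotonic() + self.ttl)

cache = PrevalenceCache(ttl_seconds=60)
cache.put("hash:abc123", 2)
print(cache.get("hash:abc123"))  # 2 (served from cache until the TTL expires)
```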
Performance Considerations
Query Performance
Prevalence queries can impact system performance, especially with large datasets:
Optimization strategies:
- Enable selective tracking: Disable artifact types you don't need
- Adjust retention: Shorter retention = faster queries
- Use appropriate caching: Balance freshness vs. performance
- Monitor query patterns: Identify and optimize frequent queries
Storage Requirements
Prevalence data storage grows with:
- Number of unique artifacts
- Retention period
- Tracking scope (which artifact types are enabled)
Estimation formula:
Daily storage ≈ (unique_hashes × 100 bytes) + (unique_domains × 50 bytes) + (unique_ips × 200 bytes)
Total storage ≈ Daily storage × retention_days
Memory Usage
The prevalence cache uses system memory:
- Cache size: Proportional to query frequency and TTL
- Memory per entry: ~500 bytes per cached result
- Typical usage: 10-100MB for most environments
Best Practices
Initial Configuration
- Start with defaults: Use default settings initially
- Monitor performance: Watch query times and storage growth
- Adjust gradually: Make incremental changes based on observations
- Test thresholds: Validate rarity detection accuracy
Ongoing Management
- Regular review: Check settings monthly or quarterly
- Performance monitoring: Track query response times
- Storage cleanup: Monitor disk usage trends
- Threshold tuning: Adjust based on false positive rates
Integration with Detections
Prevalence data is most effective when integrated with detection rules:
# Example detection rule using prevalence
name: "Rare Executable Execution"
query: |
  SELECT * FROM events
  WHERE event_type = 'process_creation'
    AND file_hash_prevalence < 3
condition: "count > 0"
Troubleshooting
Common issues and solutions:
| Issue | Symptoms | Solution |
|---|---|---|
| Slow queries | High response times | Reduce retention period or enable caching |
| High storage usage | Disk space alerts | Lower retention or disable unused tracking |
| Too many rare alerts | High false positive rate | Increase rarity threshold |
| Missing prevalence data | No rare artifacts found | Check artifact extraction and tracking settings |
API Configuration
You can also configure prevalence settings programmatically:
# Get current settings
curl -X GET "http://localhost:8080/api/settings/prevalence"
# Update rarity threshold
curl -X PUT "http://localhost:8080/api/settings/prevalence" \
-H "Content-Type: application/json" \
-d '{"rarity_threshold": 5}'
# Update multiple settings
curl -X PUT "http://localhost:8080/api/settings/prevalence" \
-H "Content-Type: application/json" \
-d '{
"rarity_threshold": 5,
"enable_hash_tracking": true,
"enable_domain_tracking": true,
"enable_ip_tracking": false,
"retention_days": 60,
"cache_ttl_seconds": 120
}'
Related Documentation
- Prevalence User Guide - How to use prevalence data in investigations
- Detection Rules - Integrating prevalence into detection logic
- Risk-Based Alerting - Using prevalence for risk scoring