Prevalence Settings
Prevalence tracking helps identify rare and potentially suspicious artifacts in your environment by monitoring how frequently file hashes, domains, and IP addresses appear across different hosts. This page explains how to configure prevalence tracking settings to optimize detection accuracy and system performance.
Overview
Prevalence tracking works by:
- Collecting artifacts from ingested logs (file hashes, domains, IP addresses)
- Counting occurrences across unique hosts over time
- Flagging rare artifacts that appear on fewer hosts than your configured threshold
- Providing context for security analysts during investigations
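The steps above can be sketched in a few lines of Python. This is an illustrative model, not the product's implementation; the event tuples and the threshold value are hypothetical.

```python
from collections import defaultdict

# Hypothetical log records: (host, artifact_type, artifact_value)
events = [
    ("host-a", "hash", "abc123"),
    ("host-b", "hash", "abc123"),
    ("host-a", "domain", "example.com"),
    ("host-b", "domain", "example.com"),
    ("host-c", "domain", "example.com"),
]

RARITY_THRESHOLD = 3  # flag artifacts seen on fewer than 3 unique hosts

# Count unique hosts per artifact
hosts_seen = defaultdict(set)
for host, kind, value in events:
    hosts_seen[(kind, value)].add(host)

# Flag rare artifacts: the hash appears on 2 hosts, the domain on 3
rare = {a for a, hosts in hosts_seen.items() if len(hosts) < RARITY_THRESHOLD}
print(rare)  # {('hash', 'abc123')}
```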
Configuration Options
Rarity Threshold
The rarity threshold determines when an artifact is flagged as "rare" based on the number of unique hosts where it has been observed.
Setting: Host count threshold (1-1000)
Default: 3 hosts
Recommendation:
- Small environments (< 100 hosts): 2-3 hosts
- Medium environments (100-1000 hosts): 3-5 hosts
- Large environments (> 1000 hosts): 5-10 hosts
Example: With a threshold of 3, a file hash seen on only 2 hosts is flagged as rare, while one seen on 3 or more hosts is considered common.
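The boundary behavior follows from "fewer hosts than your configured threshold": counts below the threshold are rare, counts at or above it are common. A minimal check, assuming that semantics:

```python
def is_rare(host_count: int, threshold: int = 3) -> bool:
    """An artifact is rare when observed on fewer hosts than the threshold."""
    return host_count < threshold

print(is_rare(2))               # True: below the threshold, flagged as rare
print(is_rare(3))               # False: meets the threshold, considered common
print(is_rare(4, threshold=5))  # True: thresholds scale with environment size
```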
Artifact Tracking Options
Control which types of artifacts are tracked for prevalence analysis.
Hash Tracking
Purpose: Track MD5 and SHA256 file hashes
Default: Enabled
Use cases:
- Identify rare executables and scripts
- Detect unique malware samples
- Monitor file distribution patterns
When to disable: If you have high-volume file activity and want to reduce storage overhead.
Domain Tracking
Purpose: Track domains and subdomains from network logs
Default: Enabled
Use cases:
- Identify suspicious or newly registered domains
- Monitor DNS tunneling attempts
- Track command and control infrastructure
When to disable: If you have extensive web browsing logs and want to focus on other artifact types.
IP Address Tracking
Purpose: Track source and destination IP addresses
Default: Enabled
Use cases:
- Identify connections to rare external IPs
- Monitor internal lateral movement patterns
- Detect communication with suspicious infrastructure
When to disable: If you have high-volume network logs or want to focus on application-layer artifacts.
Data Retention
Controls how long prevalence data is stored in the system.
Setting: Retention period (1-365 days)
Default: 90 days
Considerations:
- Longer retention: Better baseline for identifying truly rare artifacts
- Shorter retention: Reduced storage requirements, faster queries
- Recommended: 30-90 days for most environments
Storage impact: Each day of retention stores approximately:
- Hash data: ~1MB per 10,000 unique hashes
- Domain data: ~500KB per 10,000 unique domains
- IP data: ~2MB per 10,000 unique IPs
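A quick sizing calculation using the per-day figures above. The artifact counts here are hypothetical; substitute your own to estimate disk usage over a retention window.

```python
# Illustrative storage estimate; counts are hypothetical.
unique_hashes = 50_000   # ~1 MB per 10,000 hashes
unique_domains = 20_000  # ~0.5 MB per 10,000 domains
unique_ips = 10_000      # ~2 MB per 10,000 IPs
retention_days = 90

daily_mb = (unique_hashes / 10_000) * 1.0 \
         + (unique_domains / 10_000) * 0.5 \
         + (unique_ips / 10_000) * 2.0
total_mb = daily_mb * retention_days

print(daily_mb)  # 8.0 MB per day
print(total_mb)  # 720.0 MB over the retention window
```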
Cache Settings
Controls how long prevalence query results are cached to improve performance.
Setting: Cache TTL (0-3600 seconds)
Default: 60 seconds
Options:
- 0 seconds: Disable caching (always query fresh data)
- 60 seconds: Good balance of performance and freshness
- 300+ seconds: Better performance for high-query environments
When to adjust:
- Increase TTL: High query volume, acceptable data staleness
- Decrease TTL: Need real-time prevalence data, low query volume
- Disable caching: Critical real-time analysis requirements
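The TTL semantics described above can be sketched as a minimal in-memory cache. This is an assumption-laden illustration (class name, structure, and the TTL=0 behavior are inferred from the settings description), not the product's internals.

```python
import time

class PrevalenceCache:
    """Minimal TTL cache sketch; a TTL of 0 disables caching entirely."""

    def __init__(self, ttl_seconds: float = 60):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or time.monotonic() >= entry[1]:
            return None  # missing or expired: caller queries fresh data
        return entry[0]

    def put(self, key, value):
        if self.ttl <= 0:
            return  # TTL of 0: caching disabled, nothing is stored
        self._entries[key] = (value, time.monotonic() + self.ttl)

cache = PrevalenceCache(ttl_seconds=60)
cache.put("hash:abc123", 2)
print(cache.get("hash:abc123"))  # 2 (served from cache until the TTL expires)
```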
Performance Considerations
Query Performance
Prevalence queries can impact system performance, especially with large datasets:
Optimization strategies:
- Enable selective tracking: Disable artifact types you don't need
- Adjust retention: Shorter retention = faster queries
- Use appropriate caching: Balance freshness vs. performance
- Monitor query patterns: Identify and optimize frequent queries
Storage Requirements
Prevalence data storage grows with:
- Number of unique artifacts
- Retention period
- Tracking scope (which artifact types are enabled)
Estimation formula:
Daily storage ≈ (unique_hashes × 100 bytes) + (unique_domains × 50 bytes) + (unique_ips × 200 bytes)
Total storage ≈ Daily storage × retention_days
Memory Usage
The prevalence cache uses system memory:
- Cache size: Proportional to query frequency and TTL
- Memory per entry: ~500 bytes per cached result
- Typical usage: 10-100MB for most environments
Best Practices
Initial Configuration
- Start with defaults: Use default settings initially
- Monitor performance: Watch query times and storage growth
- Adjust gradually: Make incremental changes based on observations
- Test thresholds: Validate rarity detection accuracy
Ongoing Management
- Regular review: Check settings monthly or quarterly
- Performance monitoring: Track query response times
- Storage cleanup: Monitor disk usage trends
- Threshold tuning: Adjust based on false positive rates
Integration with Detections
Prevalence data is most effective when integrated with detection rules:
# Example detection rule using prevalence
name: "Rare Executable Execution"
query: |
  SELECT * FROM events
  WHERE event_type = 'process_creation'
    AND file_hash_prevalence < 3
condition: "count > 0"
Troubleshooting
Common issues and solutions:
| Issue | Symptoms | Solution |
|---|---|---|
| Slow queries | High response times | Reduce retention period or enable caching |
| High storage usage | Disk space alerts | Lower retention or disable unused tracking |
| Too many rare alerts | High false positive rate | Increase rarity threshold |
| Missing prevalence data | No rare artifacts found | Check artifact extraction and tracking settings |
API Configuration
You can also configure prevalence settings programmatically:
# Get current settings
curl -X GET "http://localhost:8080/api/settings/prevalence"
# Update rarity threshold
curl -X PUT "http://localhost:8080/api/settings/prevalence" \
-H "Content-Type: application/json" \
-d '{"rarity_threshold": 5}'
# Update multiple settings
curl -X PUT "http://localhost:8080/api/settings/prevalence" \
-H "Content-Type: application/json" \
-d '{
"rarity_threshold": 5,
"enable_hash_tracking": true,
"enable_domain_tracking": true,
"enable_ip_tracking": false,
"retention_days": 60,
"cache_ttl_seconds": 120
}'
Related Documentation
- Prevalence User Guide - How to use prevalence data in investigations
- Detection Rules - Integrating prevalence into detection logic
- Risk-Based Alerting - Using prevalence for risk scoring