Prevalence Tracking
Prevalence tracking is a powerful security analysis technique that identifies rare or unusual artifacts in your environment. By tracking how frequently file hashes, domains, and IP addresses appear across your infrastructure, you can quickly spot potential threats that stand out from normal activity patterns.
Two Ways to Use Prevalence
nano provides one unified prevalence system with two complementary access patterns. Both return the same metric: host count (number of unique hosts that have seen an artifact).
Quick Comparison
| Capability | Columns (`prevalence_*`) | Command (`\|prevalence`) |
|---|---|---|
| Filter by host count | ✅ `WHERE prevalence_min < 5` | ✅ `\| prevalence hash_prevalence < 5` |
| Filter by first_seen | ❌ | ✅ `\| prevalence domain_first_seen > now() - 24h` |
| Enrich with metadata | ❌ | ✅ `\| prevalence enrich=true` |
| Variable time window | ❌ (fixed 30 days) | ✅ `window=1h\|24h\|7d\|30d` |
| Zero query cost | ✅ Pre-computed at ingest | ❌ On-demand query |
| Use anywhere in query | ✅ `WHERE`, `stats`, `eval`, etc. | ❌ Pipeline stage only |
When to Use Each
Use columns (`prevalence_*`) for fast, simple filtering:

```
# Zero cost - pre-computed at ingest
prevalence_min < 5 | stats count by dest_host

# Works in any WHERE clause
* | where prevalence_dest_domain < 10 AND prevalence_file_hash < 5
```

Use the command (`|prevalence`) for rich analysis:

```
# Filter by first_seen (columns can't do this)
* | prevalence domain_first_seen > now() - 24h

# Enrich results with full prevalence metadata
* | prevalence enrich=true | table file_hash, hash_prevalence, hash_first_seen

# Use a different time window
* | prevalence hash_prevalence < 5 window=7d
```

Combine both for powerful hunting:

```
# Fast pre-filter with columns, then temporal analysis with command
prevalence_min < 100 | prevalence domain_first_seen > now() - 24h enrich=true
```

This finds events with somewhat-rare artifacts, then narrows to domains first seen in the last 24 hours and shows full prevalence details.
Overview
nano's prevalence tracking system automatically monitors and analyzes:
- File Hashes (MD5, SHA1, SHA256) - Identify rare executables and files
- Domains - Detect connections to uncommon or suspicious domains
- IP Addresses - Spot communications with rare external IPs
The system uses ClickHouse materialized views to provide real-time prevalence calculations with minimal performance impact on log ingestion.
How It Works
Data Collection
Prevalence data is automatically extracted from your log streams using materialized views that trigger on every log entry:
- Hash Extraction: File hashes are captured from `file_hash` and `process_hash` fields
- Domain Extraction: Domains are captured from `dest_host`, `query` (DNS), and `url_domain` fields
- IP Extraction: IP addresses are captured from `dest_ip` and `src_ip` fields
- Host Counting: The system tracks unique source hosts that have seen each artifact
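The per-event extraction above can be sketched as follows. This is an illustrative helper (the `extract_artifacts` function is hypothetical — the real extraction happens inside ClickHouse materialized views); the field names come from the list above:

```python
# Map of artifact types to the log fields they are extracted from
# (field names from the docs; this helper is illustrative only).
ARTIFACT_FIELDS = {
    "hash": ["file_hash", "process_hash"],
    "domain": ["dest_host", "query", "url_domain"],
    "ip": ["dest_ip", "src_ip"],
}

def extract_artifacts(event: dict) -> dict:
    """Collect non-empty artifact values from a parsed log event."""
    found = {}
    for artifact_type, fields in ARTIFACT_FIELDS.items():
        values = [event[f] for f in fields if event.get(f)]
        if values:
            found[artifact_type] = values
    return found
```

Empty fields are skipped, so an event only contributes to the prevalence counters for artifacts it actually contains.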
Aggregation Process
Data is aggregated into hourly time buckets with the following metrics:
- Host Count: Number of unique hosts that observed the artifact (using HyperLogLog for efficiency)
- Total Occurrences: Total number of times the artifact was seen
- First Seen: Timestamp when the artifact was first observed
- Last Seen: Timestamp when the artifact was most recently observed
Rarity Classification
Artifacts are classified as "rare" based on host count and a configurable threshold:
- Rare: Below the rarity threshold (default: 3 hosts)
- Common: At or above the threshold
- N/A: Field is empty or not applicable (indicated by value 65535)
Artifact Types
File Hashes
The system automatically detects hash types based on length and validates hex format:
- MD5: 32 hexadecimal characters
- SHA1: 40 hexadecimal characters
- SHA256: 64 hexadecimal characters
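The length-based typing above can be expressed as a small function. This is a sketch of the described logic, not the production code (which runs inside the ClickHouse materialized view):

```python
import re

# Hash type by hex-string length, per the rules above.
HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}
HEX_RE = re.compile(r"[a-fA-F0-9]+")

def detect_hash_type(value: str):
    """Return 'md5', 'sha1', or 'sha256' for a valid hex hash, else None."""
    if not HEX_RE.fullmatch(value):
        return None
    return HASH_LENGTHS.get(len(value))
```

Values that are hex but of an unexpected length (or not hex at all) are rejected rather than guessed at.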
Hash prevalence is particularly useful for:
- Identifying rare malware samples
- Detecting lateral movement with uncommon tools
- Spotting process execution anomalies
Domains
Domain validation includes comprehensive checks:
- Must contain at least one dot (domain.tld format)
- Cannot be an IP address
- Must have valid domain characters (alphanumeric, hyphens, dots)
- TLD must be at least 2 characters and non-numeric
- Maximum length of 253 characters
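The checklist above roughly translates to the following sketch (the shipped validator may differ in detail, e.g. around internationalized domain names):

```python
import ipaddress
import re

# Allowed domain characters per the validation rules above.
DOMAIN_RE = re.compile(r"[A-Za-z0-9.-]+")

def is_valid_domain(value: str) -> bool:
    """Apply the documented domain checks: dot required, not an IP,
    valid characters, TLD >= 2 chars and non-numeric, max 253 chars."""
    if len(value) > 253 or "." not in value:
        return False
    try:
        ipaddress.ip_address(value)
        return False  # bare IP addresses are not domains
    except ValueError:
        pass
    if not DOMAIN_RE.fullmatch(value):
        return False
    tld = value.rsplit(".", 1)[-1]
    return len(tld) >= 2 and not tld.isdigit()
```

The IP check matters because fields like `dest_host` often contain raw IPs, which should feed IP prevalence rather than domain prevalence.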
Domain prevalence helps identify:
- Command and control (C2) communications
- Phishing and malicious websites
- DNS tunneling attempts
- Newly registered domains (NRDs)
IP Addresses
IP addresses are categorized as:
- Public IPs: Internet-routable addresses
- Private IPs: RFC1918 internal addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
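A minimal sketch of the public/private split, using the standard library instead of hand-written prefix checks (note `is_private` is slightly broader than RFC1918 alone — it also covers loopback and link-local ranges):

```python
import ipaddress

def classify_ip(value: str) -> str:
    """Classify an IP string as 'private' or 'public'."""
    addr = ipaddress.ip_address(value)
    # is_private covers 10/8, 172.16/12, 192.168/16 (plus loopback etc.)
    return "private" if addr.is_private else "public"
```

Only public IPs feed the `prevalence_dest_ip` column described later, since internal addresses are common by construction.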
IP prevalence is valuable for:
- Detecting connections to rare external services
- Identifying potential data exfiltration targets
- Spotting scanning and reconnaissance activity
Storage Architecture
ClickHouse Tables
Prevalence data is stored in AggregatingMergeTree tables optimized for real-time aggregation:
```sql
-- Hash prevalence aggregation
CREATE TABLE hash_prevalence_agg (
    file_hash String,
    hash_type LowCardinality(String),
    time_bucket DateTime,
    host_count AggregateFunction(uniq, String),
    first_seen SimpleAggregateFunction(min, DateTime64(6)),
    last_seen SimpleAggregateFunction(max, DateTime64(6)),
    total_count SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(time_bucket)
ORDER BY (file_hash, time_bucket);
```

Materialized Views
Real-time aggregation is handled by materialized views that process incoming logs:
```sql
-- Automatically aggregate hash prevalence
CREATE MATERIALIZED VIEW hash_prevalence_mv TO hash_prevalence_agg AS
SELECT
    lower(file_hash) AS file_hash,
    multiIf(
        length(file_hash) = 32, 'md5',
        length(file_hash) = 40, 'sha1',
        length(file_hash) = 64, 'sha256',
        'unknown'
    ) AS hash_type,
    toStartOfHour(timestamp) AS time_bucket,
    uniqState(coalesce(src_host, src_ip, 'unknown')) AS host_count,
    min(timestamp) AS first_seen,
    max(timestamp) AS last_seen,
    count() AS total_count
FROM logs
WHERE file_hash != '' AND match(file_hash, '^[a-fA-F0-9]+$')
GROUP BY file_hash, hash_type, time_bucket;
```

Data Retention
- Default Retention: 90 days (configurable)
- Partitioning: Monthly partitions for efficient data management
- TTL: Automatic cleanup of old data
- Compression: ClickHouse compression reduces storage requirements
Querying Prevalence
Single Artifact Lookup
Query prevalence for a specific artifact:
```shell
# Hash prevalence
curl -X GET "/api/prevalence/hash/d41d8cd98f00b204e9800998ecf8427e?window=24h"

# Domain prevalence
curl -X GET "/api/prevalence/domain/suspicious.example.com?window=7d"
```

Bulk Lookups
Query multiple artifacts efficiently (up to 100 per request):
```shell
curl -X POST "/api/prevalence/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "artifacts": [
      "d41d8cd98f00b204e9800998ecf8427e",
      "suspicious.example.com",
      "192.168.1.100"
    ],
    "window": "24h"
  }'
```

Time Windows
Supported time windows for prevalence calculations:
- `1h` - Last 1 hour
- `24h` - Last 24 hours
- `7d` - Last 7 days
- `30d` - Last 30 days (default; aligns with the ingest-time columns)
Response Format
Prevalence responses include comprehensive metadata:
```json
{
  "data": {
    "artifact": "suspicious.example.com",
    "artifact_type": "domain",
    "host_count": 2,
    "total_occurrences": 15,
    "first_seen": "2024-01-15T10:30:00Z",
    "last_seen": "2024-01-15T14:22:00Z",
    "is_rare": true
  }
}
```

Ingest-Time Prevalence Enrichment
nano enriches logs with prevalence scores at ingest time using ClickHouse dictionaries. This captures "prevalence at time of event" rather than current prevalence, enabling accurate historical threat hunting.
How It Works
When logs are ingested, ClickHouse automatically looks up prevalence scores from in-memory dictionaries and stores them directly in the logs table:
```
Logs Insert → dictGetOrDefault() in DEFAULT columns → Prevalence scores frozen with event
                              ↓
          Dictionary (in-memory, auto-refresh every 5-10 min)
                              ↓
          Source: *_prevalence_agg tables (filtered to rare artifacts only)
```

Prevalence Fields
The following fields are automatically populated at ingest time:
| Field | Type | Description |
|---|---|---|
| `prevalence_file_hash` | UInt16 | Host count for `file_hash` |
| `prevalence_process_hash` | UInt16 | Host count for `process_hash` |
| `prevalence_dest_domain` | UInt16 | Host count for `dest_host` (domains only, not IPs) |
| `prevalence_dest_ip` | UInt16 | Host count for `dest_ip` (public IPs only) |
| `prevalence_min` | UInt16 | Minimum host count across all fields (for filtering) |
Value interpretation:
- 1-9998: Actual host count (lower = rarer)
- 9999: Common (artifact seen on 1000+ hosts, not tracked individually)
- 65535: N/A (field is empty or not applicable)
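Decoding these sentinel values can be sketched as below (the `describe_prevalence` helper is hypothetical; the sentinel semantics come from the list above):

```python
# Sentinel values from the ingest-time enrichment scheme.
COMMON_SENTINEL = 9999   # artifact seen on 1000+ hosts, not tracked individually
NA_SENTINEL = 65535      # field empty or not applicable

def describe_prevalence(value: int, rarity_threshold: int = 5) -> str:
    """Render a prevalence column value as a human-readable label."""
    if value == NA_SENTINEL:
        return "n/a"
    if value == COMMON_SENTINEL:
        return "common (1000+ hosts)"
    if value < rarity_threshold:
        return f"rare ({value} hosts)"
    return f"{value} hosts"
```

Note that a filter like `prevalence_min < 5` naturally excludes both sentinels, since 9999 and 65535 exceed any realistic threshold.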
Querying Prevalence Fields
These fields can be used directly in search queries:
```
# Find events with rare process hashes (seen on fewer than 5 hosts)
sourcetype = "sysmon" AND prevalence_process_hash < 5

# Find events with any rare artifact
prevalence_min < 10

# Combine with other filters
dest_ip != "" AND prevalence_dest_ip < 20 | table timestamp, dest_ip, prevalence_dest_ip
```

Prevalence Slider
The search interface includes a Prevalence Filter slider that filters results by prevalence_min:
- Location: Next to the search button in the search bar
- Range: 1-100+ (left = very rare, right = show all)
- Presets: Very Rare (fewer than 5), Rare (fewer than 10), Uncommon (fewer than 50), All
When the slider is active:
- The query automatically appends `| where prevalence_min < X`
- Results are filtered to only show events with rare artifacts
- The slider value is persisted in the URL for shareable links
Dictionary Architecture
Prevalence dictionaries are stored in ClickHouse memory for fast lookups:
```sql
-- Example: Hash prevalence dictionary
CREATE DICTIONARY nanosiem.hash_prevalence_dict
(
    file_hash String,
    host_count UInt16 DEFAULT 9999
)
PRIMARY KEY file_hash
SOURCE(CLICKHOUSE(
    QUERY 'SELECT file_hash, toUInt16(least(9998, host_count)) as host_count
           FROM (
               SELECT lower(file_hash) as file_hash, uniqMerge(host_count) as host_count
               FROM nanosiem.hash_prevalence_agg
               WHERE time_bucket >= now() - INTERVAL 30 DAY
               GROUP BY file_hash
           ) WHERE host_count < 1000'
))
LIFETIME(MIN 300 MAX 600) -- Refresh every 5-10 minutes
LAYOUT(SPARSE_HASHED());
```

Key design decisions:
- Only artifacts seen by fewer than 1000 hosts are tracked (all others return the default 9999 = "common")
- The 30-day lookback window matches the `|prevalence` command default
- The dictionary refreshes every 5-10 minutes for near-real-time data

Memory usage: ~100-200MB total for all dictionaries (only rare artifacts are loaded)
Enrichment Integration
Search Results
Prevalence data is automatically displayed in search results:
- Prevalence Fields: Shown in emerald/green color for easy identification
- Score Display: Host counts are shown; the N/A sentinel (65535) is hidden automatically
- Tooltips: Hover for field description
Detection Rules
Use prevalence fields in detection logic:
```yaml
# Example: Detect rare process execution (using pre-computed columns)
name: "Rare Process Execution"
query: |
  sourcetype = "sysmon" AND event_id = 1
  | where prevalence_process_hash < 5
```

Or use the prevalence command for temporal filtering:
```yaml
# Example: Detect newly seen rare domains
name: "New Rare Domain Connection"
query: |
  dest_host != ""
  | prevalence domain_prevalence < 10 domain_first_seen > now() - 24h
```

Alert Context
Prevalence information is included in alert context to help analysts assess threat severity.
Visualization
Prevalence Dashboard
The dedicated prevalence dashboard provides:
- Rare Artifacts Table: Real-time list of artifacts below the rarity threshold
- New Artifacts Table: Recently observed artifacts (first seen in last 24h)
- Statistics Cards: Summary metrics and counts
- Export Functionality: CSV/JSON export of prevalence data
Scatter Plot Analysis
Interactive scatter plots show prevalence patterns:
- X-axis: First seen timestamp
- Y-axis: Host count (prevalence)
- Color Coding: Rarity classification
- Filtering: By artifact type and time window
- Selection: Click artifacts to filter search results
Search Integration
Prevalence panels in search results show:
- Artifact Extraction: Automatic identification of hashes, domains, and IPs
- Prevalence Overlay: Visual indicators on search results
- Drill-down: Click prevalence badges to filter by artifact
Configuration
Settings Management
Prevalence tracking is configurable through the settings API:
```shell
# Get current settings
curl -X GET "/api/settings/prevalence"

# Update rarity threshold
curl -X PUT "/api/settings/prevalence" \
  -H "Content-Type: application/json" \
  -d '{"rarity_threshold": 5}'
```

Configuration Options
| Setting | Description | Default | Range |
|---|---|---|---|
| `rarity_threshold` | Host count below which artifacts are "rare" | 3 | 1-1000 |
| `enable_hash_tracking` | Enable file hash prevalence tracking | true | boolean |
| `enable_domain_tracking` | Enable domain prevalence tracking | true | boolean |
| `enable_ip_tracking` | Enable IP address prevalence tracking | true | boolean |
| `retention_days` | Days to retain prevalence data | 90 | 1-365 |
| `cache_ttl_seconds` | Cache TTL for prevalence queries | 60 | 0-3600 |
Hot Reload
Configuration changes are applied immediately without service restart. The system automatically:
- Updates rarity thresholds for new queries
- Enables/disables tracking for new log entries
- Adjusts cache TTL for subsequent requests
- Applies retention policies on next cleanup cycle
Performance Considerations
Ingestion Impact
Prevalence tracking adds minimal overhead to log ingestion:
- Materialized Views: Process data during insert, not query time
- Incremental Updates: Only new data triggers aggregation
- Efficient Storage: AggregatingMergeTree optimizes for aggregation queries
Query Performance
Prevalence queries are optimized for speed:
- Bloom Filters: Fast artifact lookups using bloom filter indexes
- Partitioning: Monthly partitions reduce scan scope
- Caching: In-memory LRU cache with configurable TTL
- Bulk Operations: Single queries for multiple artifacts
Memory Usage
The system uses memory-efficient data structures:
- HyperLogLog: Approximate unique counting with ±2% error
- Compression: ClickHouse compression reduces memory footprint
- Cache Limits: Configurable cache size prevents memory exhaustion
Use Cases
Threat Hunting
Prevalence data enables effective threat hunting:
- Rare Hash Analysis: Identify uncommon executables that may be malware
- Domain Reputation: Spot connections to newly registered or rare domains
- Lateral Movement: Detect rare tools used for privilege escalation
- Data Exfiltration: Identify connections to unusual external IPs
Incident Response
During incident response, prevalence helps:
- Scope Assessment: Determine how widespread an artifact is
- Timeline Analysis: Understand when artifacts first appeared
- Impact Evaluation: Count affected hosts and occurrences
- IOC Validation: Verify if indicators are truly malicious or common
Security Operations
For daily security operations:
- Alert Triage: Prioritize alerts involving rare artifacts
- False Positive Reduction: Filter out common, benign artifacts
- Baseline Establishment: Understand normal prevalence patterns
- Anomaly Detection: Spot deviations from expected prevalence
Best Practices
Threshold Tuning
- Start with default threshold (3 hosts) and adjust based on environment size
- Larger environments may need higher thresholds to reduce noise
- Monitor rare artifact counts and adjust to maintain manageable volumes
Data Quality
- Ensure consistent field mapping across log sources
- Validate hash formats and domain names in parsing rules
- Use normalized field names (UDM) for consistent prevalence tracking
Performance Optimization
- Monitor ClickHouse resource usage during high ingestion periods
- Adjust cache TTL based on query patterns and data freshness needs
- Use bulk APIs for multiple artifact lookups to reduce query overhead
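Client-side batching for the bulk endpoint can be sketched as follows (the helper is hypothetical; the 100-artifact-per-request limit and the `window` field come from the Bulk Lookups section):

```python
def build_bulk_payloads(artifacts, window="24h", max_per_request=100):
    """Split a large artifact list into bulk-API request bodies,
    respecting the 100-artifacts-per-request limit."""
    return [
        {"artifacts": artifacts[i:i + max_per_request], "window": window}
        for i in range(0, len(artifacts), max_per_request)
    ]
```

Each returned dict can be POSTed to `/api/prevalence/bulk` as-is, so a list of any size maps to the minimum number of requests.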
Integration Patterns
- Include prevalence checks in automated analysis workflows
- Export prevalence data for integration with external threat intelligence
- Use prevalence scores in risk-based alerting and scoring systems
Troubleshooting
Common Issues
No prevalence data appearing:
- Verify ClickHouse is running and accessible
- Check materialized view creation and data flow
- Ensure log fields contain valid hashes/domains/IPs
High memory usage:
- Reduce cache TTL or size limits
- Check for runaway queries or bulk operations
- Monitor ClickHouse memory usage and tune accordingly
Slow prevalence queries:
- Verify bloom filter indexes are created
- Check time window scope (shorter windows are faster)
- Monitor ClickHouse query performance metrics
Monitoring
Key metrics to monitor:
- Ingestion Rate: Logs processed per second by materialized views
- Cache Hit Rate: Percentage of queries served from cache
- Query Latency: Response time for prevalence API calls
- Storage Growth: Disk usage of prevalence aggregation tables
The prevalence tracking system provides powerful capabilities for identifying rare and suspicious artifacts in your environment. By understanding normal prevalence patterns, security teams can quickly spot anomalies that may indicate threats or malicious activity.