Prevalence Tracking
Prevalence tracking is a powerful security analysis technique that identifies rare or unusual artifacts in your environment. By tracking how frequently file hashes, domains, and IP addresses appear across your infrastructure, you can quickly spot potential threats that stand out from normal activity patterns.
Two Ways to Use Prevalence
nano provides one unified prevalence system with two complementary access patterns. Both return the same metric: host count (number of unique hosts that have seen an artifact).
Quick Comparison
| Capability | Columns (`prevalence_*`) | Command (`\|prevalence`) |
|---|---|---|
| Filter by host count | ✅ `WHERE prevalence_min < 5` | ✅ `\| prevalence hash_prevalence < 5` |
| Filter by first_seen | ❌ | ✅ `\| prevalence domain_first_seen > now() - 24h` |
| Enrich with metadata | ❌ | ✅ `\| prevalence enrich=true` |
| Variable time window | ❌ (fixed 30 days) | ✅ `window=1h\|24h\|7d\|30d` |
| Zero query cost | ✅ Pre-computed at ingest | ❌ On-demand query |
| Use anywhere in query | ✅ `WHERE`, `stats`, `eval`, etc. | ❌ Pipeline stage only |
When to Use Each
Use columns (`prevalence_*`) for fast, simple filtering:

```
# Zero cost - pre-computed at ingest
prevalence_min < 5 | stats count by dest_host

# Works in any WHERE clause
* | where prevalence_dest_domain < 10 AND prevalence_file_hash < 5
```

Use the command (`|prevalence`) for rich analysis:

```
# Filter by first_seen (columns can't do this)
* | prevalence domain_first_seen > now() - 24h

# Enrich results with full prevalence metadata
* | prevalence enrich=true | table file_hash, hash_prevalence, hash_first_seen

# Use a different time window
* | prevalence hash_prevalence < 5 window=7d
```

Combine both for powerful hunting:

```
# Fast pre-filter with columns, then temporal analysis with command
prevalence_min < 100 | prevalence domain_first_seen > now() - 24h enrich=true
```

This finds events with somewhat-rare artifacts, then narrows to domains first seen in the last 24 hours and shows full prevalence details.
Overview
nano's prevalence tracking system automatically monitors and analyzes:
- File Hashes (MD5, SHA1, SHA256) - Identify rare executables and files
- Domains - Detect connections to uncommon or suspicious domains
- IP Addresses - Spot communications with rare external IPs
The system uses ClickHouse materialized views to provide real-time prevalence calculations with minimal performance impact on log ingestion.
How It Works
Data Collection
Prevalence data is automatically extracted from your log streams using materialized views that trigger on every log entry:
- Hash Extraction: File hashes are captured from `file_hash` and `process_hash` fields
- Domain Extraction: Domains are captured from `dest_host`, `query` (DNS), and `url_domain` fields
- IP Extraction: IP addresses are captured from `dest_ip` and `src_ip` fields
- Host Counting: The system tracks unique source hosts that have seen each artifact
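The per-event extraction above can be sketched as follows. This is an illustrative helper (the `extract_artifacts` function is hypothetical — the real extraction happens inside ClickHouse materialized views); the field names come from the list above:

```python
# Map of artifact types to the log fields they are extracted from
# (field names from the docs; this helper is illustrative only).
ARTIFACT_FIELDS = {
    "hash": ["file_hash", "process_hash"],
    "domain": ["dest_host", "query", "url_domain"],
    "ip": ["dest_ip", "src_ip"],
}

def extract_artifacts(event: dict) -> dict:
    """Collect non-empty artifact values from a parsed log event."""
    found = {}
    for artifact_type, fields in ARTIFACT_FIELDS.items():
        values = [event[f] for f in fields if event.get(f)]
        if values:
            found[artifact_type] = values
    return found
```

Empty fields are skipped, so an event only contributes to the prevalence counters for artifacts it actually contains.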
Aggregation Process
Data is aggregated into hourly time buckets with the following metrics:
- Host Count: Number of unique hosts that observed the artifact (using HyperLogLog for efficiency)
- Total Occurrences: Total number of times the artifact was seen
- First Seen: Timestamp when the artifact was first observed
- Last Seen: Timestamp when the artifact was most recently observed
Rarity Classification
Artifacts are classified as "rare" based on host count and a configurable threshold:
- Rare: Below the rarity threshold (default: 3 hosts)
- Common: At or above the threshold
- N/A: Field is empty or not applicable (indicated by value 65535)
Artifact Types
File Hashes
The system automatically detects hash types based on length and validates hex format:
- MD5: 32 hexadecimal characters
- SHA1: 40 hexadecimal characters
- SHA256: 64 hexadecimal characters
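The length-based typing above can be expressed as a small function. This is a sketch of the described logic, not the production code (which runs inside the ClickHouse materialized view):

```python
import re

# Hash type by hex-string length, per the rules above.
HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}
HEX_RE = re.compile(r"[a-fA-F0-9]+")

def detect_hash_type(value: str):
    """Return 'md5', 'sha1', or 'sha256' for a valid hex hash, else None."""
    if not HEX_RE.fullmatch(value):
        return None
    return HASH_LENGTHS.get(len(value))
```

Values that are hex but of an unexpected length (or not hex at all) are rejected rather than guessed at.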
Hash prevalence is particularly useful for:
- Identifying rare malware samples
- Detecting lateral movement with uncommon tools
- Spotting process execution anomalies
Domains
Domain validation includes comprehensive checks:
- Must contain at least one dot (domain.tld format)
- Cannot be an IP address
- Must have valid domain characters (alphanumeric, hyphens, dots)
- TLD must be at least 2 characters and non-numeric
- Maximum length of 253 characters
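The checklist above roughly translates to the following sketch (the shipped validator may differ in detail, e.g. around internationalized domain names):

```python
import ipaddress
import re

# Allowed domain characters per the validation rules above.
DOMAIN_RE = re.compile(r"[A-Za-z0-9.-]+")

def is_valid_domain(value: str) -> bool:
    """Apply the documented domain checks: dot required, not an IP,
    valid characters, TLD >= 2 chars and non-numeric, max 253 chars."""
    if len(value) > 253 or "." not in value:
        return False
    try:
        ipaddress.ip_address(value)
        return False  # bare IP addresses are not domains
    except ValueError:
        pass
    if not DOMAIN_RE.fullmatch(value):
        return False
    tld = value.rsplit(".", 1)[-1]
    return len(tld) >= 2 and not tld.isdigit()
```

The IP check matters because fields like `dest_host` often contain raw IPs, which should feed IP prevalence rather than domain prevalence.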
Domain prevalence helps identify:
- Command and control (C2) communications
- Phishing and malicious websites
- DNS tunneling attempts
- Newly registered domains (NRDs)
IP Addresses
IP addresses are categorized as:
- Public IPs: Internet-routable addresses
- Private IPs: RFC1918 internal addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
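A minimal sketch of the public/private split, using the standard library instead of hand-written prefix checks (note `is_private` is slightly broader than RFC1918 alone — it also covers loopback and link-local ranges):

```python
import ipaddress

def classify_ip(value: str) -> str:
    """Classify an IP string as 'private' or 'public'."""
    addr = ipaddress.ip_address(value)
    # is_private covers 10/8, 172.16/12, 192.168/16 (plus loopback etc.)
    return "private" if addr.is_private else "public"
```

Only public IPs feed the `prevalence_dest_ip` column described later, since internal addresses are common by construction.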
IP prevalence is valuable for:
- Detecting connections to rare external services
- Identifying potential data exfiltration targets
- Spotting scanning and reconnaissance activity
Storage Architecture
ClickHouse Tables
Prevalence data is stored in AggregatingMergeTree tables optimized for real-time aggregation:
```sql
-- Hash prevalence aggregation
CREATE TABLE hash_prevalence_agg (
    file_hash String,
    hash_type LowCardinality(String),
    time_bucket DateTime,
    host_count AggregateFunction(uniq, String),
    first_seen SimpleAggregateFunction(min, DateTime64(6)),
    last_seen SimpleAggregateFunction(max, DateTime64(6)),
    total_count SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(time_bucket)
ORDER BY (file_hash, time_bucket);
```

Materialized Views
Real-time aggregation is handled by materialized views that process incoming logs:
```sql
-- Automatically aggregate hash prevalence
CREATE MATERIALIZED VIEW hash_prevalence_mv TO hash_prevalence_agg AS
SELECT
    lower(file_hash) AS file_hash,
    multiIf(
        length(file_hash) = 32, 'md5',
        length(file_hash) = 40, 'sha1',
        length(file_hash) = 64, 'sha256',
        'unknown'
    ) AS hash_type,
    toStartOfHour(timestamp) AS time_bucket,
    uniqState(coalesce(src_host, src_ip, 'unknown')) AS host_count,
    min(timestamp) AS first_seen,
    max(timestamp) AS last_seen,
    count() AS total_count
FROM logs
WHERE file_hash != '' AND match(file_hash, '^[a-fA-F0-9]+$')
GROUP BY file_hash, hash_type, time_bucket;
```

Data Retention
- Default Retention: 90 days (configurable)
- Partitioning: Monthly partitions for efficient data management
- TTL: Automatic cleanup of old data
- Compression: ClickHouse compression reduces storage requirements
Querying Prevalence
Single Artifact Lookup
Query prevalence for a specific artifact:
```shell
# Hash prevalence
curl -X GET "/api/prevalence/hash/d41d8cd98f00b204e9800998ecf8427e?window=24h"

# Domain prevalence
curl -X GET "/api/prevalence/domain/suspicious.example.com?window=7d"
```

Bulk Lookups
Query multiple artifacts efficiently (up to 100 per request):
```shell
curl -X POST "/api/prevalence/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "artifacts": [
      "d41d8cd98f00b204e9800998ecf8427e",
      "suspicious.example.com",
      "192.168.1.100"
    ],
    "window": "24h"
  }'
```

Time Windows
Supported time windows for prevalence calculations:
- `1h` - Last 1 hour
- `24h` - Last 24 hours
- `7d` - Last 7 days
- `30d` - Last 30 days (default; aligns with the ingest-time columns)
Response Format
Prevalence responses include comprehensive metadata:
```json
{
  "data": {
    "artifact": "suspicious.example.com",
    "artifact_type": "domain",
    "host_count": 2,
    "total_occurrences": 15,
    "first_seen": "2024-01-15T10:30:00Z",
    "last_seen": "2024-01-15T14:22:00Z",
    "is_rare": true
  }
}
```

Ingest-Time Prevalence Enrichment
nano enriches logs with prevalence scores at ingest time using ClickHouse dictionaries. This captures "prevalence at time of event" rather than current prevalence, enabling accurate historical threat hunting.
How It Works
When logs are ingested, ClickHouse automatically looks up prevalence scores from in-memory dictionaries and stores them directly in the logs table:
```
Logs Insert → dictGetOrDefault() in DEFAULT columns → Prevalence scores frozen with event
                              ↓
          Dictionary (in-memory, auto-refresh every 5-10 min)
                              ↓
          Source: *_prevalence_agg tables (filtered to rare artifacts only)
```

Prevalence Fields
The following fields are automatically populated at ingest time:
| Field | Type | Description |
|---|---|---|
| `prevalence_file_hash` | UInt16 | Host count for `file_hash` |
| `prevalence_process_hash` | UInt16 | Host count for `process_hash` |
| `prevalence_dest_domain` | UInt16 | Host count for `dest_host` (domains only, not IPs) |
| `prevalence_dest_ip` | UInt16 | Host count for `dest_ip` (public IPs only) |
| `prevalence_min` | UInt16 | Minimum host count across all fields (for filtering) |
Value interpretation:
- 1-9998: Actual host count (lower = rarer)
- 9999: Common (artifact seen on 1000+ hosts, not tracked individually)
- 65535: N/A (field is empty or not applicable)
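Decoding these sentinel values can be sketched as below (the `describe_prevalence` helper is hypothetical; the sentinel semantics come from the list above):

```python
# Sentinel values from the ingest-time enrichment scheme.
COMMON_SENTINEL = 9999   # artifact seen on 1000+ hosts, not tracked individually
NA_SENTINEL = 65535      # field empty or not applicable

def describe_prevalence(value: int, rarity_threshold: int = 5) -> str:
    """Render a prevalence column value as a human-readable label."""
    if value == NA_SENTINEL:
        return "n/a"
    if value == COMMON_SENTINEL:
        return "common (1000+ hosts)"
    if value < rarity_threshold:
        return f"rare ({value} hosts)"
    return f"{value} hosts"
```

Note that a filter like `prevalence_min < 5` naturally excludes both sentinels, since 9999 and 65535 exceed any realistic threshold.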
Querying Prevalence Fields
These fields can be used directly in search queries:
```
# Find events with rare process hashes (seen on fewer than 5 hosts)
sourcetype = "sysmon" AND prevalence_process_hash < 5

# Find events with any rare artifact
prevalence_min < 10

# Combine with other filters
dest_ip != "" AND prevalence_dest_ip < 20 | table timestamp, dest_ip, prevalence_dest_ip
```

Prevalence Slider
The search interface includes a Prevalence Filter slider that filters results by prevalence_min:
- Location: Next to the search button in the search bar
- Range: 1-100+ (left = very rare, right = show all)
- Presets: Very Rare (fewer than 5), Rare (fewer than 10), Uncommon (fewer than 50), All
When the slider is active:
- The query automatically appends `| where prevalence_min < X`
- Results are filtered to only show events with rare artifacts
- The slider value is persisted in the URL for shareable links
Dictionary Architecture
Prevalence dictionaries are stored in ClickHouse memory for fast lookups:
```sql
-- Example: Hash prevalence dictionary
CREATE DICTIONARY nanosiem.hash_prevalence_dict
(
    file_hash String,
    host_count UInt16 DEFAULT 9999
)
PRIMARY KEY file_hash
SOURCE(CLICKHOUSE(
    QUERY 'SELECT file_hash, toUInt16(least(9998, host_count)) as host_count
           FROM (
               SELECT lower(file_hash) as file_hash, uniqMerge(host_count) as host_count
               FROM nanosiem.hash_prevalence_agg
               WHERE time_bucket >= now() - INTERVAL 30 DAY
               GROUP BY file_hash
           ) WHERE host_count < 1000'
))
LIFETIME(MIN 300 MAX 600) -- Refresh every 5-10 minutes
LAYOUT(SPARSE_HASHED());
```

Key design decisions:
- Only artifacts seen by fewer than 1000 hosts are tracked (all others return the default 9999 = "common")
- The 30-day lookback window matches the `|prevalence` command default
- The dictionary refreshes every 5-10 minutes for near-real-time data

Memory usage: ~100-200MB total for all dictionaries (only rare artifacts are loaded)
Enrichment Integration
Search Results
Prevalence data is automatically displayed in search results:
- Prevalence Fields: Shown in emerald/green color for easy identification
- Score Display: Host counts are shown; the N/A sentinel (65535) is hidden automatically
- Tooltips: Hover for field description
Detection Rules
Use prevalence fields in detection logic:
```yaml
# Example: Detect rare process execution (using pre-computed columns)
name: "Rare Process Execution"
query: |
  sourcetype = "sysmon" AND event_id = 1
  | where prevalence_process_hash < 5
```

Or use the prevalence command for temporal filtering:
```yaml
# Example: Detect newly seen rare domains
name: "New Rare Domain Connection"
query: |
  dest_host != ""
  | prevalence domain_prevalence < 10 domain_first_seen > now() - 24h
```

Alert Context
Prevalence information is included in alert context to help analysts assess threat severity.
Visualization
Prevalence Dashboard
The dedicated prevalence dashboard provides:
- Rare Artifacts Table: Real-time list of artifacts below the rarity threshold
- New Artifacts Table: Recently observed artifacts (first seen in last 24h)
- Statistics Cards: Summary metrics and counts
- Export Functionality: CSV/JSON export of prevalence data
Scatter Plot Analysis
Interactive scatter plots show prevalence patterns:
- X-axis: First seen timestamp
- Y-axis: Host count (prevalence)
- Color Coding: Rarity classification
- Filtering: By artifact type and time window
- Selection: Click artifacts to filter search results
Search Integration
Prevalence panels in search results show:
- Artifact Extraction: Automatic identification of hashes, domains, and IPs
- Prevalence Overlay: Visual indicators on search results
- Drill-down: Click prevalence badges to filter by artifact
Configuration
Settings Management
Prevalence tracking is configurable through the settings API:
```shell
# Get current settings
curl -X GET "/api/settings/prevalence"

# Update rarity threshold
curl -X PUT "/api/settings/prevalence" \
  -H "Content-Type: application/json" \
  -d '{"rarity_threshold": 5}'
```

Configuration Options
| Setting | Description | Default | Range |
|---|---|---|---|
| `rarity_threshold` | Host count below which artifacts are "rare" | 3 | 1-1000 |
| `enable_hash_tracking` | Enable file hash prevalence tracking | true | boolean |
| `enable_domain_tracking` | Enable domain prevalence tracking | true | boolean |
| `enable_ip_tracking` | Enable IP address prevalence tracking | true | boolean |
| `retention_days` | Days to retain prevalence data | 90 | 1-365 |
| `cache_ttl_seconds` | Cache TTL for prevalence queries | 60 | 0-3600 |
Hot Reload
Configuration changes are applied immediately without service restart. The system automatically:
- Updates rarity thresholds for new queries
- Enables/disables tracking for new log entries
- Adjusts cache TTL for subsequent requests
- Applies retention policies on next cleanup cycle
Performance Considerations
Ingestion Impact
Prevalence tracking adds minimal overhead to log ingestion:
- Materialized Views: Process data during insert, not query time
- Incremental Updates: Only new data triggers aggregation
- Efficient Storage: AggregatingMergeTree optimizes for aggregation queries
Query Performance
Prevalence queries are optimized for speed:
- Bloom Filters: Fast artifact lookups using bloom filter indexes
- Partitioning: Monthly partitions reduce scan scope
- Caching: In-memory LRU cache with configurable TTL
- Bulk Operations: Single queries for multiple artifacts
Memory Usage
The system uses memory-efficient data structures:
- HyperLogLog: Approximate unique counting with ±2% error
- Compression: ClickHouse compression reduces memory footprint
- Cache Limits: Configurable cache size prevents memory exhaustion
Use Cases
Threat Hunting
Prevalence data enables effective threat hunting:
- Rare Hash Analysis: Identify uncommon executables that may be malware
- Domain Reputation: Spot connections to newly registered or rare domains
- Lateral Movement: Detect rare tools used for privilege escalation
- Data Exfiltration: Identify connections to unusual external IPs
Incident Response
During incident response, prevalence helps:
- Scope Assessment: Determine how widespread an artifact is
- Timeline Analysis: Understand when artifacts first appeared
- Impact Evaluation: Count affected hosts and occurrences
- IOC Validation: Verify if indicators are truly malicious or common
Security Operations
For daily security operations:
- Alert Triage: Prioritize alerts involving rare artifacts
- False Positive Reduction: Filter out common, benign artifacts
- Baseline Establishment: Understand normal prevalence patterns
- Anomaly Detection: Spot deviations from expected prevalence
Best Practices
Threshold Tuning
- Start with default threshold (3 hosts) and adjust based on environment size
- Larger environments may need higher thresholds to reduce noise
- Monitor rare artifact counts and adjust to maintain manageable volumes
Data Quality
- Ensure consistent field mapping across log sources
- Validate hash formats and domain names in parsing rules
- Use normalized field names (UDM) for consistent prevalence tracking
Performance Optimization
- Monitor ClickHouse resource usage during high ingestion periods
- Adjust cache TTL based on query patterns and data freshness needs
- Use bulk APIs for multiple artifact lookups to reduce query overhead
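Client-side batching for the bulk endpoint can be sketched as follows (the helper is hypothetical; the 100-artifact-per-request limit and the `window` field come from the Bulk Lookups section):

```python
def build_bulk_payloads(artifacts, window="24h", max_per_request=100):
    """Split a large artifact list into bulk-API request bodies,
    respecting the 100-artifacts-per-request limit."""
    return [
        {"artifacts": artifacts[i:i + max_per_request], "window": window}
        for i in range(0, len(artifacts), max_per_request)
    ]
```

Each returned dict can be POSTed to `/api/prevalence/bulk` as-is, so a list of any size maps to the minimum number of requests.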
Integration Patterns
- Include prevalence checks in automated analysis workflows
- Export prevalence data for integration with external threat intelligence
- Use prevalence scores in risk-based alerting and scoring systems
Troubleshooting
Common Issues
No prevalence data appearing:
- Verify ClickHouse is running and accessible
- Check materialized view creation and data flow
- Ensure log fields contain valid hashes/domains/IPs
High memory usage:
- Reduce cache TTL or size limits
- Check for runaway queries or bulk operations
- Monitor ClickHouse memory usage and tune accordingly
Slow prevalence queries:
- Verify bloom filter indexes are created
- Check time window scope (shorter windows are faster)
- Monitor ClickHouse query performance metrics
Monitoring
Key metrics to monitor:
- Ingestion Rate: Logs processed per second by materialized views
- Cache Hit Rate: Percentage of queries served from cache
- Query Latency: Response time for prevalence API calls
- Storage Growth: Disk usage of prevalence aggregation tables
The prevalence tracking system provides powerful capabilities for identifying rare and suspicious artifacts in your environment. By understanding normal prevalence patterns, security teams can quickly spot anomalies that may indicate threats or malicious activity.