nano SIEM
User Guide

Prevalence Tracking

Prevalence tracking is a powerful security analysis technique that identifies rare or unusual artifacts in your environment. By tracking how frequently file hashes, domains, and IP addresses appear across your infrastructure, you can quickly spot potential threats that stand out from normal activity patterns.

Two Ways to Use Prevalence

nano provides one unified prevalence system with two complementary access patterns. Both return the same metric: host count (number of unique hosts that have seen an artifact).

Quick Comparison

| Capability             | Columns (prevalence_*)       | Command (\|prevalence)                        |
|------------------------|------------------------------|-----------------------------------------------|
| Filter by host count   | WHERE prevalence_min < 5     | \| prevalence hash_prevalence < 5             |
| Filter by first_seen   | ❌                           | \| prevalence domain_first_seen > now() - 24h |
| Enrich with metadata   | ❌                           | \| prevalence enrich=true                     |
| Variable time window   | ❌ (fixed 30 days)           | window=1h\|24h\|7d\|30d                       |
| Zero query cost        | ✅ Pre-computed at ingest    | ❌ On-demand query                            |
| Use anywhere in query  | ✅ WHERE, stats, eval, etc.  | ❌ Pipeline stage only                        |

When to Use Each

Use columns (prevalence_*) for fast, simple filtering:

# Zero cost - pre-computed at ingest
prevalence_min < 5 | stats count by dest_host

# Works in any WHERE clause
* | where prevalence_dest_domain < 10 AND prevalence_file_hash < 5

Use the command (|prevalence) for rich analysis:

# Filter by first_seen (columns can't do this)
* | prevalence domain_first_seen > now() - 24h

# Enrich results with full prevalence metadata
* | prevalence enrich=true | table file_hash, hash_prevalence, hash_first_seen

# Use a different time window
* | prevalence hash_prevalence < 5 window=7d

Combine both for powerful hunting:

# Fast pre-filter with columns, then temporal analysis with command
prevalence_min < 100 | prevalence domain_first_seen > now() - 24h enrich=true

This finds events with somewhat rare artifacts, then filters to domains first seen in the last 24 hours and shows full prevalence details.

Overview

nano's prevalence tracking system automatically monitors and analyzes:

  • File Hashes (MD5, SHA1, SHA256) - Identify rare executables and files
  • Domains - Detect connections to uncommon or suspicious domains
  • IP Addresses - Spot communications with rare external IPs

The system uses ClickHouse materialized views to provide real-time prevalence calculations with minimal performance impact on log ingestion.

How It Works

Data Collection

Prevalence data is automatically extracted from your log streams using materialized views that trigger on every log entry:

  1. Hash Extraction: File hashes are captured from file_hash and process_hash fields
  2. Domain Extraction: Domains are captured from dest_host, query (DNS), and url_domain fields
  3. IP Extraction: IP addresses are captured from dest_ip and src_ip fields
  4. Host Counting: The system tracks unique source hosts that have seen each artifact

Aggregation Process

Data is aggregated into hourly time buckets with the following metrics:

  • Host Count: Number of unique hosts that observed the artifact (using HyperLogLog for efficiency)
  • Total Occurrences: Total number of times the artifact was seen
  • First Seen: Timestamp when the artifact was first observed
  • Last Seen: Timestamp when the artifact was most recently observed

Rarity Classification

Artifacts are classified as "rare" based on host count and a configurable threshold:

  • Rare: Below the rarity threshold (default: 3 hosts; see rarity_threshold under Configuration)
  • Common: At or above the threshold
  • N/A: Field is empty or not applicable (indicated by value 65535)
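The classification above can be sketched in a few lines. This is an illustrative example, not the shipped implementation; the function name is hypothetical, and only the 65535 N/A sentinel comes from this guide.

```python
NA_SENTINEL = 65535  # field is empty or not applicable

def classify_rarity(host_count: int, threshold: int) -> str:
    """Map a host count to the rarity classes described above.

    threshold corresponds to the configurable rarity_threshold setting.
    """
    if host_count == NA_SENTINEL:
        return "n/a"
    return "rare" if host_count < threshold else "common"
```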

Artifact Types

File Hashes

The system automatically detects hash types based on length and validates hex format:

  • MD5: 32 hexadecimal characters
  • SHA1: 40 hexadecimal characters
  • SHA256: 64 hexadecimal characters
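The length-plus-hex detection above amounts to a simple lookup. A minimal sketch (the function name is hypothetical, not the product's API):

```python
import re

_HEX = re.compile(r"[a-fA-F0-9]+")
_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}

def detect_hash_type(value: str) -> str:
    """Return 'md5', 'sha1', or 'sha256'; 'unknown' otherwise."""
    if not _HEX.fullmatch(value):
        return "unknown"  # empty or contains non-hex characters
    return _LENGTHS.get(len(value), "unknown")
```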

Hash prevalence is particularly useful for:

  • Identifying rare malware samples
  • Detecting lateral movement with uncommon tools
  • Spotting process execution anomalies

Domains

Domain validation includes comprehensive checks:

  • Must contain at least one dot (domain.tld format)
  • Cannot be an IP address
  • Must have valid domain characters (alphanumeric, hyphens, dots)
  • TLD must be at least 2 characters and non-numeric
  • Maximum length of 253 characters
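The checks listed above can be sketched as a single validator. This is an illustrative approximation; the product's exact validation logic may differ (e.g. stricter hyphen-placement rules).

```python
import re

_LABEL = re.compile(r"[A-Za-z0-9-]+")

def is_valid_domain(value: str) -> bool:
    """Apply the domain validation rules described above."""
    if not value or len(value) > 253 or "." not in value:
        return False
    labels = value.split(".")
    if any(not _LABEL.fullmatch(label) for label in labels):
        return False  # empty label or invalid characters
    if all(label.isdigit() for label in labels):
        return False  # looks like an IPv4 address, not a domain
    tld = labels[-1]
    return len(tld) >= 2 and not tld.isdigit()
```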

Domain prevalence helps identify:

  • Command and control (C2) communications
  • Phishing and malicious websites
  • DNS tunneling attempts
  • Newly registered domains (NRDs)

IP Addresses

IP addresses are categorized as:

  • Public IPs: Internet-routable addresses
  • Private IPs: RFC1918 internal addresses (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
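A minimal sketch of this public/private split using the Python standard library. Note that is_private covers RFC1918 plus a few other reserved ranges (loopback, link-local), which is a reasonable approximation here; the function name is hypothetical.

```python
from ipaddress import ip_address

def categorize_ip(value: str) -> str:
    """Classify an IP as 'private' (RFC1918 and other reserved
    ranges) or 'public' (internet-routable)."""
    return "private" if ip_address(value).is_private else "public"
```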

IP prevalence is valuable for:

  • Detecting connections to rare external services
  • Identifying potential data exfiltration targets
  • Spotting scanning and reconnaissance activity

Storage Architecture

ClickHouse Tables

Prevalence data is stored in AggregatingMergeTree tables optimized for real-time aggregation:

-- Hash prevalence aggregation
CREATE TABLE hash_prevalence_agg (
    file_hash String,
    hash_type LowCardinality(String),
    time_bucket DateTime,
    host_count AggregateFunction(uniq, String),
    first_seen SimpleAggregateFunction(min, DateTime64(6)),
    last_seen SimpleAggregateFunction(max, DateTime64(6)),
    total_count SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(time_bucket)
ORDER BY (file_hash, time_bucket);
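Because host_count is an AggregateFunction column, it stores intermediate states rather than final values; reads must finalize it with the -Merge combinator. A sketch of a query against this table (illustrative, not a shipped query):

```sql
-- Finalize aggregate states to get per-hash prevalence over 24h
SELECT
    file_hash,
    uniqMerge(host_count) AS host_count,
    min(first_seen)       AS first_seen,
    max(last_seen)        AS last_seen,
    sum(total_count)      AS total_count
FROM hash_prevalence_agg
WHERE time_bucket >= now() - INTERVAL 24 HOUR
GROUP BY file_hash
HAVING host_count < 5;
```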

Materialized Views

Real-time aggregation is handled by materialized views that process incoming logs:

-- Automatically aggregate hash prevalence
CREATE MATERIALIZED VIEW hash_prevalence_mv TO hash_prevalence_agg AS
SELECT
    lower(file_hash) AS file_hash,
    multiIf(
        length(file_hash) = 32, 'md5',
        length(file_hash) = 40, 'sha1', 
        length(file_hash) = 64, 'sha256',
        'unknown'
    ) AS hash_type,
    toStartOfHour(timestamp) AS time_bucket,
    uniqState(coalesce(src_host, src_ip, 'unknown')) AS host_count,
    min(timestamp) AS first_seen,
    max(timestamp) AS last_seen,
    count() AS total_count
FROM logs
WHERE file_hash != '' AND match(file_hash, '^[a-fA-F0-9]+$')
GROUP BY file_hash, hash_type, time_bucket;

Data Retention

  • Default Retention: 90 days (configurable)
  • Partitioning: Monthly partitions for efficient data management
  • TTL: Automatic cleanup of old data
  • Compression: ClickHouse compression reduces storage requirements

Querying Prevalence

Single Artifact Lookup

Query prevalence for a specific artifact:

# Hash prevalence
curl -X GET "/api/prevalence/hash/d41d8cd98f00b204e9800998ecf8427e?window=24h"

# Domain prevalence  
curl -X GET "/api/prevalence/domain/suspicious.example.com?window=7d"

Bulk Lookups

Query multiple artifacts efficiently (up to 100 per request):

curl -X POST "/api/prevalence/bulk" \
  -H "Content-Type: application/json" \
  -d '{
    "artifacts": [
      "d41d8cd98f00b204e9800998ecf8427e",
      "suspicious.example.com",
      "192.168.1.100"
    ],
    "window": "24h"
  }'
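For more than 100 artifacts, the list must be split across requests. An illustrative helper (not part of the product API) that builds request-sized JSON bodies matching the payload shape above:

```python
import json
from typing import List

def bulk_payloads(artifacts: List[str], window: str = "24h",
                  max_per_request: int = 100) -> List[str]:
    """Split a large artifact list into request-sized JSON bodies
    for POST /api/prevalence/bulk (100-artifact limit per request)."""
    return [
        json.dumps({"artifacts": artifacts[i:i + max_per_request],
                    "window": window})
        for i in range(0, len(artifacts), max_per_request)
    ]
```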

Time Windows

Supported time windows for prevalence calculations:

  • 1h - Last 1 hour
  • 24h - Last 24 hours
  • 7d - Last 7 days
  • 30d - Last 30 days (default - aligns with ingest-time columns)

Response Format

Prevalence responses include comprehensive metadata:

{
  "data": {
    "artifact": "suspicious.example.com",
    "artifact_type": "domain",
    "host_count": 2,
    "total_occurrences": 15,
    "first_seen": "2024-01-15T10:30:00Z",
    "last_seen": "2024-01-15T14:22:00Z",
    "is_rare": true
  }
}

Ingest-Time Prevalence Enrichment

nano enriches logs with prevalence scores at ingest time using ClickHouse dictionaries. This captures "prevalence at time of event" rather than current prevalence, enabling accurate historical threat hunting.

How It Works

When logs are ingested, ClickHouse automatically looks up prevalence scores from in-memory dictionaries and stores them directly in the logs table:

Logs insert
  → dictGetOrDefault() in DEFAULT columns (in-memory dictionary lookup, auto-refreshed every 5-10 min)
  → Prevalence scores frozen with the event

The dictionaries are sourced from the *_prevalence_agg tables, filtered to rare artifacts only.
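This enrichment step can be expressed as a DEFAULT column on the logs table. The following is a sketch, not the shipped schema; table and dictionary names follow those used elsewhere in this guide. An empty field maps to the 65535 (N/A) sentinel, and an artifact absent from the dictionary falls through to the dictionary default 9999 ("common"):

```sql
-- Illustrative ingest-time enrichment column
ALTER TABLE nanosiem.logs ADD COLUMN prevalence_file_hash UInt16
    DEFAULT if(
        file_hash = '',
        65535,  -- N/A: field is empty
        dictGetOrDefault('nanosiem.hash_prevalence_dict', 'host_count',
                         lower(file_hash), toUInt16(9999))  -- 9999 = common
    );
```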

Prevalence Fields

The following fields are automatically populated at ingest time:

| Field                   | Type   | Description                                        |
|-------------------------|--------|----------------------------------------------------|
| prevalence_file_hash    | UInt16 | Host count for file_hash                           |
| prevalence_process_hash | UInt16 | Host count for process_hash                        |
| prevalence_dest_domain  | UInt16 | Host count for dest_host (domains only, not IPs)   |
| prevalence_dest_ip      | UInt16 | Host count for dest_ip (public IPs only)           |
| prevalence_min          | UInt16 | Minimum host count across all fields (for filtering) |

Value interpretation:

  • 1-9998: Actual host count (lower = rarer)
  • 9999: Common (artifact seen on 1000+ hosts, not tracked individually)
  • 65535: N/A (field is empty or not applicable)
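Decoding these values is straightforward; a minimal sketch (the function name is hypothetical):

```python
COMMON_SENTINEL = 9999   # seen on 1000+ hosts, not tracked individually
NA_SENTINEL = 65535      # field is empty or not applicable

def describe_prevalence(value: int) -> str:
    """Translate a prevalence field value into a human-readable label."""
    if value == NA_SENTINEL:
        return "n/a"
    if value == COMMON_SENTINEL:
        return "common"
    return f"{value} host(s)"
```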

Querying Prevalence Fields

These fields can be used directly in search queries:

# Find events with rare process hashes (seen on fewer than 5 hosts)
sourcetype = "sysmon" AND prevalence_process_hash < 5

# Find events with any rare artifact
prevalence_min < 10

# Combine with other filters
dest_ip != "" AND prevalence_dest_ip < 20 | table timestamp, dest_ip, prevalence_dest_ip

Prevalence Slider

The search interface includes a Prevalence Filter slider that filters results by prevalence_min:

  • Location: Next to the search button in the search bar
  • Range: 1-100+ (left = very rare, right = show all)
  • Presets: Very Rare (fewer than 5), Rare (fewer than 10), Uncommon (fewer than 50), All

When the slider is active:

  1. The query automatically appends | where prevalence_min < X
  2. Results are filtered to only show events with rare artifacts
  3. The slider value is persisted in the URL for shareable links
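The rewrite the slider performs can be sketched as a simple string append. Illustrative only; the helper name is hypothetical and the real UI does this client-side:

```python
from typing import Optional

def apply_prevalence_slider(query: str,
                            max_prevalence: Optional[int]) -> str:
    """Append the slider's filter stage when the slider is active."""
    if max_prevalence is None:  # slider at "All": no filtering
        return query
    return f"{query} | where prevalence_min < {max_prevalence}"
```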

Dictionary Architecture

Prevalence dictionaries are stored in ClickHouse memory for fast lookups:

-- Example: Hash prevalence dictionary
CREATE DICTIONARY nanosiem.hash_prevalence_dict
(
    file_hash String,
    host_count UInt16 DEFAULT 9999
)
PRIMARY KEY file_hash
SOURCE(CLICKHOUSE(
    QUERY 'SELECT file_hash, toUInt16(least(9998, host_count)) as host_count
           FROM (
               SELECT lower(file_hash) as file_hash, uniqMerge(host_count) as host_count
               FROM nanosiem.hash_prevalence_agg
               WHERE time_bucket >= now() - INTERVAL 30 DAY
               GROUP BY file_hash
           ) WHERE host_count < 1000'
))
LIFETIME(MIN 300 MAX 600)  -- Refresh every 5-10 minutes
LAYOUT(SPARSE_HASHED());

Key design decisions:

  • Only artifacts seen by fewer than 1000 hosts are tracked (others return default 9999 = "common")
  • 30-day lookback window matches the |prevalence command default
  • Dictionary refreshes every 5-10 minutes for near-real-time data

Memory usage: ~100-200MB total for all dictionaries (only rare artifacts are loaded)

Enrichment Integration

Search Results

Prevalence data is automatically displayed in search results:

  • Prevalence Fields: Shown in emerald/green color for easy identification
  • Score Display: Host counts are shown; the 65535 (N/A) sentinel is hidden automatically
  • Tooltips: Hover for field description

Detection Rules

Use prevalence fields in detection logic:

# Example: Detect rare process execution (using pre-computed columns)
name: "Rare Process Execution"
query: |
  sourcetype = "sysmon" AND event_id = 1
  | where prevalence_process_hash < 5

Or use the prevalence command for temporal filtering:

# Example: Detect newly seen rare domains
name: "New Rare Domain Connection"
query: |
  dest_host != ""
  | prevalence domain_prevalence < 10 domain_first_seen > now() - 24h

Alert Context

Prevalence information is included in alert context to help analysts assess threat severity.

Visualization

Prevalence Dashboard

The dedicated prevalence dashboard provides:

  • Rare Artifacts Table: Real-time list of artifacts below the rarity threshold
  • New Artifacts Table: Recently observed artifacts (first seen in last 24h)
  • Statistics Cards: Summary metrics and counts
  • Export Functionality: CSV/JSON export of prevalence data

Scatter Plot Analysis

Interactive scatter plots show prevalence patterns:

  • X-axis: First seen timestamp
  • Y-axis: Host count (prevalence)
  • Color Coding: Rarity classification
  • Filtering: By artifact type and time window
  • Selection: Click artifacts to filter search results

Search Integration

Prevalence panels in search results show:

  • Artifact Extraction: Automatic identification of hashes, domains, and IPs
  • Prevalence Overlay: Visual indicators on search results
  • Drill-down: Click prevalence badges to filter by artifact

Configuration

Settings Management

Prevalence tracking is configurable through the settings API:

# Get current settings
curl -X GET "/api/settings/prevalence"

# Update rarity threshold
curl -X PUT "/api/settings/prevalence" \
  -H "Content-Type: application/json" \
  -d '{"rarity_threshold": 5}'

Configuration Options

| Setting                | Description                                   | Default | Range   |
|------------------------|-----------------------------------------------|---------|---------|
| rarity_threshold       | Host count below which artifacts are "rare"   | 3       | 1-1000  |
| enable_hash_tracking   | Enable file hash prevalence tracking          | true    | boolean |
| enable_domain_tracking | Enable domain prevalence tracking             | true    | boolean |
| enable_ip_tracking     | Enable IP address prevalence tracking         | true    | boolean |
| retention_days         | Days to retain prevalence data                | 90      | 1-365   |
| cache_ttl_seconds      | Cache TTL for prevalence queries              | 60      | 0-3600  |

Hot Reload

Configuration changes are applied immediately without service restart. The system automatically:

  • Updates rarity thresholds for new queries
  • Enables/disables tracking for new log entries
  • Adjusts cache TTL for subsequent requests
  • Applies retention policies on next cleanup cycle

Performance Considerations

Ingestion Impact

Prevalence tracking adds minimal overhead to log ingestion:

  • Materialized Views: Process data during insert, not query time
  • Incremental Updates: Only new data triggers aggregation
  • Efficient Storage: AggregatingMergeTree optimizes for aggregation queries

Query Performance

Prevalence queries are optimized for speed:

  • Bloom Filters: Fast artifact lookups using bloom filter indexes
  • Partitioning: Monthly partitions reduce scan scope
  • Caching: In-memory LRU cache with configurable TTL
  • Bulk Operations: Single queries for multiple artifacts

Memory Usage

The system uses memory-efficient data structures:

  • HyperLogLog: Approximate unique counting with ±2% error
  • Compression: ClickHouse compression reduces memory footprint
  • Cache Limits: Configurable cache size prevents memory exhaustion

Use Cases

Threat Hunting

Prevalence data enables effective threat hunting:

  1. Rare Hash Analysis: Identify uncommon executables that may be malware
  2. Domain Reputation: Spot connections to newly registered or rare domains
  3. Lateral Movement: Detect rare tools used for privilege escalation
  4. Data Exfiltration: Identify connections to unusual external IPs

Incident Response

During incident response, prevalence helps:

  1. Scope Assessment: Determine how widespread an artifact is
  2. Timeline Analysis: Understand when artifacts first appeared
  3. Impact Evaluation: Count affected hosts and occurrences
  4. IOC Validation: Verify if indicators are truly malicious or common

Security Operations

For daily security operations:

  1. Alert Triage: Prioritize alerts involving rare artifacts
  2. False Positive Reduction: Filter out common, benign artifacts
  3. Baseline Establishment: Understand normal prevalence patterns
  4. Anomaly Detection: Spot deviations from expected prevalence

Best Practices

Threshold Tuning

  • Start with default threshold (3 hosts) and adjust based on environment size
  • Larger environments may need higher thresholds to reduce noise
  • Monitor rare artifact counts and adjust to maintain manageable volumes

Data Quality

  • Ensure consistent field mapping across log sources
  • Validate hash formats and domain names in parsing rules
  • Use normalized field names (UDM) for consistent prevalence tracking

Performance Optimization

  • Monitor ClickHouse resource usage during high ingestion periods
  • Adjust cache TTL based on query patterns and data freshness needs
  • Use bulk APIs for multiple artifact lookups to reduce query overhead

Integration Patterns

  • Include prevalence checks in automated analysis workflows
  • Export prevalence data for integration with external threat intelligence
  • Use prevalence scores in risk-based alerting and scoring systems

Troubleshooting

Common Issues

No prevalence data appearing:

  • Verify ClickHouse is running and accessible
  • Check materialized view creation and data flow
  • Ensure log fields contain valid hashes/domains/IPs

High memory usage:

  • Reduce cache TTL or size limits
  • Check for runaway queries or bulk operations
  • Monitor ClickHouse memory usage and tune accordingly

Slow prevalence queries:

  • Verify bloom filter indexes are created
  • Check time window scope (shorter windows are faster)
  • Monitor ClickHouse query performance metrics

Monitoring

Key metrics to monitor:

  • Ingestion Rate: Logs processed per second by materialized views
  • Cache Hit Rate: Percentage of queries served from cache
  • Query Latency: Response time for prevalence API calls
  • Storage Growth: Disk usage of prevalence aggregation tables

The prevalence tracking system provides powerful capabilities for identifying rare and suspicious artifacts in your environment. By understanding normal prevalence patterns, security teams can quickly spot anomalies that may indicate threats or malicious activity.
