nano SIEM
User Guide

Data Ingestion

nano provides flexible log ingestion capabilities with support for multiple data sources, real-time processing, and automatic field normalization. This guide covers how ingestion works and provides step-by-step instructions for onboarding new data feeds.

How Ingestion Works

Architecture Overview

Processing Pipeline

  1. Authentication: Bearer token validation for secure ingestion
  2. Rate Limiting: Per-client throttling to prevent abuse
  3. Content Detection: Automatic format detection (JSON, raw text, etc.)
  4. Source Type Routing: Route logs to appropriate parsers based on source type
  5. Field Normalization: Transform logs into Unified Data Model (UDM) format
  6. Enrichment: Add geolocation, ASN, and threat intelligence data
  7. Storage: Store in ClickHouse for analytics and PostgreSQL for metadata
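
In code terms, the pipeline is a sequence of stage functions applied to each event. The toy sketch below implements only content detection (step 3) and field normalization (step 5); the stage bodies and UDM field names are illustrative stand-ins, not nano's actual implementation, which runs inside Vector:

```python
import json

# Illustrative pipeline sketch: each stage takes an event dict and returns it
# annotated. Only stages 3 and 5 are modeled; names are hypothetical.

def detect_format(event):
    # Stage 3: try JSON first, fall back to raw text.
    try:
        event["parsed"] = json.loads(event["raw"])
        event["format"] = "json"
    except json.JSONDecodeError:
        event["format"] = "raw"
    return event

def normalize(event):
    # Stage 5: map parsed fields into a UDM-style structure (field names illustrative).
    parsed = event.get("parsed", {})
    event["udm"] = {"principal": parsed.get("user"), "summary": parsed.get("message")}
    return event

pipeline = [detect_format, normalize]  # stages 1-2, 4, 6-7 omitted for brevity

event = {"raw": '{"user": "john", "message": "User login"}'}
for stage in pipeline:
    event = stage(event)

print(event["format"])            # json
print(event["udm"]["principal"])  # john
```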

Internal Vector Pipeline

The internal Vector pipeline combines static and dynamic components. Static components ship with the base ConfigMap on each Vector pod; dynamic components are generated by the API when log sources are deployed and synced to pods via S3/GCS.

Supported Data Sources

Sources that support source_type identification natively:

| Source Type | Description | How source_type is defined | Use Case |
| --- | --- | --- | --- |
| HTTP/Webhooks | Direct HTTP POST ingestion | X-Source-Type header | Applications, webhooks, log shippers |
| Vector Native | Vector-to-Vector forwarding | .source_type field in event | On-prem aggregators |
| Kafka | Apache Kafka consumer | Feed configuration | High-volume streaming data (setup guide) |
| GCP Pub/Sub | Google Cloud Pub/Sub | Feed configuration | Google Cloud logs (setup guide) |
| AWS S3/SQS | S3 via SQS notifications | Feed configuration | AWS CloudTrail, VPC Flow Logs (setup guide) |

Sources that require a Vector aggregator (cannot connect directly):

| Source Type | Why aggregator needed | Documentation |
| --- | --- | --- |
| Syslog | No source_type in protocol | Vector Aggregator guide |
| OpenTelemetry | No source_type in OTLP | Vector Aggregator guide |
| Fluent | Tags don't map to source_type | Vector Aggregator guide |
| File | No metadata in raw files | Vector Aggregator guide |

Feed Onboarding Wizard

nano provides a guided wizard to help you onboard new data sources. The wizard offers three different paths based on your needs:

Option 1: AI Parser Generation from Sample Logs

Best for: New log formats or when you have sample data

  1. Navigate to Feeds

    • Go to Feeds → New Feed or use the Log Source Wizard
    • Select "I have sample logs"
  2. Provide Sample Data

    • Paste 3-5 representative log entries (one per line)
    • The AI will analyze the format and structure
    • Supports JSON, syslog, CEF, and custom formats
  3. AI Parser Generation

    • The system automatically detects the log format
    • Generates VRL (Vector Remap Language) parser code
    • Extracts fields into UDM format
    • Validates syntax and tests against samples
  4. Configure Metadata

    • Set feed name (e.g., aws_cloudtrail)
    • Choose category (Network, Cloud, Security, etc.)
    • Add vendor and product information
    • Select icon and color for visualization
  5. Publish and Test

    • Publish the parser to create a versioned snapshot and deploy to Vector
    • Test with live data
    • Monitor ingestion metrics on the overview tab's event volume chart

Option 2: Existing Data Sampling

Best for: When you already have data in nano

  1. Select Source Type

    • Choose "Sample from existing data"
    • Browse discovered source types from your data
    • Select the source type you want to create a parser for
  2. Auto-Sample Generation

    • System fetches recent log samples automatically
    • No need to manually copy/paste logs
    • Ensures samples are representative of current data
  3. Follow Standard Flow

    • Continue with AI parser generation
    • Configure metadata and publish

Option 3: Manual Configuration

Best for: Advanced users or custom requirements

  1. Choose Source Type

    • Select "Manual setup"
    • Pick from available source types (HTTP, Kafka, etc.)
  2. Write Custom Parser

    • Write VRL code manually
    • Full control over field extraction and transformation
    • Access to all Vector functions and capabilities
  3. Configure Connection

    • Set up source-specific configuration
    • Add cloud credentials if needed
    • Configure routing and matching rules

Cloud Credentials Setup

For cloud-based ingestion (Kafka, GCP Pub/Sub, AWS S3), you'll need to configure credentials securely. For complete end-to-end guides including cloud provider infrastructure setup and IAM permissions, see AWS S3 via SQS and GCP Pub/Sub.

AWS S3/SQS Credentials

  1. Navigate to Credentials

    • Go to Settings → Cloud Credentials
    • Click Add Credential
  2. Configure AWS Access

    Provider: AWS S3
    Name: Production AWS Logs
    Region: us-east-1
    Access Key ID: AKIA...
    Secret Access Key: [your secret key]
  3. IAM Permissions Required

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:sqs:*:*:your-log-queue",
            "arn:aws:s3:::your-log-bucket/*"
          ]
        }
      ]
    }

GCP Pub/Sub Credentials

  1. Create Service Account

    • In Google Cloud Console, create a service account
    • Grant Pub/Sub Subscriber role
    • Download JSON key file
  2. Add to nano

    Provider: GCP Pub/Sub
    Name: GCP Security Logs
    Service Account JSON: [paste full JSON content]
  3. Required Permissions

    • pubsub.subscriptions.consume
    • pubsub.messages.ack

Kafka Credentials (Optional)

  1. SASL Authentication

    Provider: Kafka
    Name: Production Kafka
    SASL Mechanism: SCRAM-SHA-256
    Username: [kafka username]
    Password: [kafka password]
    TLS Enabled: Yes
  2. Unauthenticated Kafka

    • Leave SASL fields empty for open Kafka clusters
    • Only enable TLS if your cluster requires it

Publishing & Version Management

Once you've created a log source and written a parser, nano uses a publish workflow to manage changes safely. This gives you version history with rollback capability.

How It Works

Parser changes follow a draft → publish → deploy lifecycle:

  1. Edit — Changes are saved to a working draft. The draft does not affect live parsing.
  2. Publish — Creates an immutable versioned snapshot and immediately deploys to Vector. The previous version is preserved in history.
  3. Rollback — Revert to any previous version. This updates your working draft, which you can then review and publish.
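
The lifecycle above can be modeled as a small state machine. This is a toy model for intuition only; the class, methods, and fields are invented for illustration and are not nano's API:

```python
# Toy model of the draft → publish → rollback lifecycle (names are hypothetical).

class ParserVersions:
    def __init__(self, code=""):
        self.draft = code     # working copy; edits land here
        self.history = []     # immutable published snapshots
        self.active = None    # index of the currently deployed version

    def edit(self, code):
        self.draft = code     # does not affect live parsing

    def publish(self):
        self.history.append(self.draft)       # create versioned snapshot
        self.active = len(self.history) - 1   # deploy immediately

    def rollback(self, version):
        self.draft = self.history[version]    # load into draft for review

    def has_unpublished_changes(self):
        return self.active is None or self.draft != self.history[self.active]

p = ParserVersions(". = parse_json!(.message)")
p.publish()                                            # v0 deployed
p.edit('. = parse_json!(.message)\n.severity = "high"')
assert p.has_unpublished_changes()                     # banner would appear
p.publish()                                            # v1 deployed
p.rollback(0)                                          # v0 back into draft
assert p.has_unpublished_changes()                     # publish again to deploy
```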

Draft Changes Banner

When your working copy differs from the currently deployed version, an Unpublished changes banner appears at the top of the log source detail page. From here you can:

  • Publish — Create a new version and deploy it
  • Discard Draft — Reset your working copy to match the active deployed version

Version History

The Versions tab shows all published versions with:

  • Version number and creation date
  • Change type — Published, Reverted, or initial creation
  • Active version indicator showing which version is currently deployed
  • Expandable diff view — Click any version to see the full VRL code and a line-by-line diff against the previous version
  • Revert — One-click to load a previous version's parser into your draft for review before re-publishing

Event Volume Chart

The Overview tab includes an Event Volume (24h) area chart showing hourly event counts for the log source. Use this to:

  • Verify data is flowing after publishing a new parser
  • Spot ingestion gaps or anomalies
  • Confirm expected volume patterns

Event Deduplication

Deduplication is always on and requires no configuration. nano automatically removes duplicate events that can occur from:

  • Retried HTTP POSTs (e.g., client timeout + server-side success)
  • Redundant forwarders sending the same log
  • Agent reconnect double-sends

Events are matched on message + source_type + timestamp — if all three are identical within a short window, the duplicate is dropped. This happens in the Vector pipeline between the normalize and prepare_output stages.
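
The matching rule can be sketched as follows. The 60-second window is an assumed value for illustration; the guide does not specify the actual window length:

```python
# Sketch of the dedup match: message + source_type + timestamp within a window.
seen = {}            # dedup key -> first-seen time (epoch seconds)
WINDOW_SECONDS = 60  # assumed value, not a documented nano setting

def is_duplicate(event, now):
    key = (event["message"], event["source_type"], event["timestamp"])
    first = seen.get(key)
    if first is not None and now - first <= WINDOW_SECONDS:
        return True   # same key within the window: drop the duplicate
    seen[key] = now
    return False

e = {"message": "User login", "source_type": "my_app",
     "timestamp": "2024-01-01T12:00:00Z"}
assert not is_duplicate(e, now=0)    # first copy is kept
assert is_duplicate(e, now=5)        # retried POST within window is dropped
assert not is_duplicate(dict(e, message="User logout"), now=5)  # different key kept
```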

Event Sampling

Event sampling is a per-log-source setting that keeps a configurable percentage of events, reducing storage costs for high-volume, low-value sources.

When to use

  • High-volume firewall allow logs where 100% coverage isn't needed for detection
  • Health check and heartbeat logs
  • Debug-level application logs
  • Any source generating millions of events/day where statistical sampling is sufficient

How to configure

  1. Navigate to Feeds and select the log source
  2. Go to the Configuration tab and click Edit Configuration
  3. Scroll to the Event Sampling section
  4. Set the sampling percentage (e.g., 10 = keep 10% of events)
  5. Set an exclude condition (VRL expression) for events that should always be kept
  6. Save and Publish to apply changes

Example: Firewall allow logs

For a firewall source generating 50M events/day where most are routine allow traffic:

  • Sampling percentage: 10% (keeps ~5M events/day)
  • Exclude condition: .action != "allow" — all deny, drop, and reset actions are always kept at 100%

This gives you statistical visibility into allow traffic while preserving every security-relevant event.

How it works

When sampling is enabled for a log source, a Vector sample transform is inserted between the parser output and the combiner. The transform randomly keeps events at the configured ratio. Events matching the exclude condition bypass sampling entirely.
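
Conceptually, the transform behaves like the sketch below. The exclude check is written as a Python stand-in for the VRL expression `.action != "allow"` from the firewall example above:

```python
import random

# Sketch of the sample transform: keep ~N% of events, but events matching the
# exclude condition bypass sampling entirely.

def keep_event(event, percentage, rng=random.random):
    if event.get("action") != "allow":    # exclude condition: always keep
        return True
    return rng() < percentage / 100.0     # keep ~percentage% of the rest

assert keep_event({"action": "deny"}, 10)   # security-relevant events always kept

rng = random.Random(0)                      # seeded for reproducibility
kept = sum(keep_event({"action": "allow"}, 10, rng=rng.random)
           for _ in range(10000))
print(kept)  # roughly 1000 of 10000 allow events survive 10% sampling
```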

Step-by-Step Onboarding Examples

Example 1: AWS CloudTrail via S3

  1. Set up AWS Infrastructure

    # Create SQS queue for S3 notifications
    aws sqs create-queue --queue-name cloudtrail-logs
    
    # Configure S3 bucket to send notifications to SQS
    aws s3api put-bucket-notification-configuration \
      --bucket your-cloudtrail-bucket \
      --notification-configuration file://notification.json
  2. Add AWS Credentials in nano

    • Provider: AWS S3
    • Region: us-east-1
    • Access Key ID and Secret Key
  3. Create Log Source

    • Use wizard with sample CloudTrail JSON
    • AI will detect AWS CloudTrail format
    • Configure source type as aws_s3
    • Set SQS queue URL in source configuration
  4. Publish and Monitor

    • Publish the parser to create a version and deploy to Vector
    • Monitor ingestion on the overview tab's event volume chart
    • Verify data appears in Search

Example 2: Application Logs via HTTP

  1. Get Authentication Token

    # Default token (change in production)
    export VECTOR_AUTH_TOKEN="your-secure-token"
  2. Send Sample Logs

    curl -X POST http://your-nanosiem:8080/ \
      -H "Authorization: Bearer your-secure-token" \
      -H "X-Source-Type: my_app" \
      -H "Content-Type: application/json" \
      -d '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login", "user": "john"}'
  3. Create Feed via Wizard

    • Use "Sample from existing data" option
    • Select my_app source type
    • AI generates parser for your JSON format
    • Publish and start receiving structured data

Example 3: Syslog from Network Devices

  1. Deploy Vector Aggregator On-Premises

    • Install Vector on a server in your network
    • Configure it to receive syslog and tag with source_type
    • Forward to nano via Vector-to-Vector protocol
  2. Point Devices to Vector Aggregator

    # Example Cisco configuration
    logging host 192.168.1.100  # Vector aggregator IP
    logging facility local0
    logging severity informational
  3. Create Feed in nano

    • Once logs flow through with source_type set, create a feed for each type
    • Use sample data in wizard to build parsers
    • Publish and monitor
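
For step 1, the on-prem aggregator configuration might look roughly like the following sketch. The listen address, the `source_type` value, and the sink address are assumptions for illustration; see the Vector Aggregator guide for the supported setup:

```toml
# /etc/vector/vector.toml — illustrative aggregator sketch (values are assumptions)

[sources.syslog_in]
type = "syslog"
address = "0.0.0.0:514"
mode = "udp"

[transforms.tag_source_type]
type = "remap"
inputs = ["syslog_in"]
source = '''
.source_type = "cisco_ios"   # tag so nano can route to the right parser
'''

[sinks.to_nano]
type = "vector"
inputs = ["tag_source_type"]
address = "your-nanosiem:9000"
```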

Ingestion Monitoring

Health Metrics

Monitor ingestion health in the Feeds dashboard:

  • Event Count: Total events received
  • Ingestion Rate: Events per second/minute
  • Parse Errors: Failed parsing attempts
  • Last Event: Timestamp of most recent log
  • Health Status: Healthy, Stale, No Data, Error
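
One plausible way such a status could be derived from these metrics is sketched below. The thresholds are invented for illustration and are not nano's documented values:

```python
# Hypothetical derivation of Health Status from the metrics above.
# All thresholds are illustrative assumptions.

def health_status(event_count, parse_errors, seconds_since_last_event):
    if event_count == 0:
        return "No Data"
    if parse_errors >= event_count * 0.5:
        return "Error"    # assumed: majority of events failing to parse
    if seconds_since_last_event > 3600:
        return "Stale"    # assumed: no events for over an hour
    return "Healthy"

assert health_status(0, 0, 0) == "No Data"
assert health_status(1000, 0, 120) == "Healthy"
assert health_status(1000, 0, 7200) == "Stale"
assert health_status(1000, 900, 60) == "Error"
```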

Troubleshooting Common Issues

No Data Appearing

  1. Check Authentication

    # Test with curl
    curl -X POST http://your-nanosiem:8080/ \
      -H "Authorization: Bearer wrong-token" \
      -d "test"
    # Should return 401 if auth is working
  2. Verify Source Type Routing

    • Ensure X-Source-Type header matches feed configuration
    • Check Vector logs for routing decisions
  3. Check Parser Deployment

    • Go to Log Sources → [Your Source] → Overview tab
    • Check deployment status and event volume chart
    • If changes are pending, publish to deploy

Parse Errors

  1. Review Error Logs

    • Check System → Ingestion Errors
    • Look for VRL syntax or field extraction issues
  2. Test Parser

    • Use Test function in parser editor
    • Validate against actual log samples
    • Adjust VRL code as needed

Performance Issues

  1. Check Rate Limits

    • Default: 1000 events/second per client
    • Increase in Vector configuration if needed
  2. Monitor Resource Usage

    • ClickHouse disk space and memory
    • Vector CPU and memory usage
    • Network bandwidth

Best Practices

Security

  • Use Strong Authentication Tokens: Generate random 32+ character tokens
  • Enable TLS: Use HTTPS for production deployments
  • Rotate Credentials: Regularly update cloud credentials
  • Network Segmentation: Restrict ingestion endpoints to trusted networks

Performance

  • Batch Ingestion: Send multiple logs in single requests when possible
  • Use Appropriate Source Types: Choose the right ingestion method for your volume
  • Monitor Parsing Efficiency: Complex VRL can impact performance
  • Optimize Field Extraction: Only extract fields you need for detection and search
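
The batch-ingestion tip can be illustrated by packing many events into one newline-delimited JSON body. The endpoint and headers mirror the curl examples earlier in this guide; whether your nano HTTP source accepts NDJSON batches is an assumption to verify against your deployment:

```python
import json

# Build one NDJSON body carrying multiple events for a single HTTP POST.
events = [
    {"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login"},
    {"timestamp": "2024-01-01T12:00:01Z", "level": "INFO", "message": "User logout"},
]
body = "\n".join(json.dumps(e) for e in events)

headers = {
    "Authorization": "Bearer your-secure-token",
    "X-Source-Type": "my_app",
    "Content-Type": "application/x-ndjson",  # assumed content type for batches
}
# POST `body` with `headers` to http://your-nanosiem:8080/ using your HTTP client.
print(body.count("\n") + 1)  # 2 events in a single request
```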

Data Quality

  • Validate Sample Data: Ensure samples represent all log variations
  • Test Edge Cases: Include malformed or unusual log entries in testing
  • Monitor Parse Success Rate: Aim for >95% successful parsing
  • Regular Parser Updates: Update parsers when log formats change
  • Use Version History: Review diffs between versions to track parser evolution and simplify rollbacks

Operational

  • Document Feed Configurations: Maintain documentation for each data source
  • Set Up Alerting: Monitor for ingestion failures or data gaps
  • Regular Health Checks: Review feed health metrics weekly
  • Capacity Planning: Monitor growth trends and plan for scaling

Next Steps

After setting up ingestion:

  1. Configure Enrichment: Add IP geolocation and threat intelligence
  2. Create Detections: Build rules to identify security events
  3. Set Up Dashboards: Visualize your ingested data
  4. Enable Alerting: Get notified of critical security events

For more advanced configuration options, see:
