Data Ingestion
nano provides flexible log ingestion capabilities with support for multiple data sources, real-time processing, and automatic field normalization. This guide covers how ingestion works and provides step-by-step instructions for onboarding new data feeds.
Source Type Required: Every log event must have a source_type defined for nano to route it to the correct parser. See Supported Data Sources for which ingestion methods support this natively and which require a Vector aggregator.
How Ingestion Works
Architecture Overview
Processing Pipeline
- Authentication: Bearer token validation for secure ingestion
- Rate Limiting: Per-client throttling to prevent abuse
- Content Detection: Automatic format detection (JSON, raw text, etc.)
- Source Type Routing: Route logs to appropriate parsers based on source type
- Field Normalization: Transform logs into Unified Data Model (UDM) format
- Enrichment: Add geolocation, ASN, and threat intelligence data
- Storage: Store in ClickHouse for analytics and PostgreSQL for metadata
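As a hedged illustration of the first few stages (the real pipeline runs inside Vector; the token value and parser registry here are assumptions, not nano's actual internals), the stages can be modeled as functions applied in order:

```python
import json

def authenticate(event, token):
    # Authentication: reject events without a valid bearer token.
    if token != "your-secure-token":
        raise PermissionError("invalid bearer token")
    return event

def detect_content(event):
    # Content Detection: try JSON first, fall back to raw text.
    try:
        event["parsed"] = json.loads(event["message"])
        event["format"] = "json"
    except ValueError:
        event["format"] = "raw"
    return event

def route(event):
    # Source Type Routing: pick a parser based on source_type.
    parsers = {"my_app": "my_app_parser"}  # hypothetical registry
    event["parser"] = parsers.get(event["source_type"], "fallback")
    return event

event = {"message": '{"level": "INFO"}', "source_type": "my_app"}
event = authenticate(event, "your-secure-token")
event = detect_content(event)
event = route(event)
```

Later stages (normalization, enrichment, storage) follow the same pattern: each transform takes an event and returns an enriched copy for the next stage.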
Internal Vector Pipeline
The diagram below shows the internal Vector pipeline components. Static components ship with the base ConfigMap on each Vector pod. Dynamic components are generated by the API when log sources are deployed and synced to pods via S3/GCS.
Supported Data Sources
Sources that support source_type identification natively:
| Source Type | Description | How source_type is defined | Use Case |
|---|---|---|---|
| HTTP/Webhooks | Direct HTTP POST ingestion | X-Source-Type header | Applications, webhooks, log shippers |
| Vector Native | Vector-to-Vector forwarding | .source_type field in event | On-prem aggregators |
| Kafka | Apache Kafka consumer | Feed configuration | High-volume streaming data (setup guide) |
| GCP Pub/Sub | Google Cloud Pub/Sub | Feed configuration | Google Cloud logs (setup guide) |
| AWS S3/SQS | S3 via SQS notifications | Feed configuration | AWS CloudTrail, VPC Flow Logs (setup guide) |
Sources that require a Vector aggregator (cannot connect directly):
| Source Type | Why aggregator needed | Documentation |
|---|---|---|
| Syslog | No source_type in protocol | Vector Aggregator guide |
| OpenTelemetry | No source_type in OTLP | Vector Aggregator guide |
| Fluent | Tags don't map to source_type | Vector Aggregator guide |
| File | No metadata in raw files | Vector Aggregator guide |
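For HTTP/webhook ingestion, source_type travels in the X-Source-Type header. A minimal Python sketch of building such a request (the endpoint and token are placeholder values, and the request is constructed but not sent):

```python
import json
import urllib.request

# Placeholder endpoint and token; substitute your deployment's values.
url = "http://your-nanosiem:8080/"
event = {"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login"}

req = urllib.request.Request(
    url,
    data=json.dumps(event).encode("utf-8"),
    headers={
        "Authorization": "Bearer your-secure-token",
        "X-Source-Type": "my_app",  # routes the event to the matching parser
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```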
Feed Onboarding Wizard
nano provides a guided wizard to help you onboard new data sources. The wizard offers three different paths based on your needs:
Option 1: Sample-Based Setup (Recommended)
Best for: New log formats or when you have sample data
1. Navigate to Feeds
   - Go to Feeds → New Feed or use the Log Source Wizard
   - Select "I have sample logs"
2. Provide Sample Data
   - Paste 3-5 representative log entries (one per line)
   - The AI will analyze the format and structure
   - Supports JSON, syslog, CEF, and custom formats
3. AI Parser Generation
   - The system automatically detects the log format
   - Generates VRL (Vector Remap Language) parser code
   - Extracts fields into UDM format
   - Validates syntax and tests against samples
4. Configure Metadata
   - Set feed name (e.g., `aws_cloudtrail`)
   - Choose category (Network, Cloud, Security, etc.)
   - Add vendor and product information
   - Select icon and color for visualization
5. Publish and Test
   - Publish the parser to create a versioned snapshot and deploy to Vector
   - Test with live data
   - Monitor ingestion metrics on the overview tab's event volume chart
Option 2: Existing Data Sampling
Best for: When you already have data in nano
1. Select Source Type
   - Choose "Sample from existing data"
   - Browse discovered source types from your data
   - Select the source type you want to create a parser for
2. Auto-Sample Generation
   - System fetches recent log samples automatically
   - No need to manually copy/paste logs
   - Ensures samples are representative of current data
3. Follow Standard Flow
   - Continue with AI parser generation
   - Configure metadata and publish
Option 3: Manual Configuration
Best for: Advanced users or custom requirements
1. Choose Source Type
   - Select "Manual setup"
   - Pick from available source types (HTTP, Kafka, etc.)
2. Write Custom Parser
   - Write VRL code manually
   - Full control over field extraction and transformation
   - Access to all Vector functions and capabilities
3. Configure Connection
   - Set up source-specific configuration
   - Add cloud credentials if needed
   - Configure routing and matching rules
Cloud Credentials Setup
For cloud-based ingestion (Kafka, GCP Pub/Sub, AWS S3), you'll need to configure credentials securely. For complete end-to-end guides including cloud provider infrastructure setup and IAM permissions, see AWS S3 via SQS and GCP Pub/Sub.
AWS S3/SQS Credentials
1. Navigate to Credentials
   - Go to Settings → Cloud Credentials
   - Click Add Credential
2. Configure AWS Access

   ```
   Provider: AWS S3
   Name: Production AWS Logs
   Region: us-east-1
   Access Key ID: AKIA...
   Secret Access Key: [your secret key]
   ```

3. IAM Permissions Required

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "sqs:ReceiveMessage",
           "sqs:DeleteMessage",
           "s3:GetObject"
         ],
         "Resource": [
           "arn:aws:sqs:*:*:your-log-queue",
           "arn:aws:s3:::your-log-bucket/*"
         ]
       }
     ]
   }
   ```
GCP Pub/Sub Credentials
1. Create Service Account
   - In Google Cloud Console, create a service account
   - Grant the Pub/Sub Subscriber role
   - Download the JSON key file
2. Add to nano

   ```
   Provider: GCP Pub/Sub
   Name: GCP Security Logs
   Service Account JSON: [paste full JSON content]
   ```

3. Required Permissions

   ```
   pubsub.subscriptions.consume
   pubsub.messages.ack
   ```
Kafka Credentials (Optional)
1. SASL Authentication

   ```
   Provider: Kafka
   Name: Production Kafka
   SASL Mechanism: SCRAM-SHA-256
   Username: [kafka username]
   Password: [kafka password]
   TLS Enabled: Yes
   ```

2. Unauthenticated Kafka
   - Leave SASL fields empty for open Kafka clusters
   - Only enable TLS if your cluster requires it
Publishing & Version Management
Once you've created a log source and written a parser, nano uses a publish workflow to manage changes safely. This gives you version history with rollback capability.
How It Works
Parser changes follow a draft → publish → deploy lifecycle:
- Edit — Changes are saved to a working draft. The draft does not affect live parsing.
- Publish — Creates an immutable versioned snapshot and immediately deploys to Vector. The previous version is preserved in history.
- Rollback — Revert to any previous version. This updates your working draft, which you can then review and publish.
Publish replaces Deploy: You no longer need to deploy parsers manually. Publishing handles both versioning and deployment in a single step.
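The draft → publish → rollback lifecycle can be sketched as a small state model (illustrative only; nano's actual data model and the VRL snippets here are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class ParserVersion:
    number: int
    vrl_code: str  # immutable snapshot of the parser at publish time

@dataclass
class LogSource:
    draft: str = ""
    versions: list = field(default_factory=list)
    active: int = 0  # version number currently deployed to Vector

    def publish(self):
        # Publish: snapshot the draft as a new version and deploy it.
        version = ParserVersion(len(self.versions) + 1, self.draft)
        self.versions.append(version)
        self.active = version.number
        return version

    def rollback(self, number):
        # Rollback: load an old version into the draft; nothing deploys
        # until the draft is reviewed and published again.
        self.draft = self.versions[number - 1].vrl_code

src = LogSource()
src.draft = ". = parse_json!(.message)"
src.publish()  # version 1 becomes active
src.draft = ". = parse_json!(string!(.message))"
src.publish()  # version 2 becomes active
src.rollback(1)  # draft holds v1 code, but v2 is still deployed
```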
Draft Changes Banner
When your working copy differs from the currently deployed version, an Unpublished changes banner appears at the top of the log source detail page. From here you can:
- Publish — Create a new version and deploy it
- Discard Draft — Reset your working copy to match the active deployed version
Version History
The Versions tab shows all published versions with:
- Version number and creation date
- Change type — Published, Reverted, or initial creation
- Active version indicator showing which version is currently deployed
- Expandable diff view — Click any version to see the full VRL code and a line-by-line diff against the previous version
- Revert — One-click to load a previous version's parser into your draft for review before re-publishing
Event Volume Chart
The Overview tab includes an Event Volume (24h) area chart showing hourly event counts for the log source. Use this to:
- Verify data is flowing after publishing a new parser
- Spot ingestion gaps or anomalies
- Confirm expected volume patterns
Event Deduplication
Deduplication is always on and requires no configuration. nano automatically removes duplicate events that can occur from:
- Retried HTTP POSTs (e.g., client timeout + server-side success)
- Redundant forwarders sending the same log
- Agent reconnect double-sends
Events are matched on message + source_type + timestamp — if all three are identical within a short window, the duplicate is dropped. This happens in the Vector pipeline between the normalize and prepare_output stages.
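That matching rule can be illustrated with a small sketch (the real deduplication happens inside the Vector pipeline; this version keys on the three fields but ignores the time-window bound for simplicity):

```python
def dedup_key(event):
    # Duplicates match on all three fields: message + source_type + timestamp.
    return (event["message"], event["source_type"], event["timestamp"])

def dedupe(events):
    seen, kept = set(), []
    for event in events:
        key = dedup_key(event)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept

events = [
    {"message": "User login", "source_type": "my_app", "timestamp": "2024-01-01T12:00:00Z"},
    {"message": "User login", "source_type": "my_app", "timestamp": "2024-01-01T12:00:00Z"},  # retried POST
    {"message": "User login", "source_type": "other_app", "timestamp": "2024-01-01T12:00:00Z"},
]
deduped = dedupe(events)  # the retried POST is dropped; the other_app event survives
```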
Event Sampling
Event sampling is a per-log-source setting that keeps only a percentage of events, reducing storage costs for high-volume, low-value sources.
When to use
- High-volume firewall allow logs where 100% coverage isn't needed for detection
- Health check and heartbeat logs
- Debug-level application logs
- Any source generating millions of events/day where statistical sampling is sufficient
Do NOT enable sampling on security-critical sources. Authentication logs, deny/block actions, endpoint detection events, and any source used in detection rules should always keep 100% of events. Sampling these sources creates blind spots that attackers can exploit.
How to configure
- Navigate to Feeds and select the log source
- Go to the Configuration tab and click Edit Configuration
- Scroll to the Event Sampling section
- Set the sampling percentage (e.g., 10 = keep 10% of events)
- Set an exclude condition (VRL expression) for events that should always be kept
- Save and Publish to apply changes
Example: Firewall allow logs
For a firewall source generating 50M events/day where most are routine allow traffic:
- Sampling percentage: 10% (keeps ~5M events/day)
- Exclude condition: `.action != "allow"` — all deny, drop, and reset actions are always kept at 100%
This gives you statistical visibility into allow traffic while preserving every security-relevant event.
How it works
When sampling is enabled for a log source, a Vector sample transform is inserted between the parser output and the combiner. The transform randomly keeps events at the configured ratio. Events matching the exclude condition bypass sampling entirely.
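The keep/exclude logic can be sketched in Python (the real transform is Vector's sample transform; the 10% rate and `.action` field mirror the firewall example above):

```python
import random

def should_keep(event, percentage, exclude):
    # Events matching the exclude condition bypass sampling entirely.
    if exclude(event):
        return True
    # Otherwise keep the event with the configured probability.
    return random.random() * 100 < percentage

random.seed(42)  # deterministic for the sake of the example
events = [{"action": "allow"} for _ in range(1000)] + [{"action": "deny"} for _ in range(5)]
kept = [e for e in events if should_keep(e, 10, lambda e: e["action"] != "allow")]

denies_kept = sum(1 for e in kept if e["action"] == "deny")
allows_kept = len(kept) - denies_kept
# Every deny survives; roughly 10% of the 1000 allows do.
```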
Step-by-Step Onboarding Examples
Example 1: AWS CloudTrail via S3
1. Set up AWS Infrastructure

   ```shell
   # Create SQS queue for S3 notifications
   aws sqs create-queue --queue-name cloudtrail-logs

   # Configure S3 bucket to send notifications to SQS
   aws s3api put-bucket-notification-configuration \
     --bucket your-cloudtrail-bucket \
     --notification-configuration file://notification.json
   ```

2. Add AWS Credentials in nano
   - Provider: AWS S3
   - Region: us-east-1
   - Access Key ID and Secret Key
3. Create Log Source
   - Use wizard with sample CloudTrail JSON
   - AI will detect AWS CloudTrail format
   - Configure source type as `aws_s3`
   - Set SQS queue URL in source configuration
4. Publish and Monitor
   - Publish the parser to create a version and deploy to Vector
   - Monitor ingestion on the overview tab's event volume chart
   - Verify data appears in Search
Example 2: Application Logs via HTTP
1. Get Authentication Token

   ```shell
   # Default token (change in production)
   export VECTOR_AUTH_TOKEN="your-secure-token"
   ```

2. Send Sample Logs

   ```shell
   curl -X POST http://your-nanosiem:8080/ \
     -H "Authorization: Bearer your-secure-token" \
     -H "X-Source-Type: my_app" \
     -H "Content-Type: application/json" \
     -d '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login", "user": "john"}'
   ```

3. Create Feed via Wizard
   - Use "Sample from existing data" option
   - Select the `my_app` source type
   - AI generates parser for your JSON format
   - Publish and start receiving structured data
Example 3: Syslog from Network Devices
Syslog requires a Vector aggregator to set source_type before forwarding to nano. See the complete syslog setup guide.
1. Deploy Vector Aggregator On-Premises
   - Install Vector on a server in your network
   - Configure it to receive syslog and tag each event with `source_type`
   - Forward to nano via the Vector-to-Vector protocol
2. Point Devices to Vector Aggregator

   ```
   # Example Cisco configuration
   logging host 192.168.1.100   # Vector aggregator IP
   logging facility local0
   logging severity informational
   ```

3. Create Feed in nano
   - Once logs flow through with `source_type` set, create a feed for each type
   - Use sample data in wizard to build parsers
   - Publish and monitor
Ingestion Monitoring
Health Metrics
Monitor ingestion health in the Feeds dashboard:
- Event Count: Total events received
- Ingestion Rate: Events per second/minute
- Parse Errors: Failed parsing attempts
- Last Event: Timestamp of most recent log
- Health Status: Healthy, Stale, No Data, Error
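As an illustration, a status classifier along these lines could look like the following sketch (the 24-hour staleness threshold is an assumption rather than nano's documented cutoff, and the Error state from parse failures is omitted):

```python
from datetime import datetime, timedelta, timezone

def health_status(last_event, event_count, now):
    # No Data: the feed has never received an event.
    if event_count == 0:
        return "No Data"
    # Stale: events stopped arriving (assumed 24h threshold).
    if now - last_event > timedelta(hours=24):
        return "Stale"
    return "Healthy"

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
fresh = health_status(datetime(2024, 1, 1, 23, tzinfo=timezone.utc), 100, now)
stale = health_status(datetime(2023, 12, 30, tzinfo=timezone.utc), 100, now)
empty = health_status(None, 0, now)
```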
Troubleshooting Common Issues
No Data Appearing
1. Check Authentication

   ```shell
   # Test with curl
   curl -X POST http://your-nanosiem:8080/ \
     -H "Authorization: Bearer wrong-token" \
     -d "test"

   # Should return 401 if auth is working
   ```

2. Verify Source Type Routing
   - Ensure the `X-Source-Type` header matches the feed configuration
   - Check Vector logs for routing decisions
3. Check Parser Deployment
   - Go to Log Sources → [Your Source] → Overview tab
   - Check deployment status and event volume chart
   - If changes are pending, publish to deploy
Parse Errors
1. Review Error Logs
   - Check System → Ingestion Errors
   - Look for VRL syntax or field extraction issues
2. Test Parser
   - Use the Test function in the parser editor
   - Validate against actual log samples
   - Adjust VRL code as needed
Performance Issues
1. Check Rate Limits
   - Default: 1000 events/second per client
   - Increase in the Vector configuration if needed
2. Monitor Resource Usage
   - ClickHouse disk space and memory
   - Vector CPU and memory usage
   - Network bandwidth
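Per-client throttling of this kind is commonly implemented as a token bucket. A minimal sketch (the 1000 events/sec default comes from this guide; the bucket mechanics are an illustrative assumption, not Vector's actual implementation):

```python
import time

class TokenBucket:
    """Illustrative per-client throttle; burst capacity equals the rate."""

    def __init__(self, rate=1000):
        self.rate = rate
        self.tokens = float(rate)
        self.last = time.monotonic()

    def allow(self, n=1):
        # Refill tokens based on elapsed time, capped at the bucket size.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=5)
results = [bucket.allow() for _ in range(10)]  # burst of 10 against capacity 5
# The first 5 events pass; the rest are throttled until tokens refill.
```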
Best Practices
Security
- Use Strong Authentication Tokens: Generate random 32+ character tokens
- Enable TLS: Use HTTPS for production deployments
- Rotate Credentials: Regularly update cloud credentials
- Network Segmentation: Restrict ingestion endpoints to trusted networks
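A quick way to generate such a token with Python's standard secrets module (exporting it as VECTOR_AUTH_TOKEN mirrors the HTTP example earlier in this guide):

```python
import secrets

# token_urlsafe(32) draws 32 random bytes, yielding about 43 URL-safe
# characters, which clears the 32+ character guidance.
token = secrets.token_urlsafe(32)
print(f"export VECTOR_AUTH_TOKEN={token}")
```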
Performance
- Batch Ingestion: Send multiple logs in single requests when possible
- Use Appropriate Source Types: Choose the right ingestion method for your volume
- Monitor Parsing Efficiency: Complex VRL can impact performance
- Optimize Field Extraction: Only extract fields you need for detection and search
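For the batching tip, one common framing is newline-delimited JSON, so a single POST carries many events. A sketch of building such a body (whether nano's HTTP source accepts NDJSON is an assumption to verify against your deployment):

```python
import json

events = [
    {"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": f"event {i}"}
    for i in range(3)
]

# One JSON event per line; the whole batch goes in a single request body.
body = "\n".join(json.dumps(e) for e in events)
```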
Data Quality
- Validate Sample Data: Ensure samples represent all log variations
- Test Edge Cases: Include malformed or unusual log entries in testing
- Monitor Parse Success Rate: Aim for >95% successful parsing
- Regular Parser Updates: Update parsers when log formats change
- Use Version History: Review diffs between versions to track parser evolution and simplify rollbacks
Operational
- Document Feed Configurations: Maintain documentation for each data source
- Set Up Alerting: Monitor for ingestion failures or data gaps
- Regular Health Checks: Review feed health metrics weekly
- Capacity Planning: Monitor growth trends and plan for scaling
Next Steps
After setting up ingestion:
- Configure Enrichment: Add IP geolocation and threat intelligence
- Create Detections: Build rules to identify security events
- Set Up Dashboards: Visualize your ingested data
- Enable Alerting: Get notified of critical security events
For more advanced configuration options, see:
- UDM Fields - Complete UDM field documentation
- Enrichments - IP geolocation and threat intelligence
- Detections - Creating security detection rules