Data Ingestion
nano provides flexible log ingestion capabilities with support for multiple data sources, real-time processing, and automatic field normalization. This guide covers how ingestion works and provides step-by-step instructions for onboarding new data feeds.
Source Type Required: Every log event must have a source_type defined for nano to route it to the correct parser. See Supported Data Sources for which ingestion methods support this natively and which require a Vector aggregator.
How Ingestion Works
Architecture Overview
Processing Pipeline
- Authentication: Bearer token validation for secure ingestion
- Rate Limiting: Per-client throttling to prevent abuse
- Content Detection: Automatic format detection (JSON, raw text, etc.)
- Source Type Routing: Route logs to appropriate parsers based on source type
- Field Normalization: Transform logs into Unified Data Model (UDM) format
- Enrichment: Add geolocation, ASN, and threat intelligence data
- Storage: Store in ClickHouse for analytics and PostgreSQL for metadata
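As a hedged illustration of the first few stages (the real pipeline runs inside Vector; the token value and parser registry here are assumptions, not nano's actual internals), the stages can be modeled as functions applied in order:

```python
import json

def authenticate(event, token):
    # Authentication: reject events without a valid bearer token.
    if token != "your-secure-token":
        raise PermissionError("invalid bearer token")
    return event

def detect_content(event):
    # Content Detection: try JSON first, fall back to raw text.
    try:
        event["parsed"] = json.loads(event["message"])
        event["format"] = "json"
    except ValueError:
        event["format"] = "raw"
    return event

def route(event):
    # Source Type Routing: pick a parser based on source_type.
    parsers = {"my_app": "my_app_parser"}  # hypothetical registry
    event["parser"] = parsers.get(event["source_type"], "fallback")
    return event

event = {"message": '{"level": "INFO"}', "source_type": "my_app"}
event = authenticate(event, "your-secure-token")
event = detect_content(event)
event = route(event)
```

Later stages (normalization, enrichment, storage) follow the same pattern: each transform takes an event and returns an enriched copy for the next stage.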
Internal Vector Pipeline
The diagram below shows the internal Vector pipeline components. Static components ship with the base ConfigMap on each Vector pod. Dynamic components are generated by the API when log sources are deployed and synced to pods via S3/GCS.
Supported Data Sources
Sources that support source_type identification natively:
| Source Type | Description | How source_type is defined | Use Case |
|---|---|---|---|
| HTTP/Webhooks | Direct HTTP POST ingestion | X-Source-Type header | Applications, webhooks, log shippers |
| Vector Native | Vector-to-Vector forwarding | .source_type field in event | On-prem aggregators |
| Kafka | Apache Kafka consumer | Feed configuration | High-volume streaming data (setup guide) |
| GCP Pub/Sub | Google Cloud Pub/Sub | Feed configuration | Google Cloud logs (setup guide) |
| AWS S3/SQS | S3 via SQS notifications | Feed configuration | AWS CloudTrail, VPC Flow Logs (setup guide) |
Sources that require a Vector aggregator (cannot connect directly):
| Source Type | Why aggregator needed | Documentation |
|---|---|---|
| Syslog | No source_type in protocol | Vector Aggregator guide |
| OpenTelemetry | No source_type in OTLP | Vector Aggregator guide |
| Fluent | Tags don't map to source_type | Vector Aggregator guide |
| File | No metadata in raw files | Vector Aggregator guide |
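For HTTP/webhook ingestion, source_type travels in the X-Source-Type header. A minimal Python sketch of building such a request (the endpoint and token are placeholder values, and the request is constructed but not sent):

```python
import json
import urllib.request

# Placeholder endpoint and token; substitute your deployment's values.
url = "http://your-nanosiem:8080/"
event = {"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login"}

req = urllib.request.Request(
    url,
    data=json.dumps(event).encode("utf-8"),
    headers={
        "Authorization": "Bearer your-secure-token",
        "X-Source-Type": "my_app",  # routes the event to the matching parser
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```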
Feed Onboarding Wizard
nano provides a guided wizard to help you onboard new data sources. The wizard offers three different paths based on your needs:
Option 1: Sample-Based Setup (Recommended)
Best for: New log formats or when you have sample data
1. Navigate to Feeds
   - Go to Feeds → New Feed or use the Log Source Wizard
   - Select "I have sample logs"
2. Provide Sample Data
   - Paste 3-5 representative log entries (one per line)
   - The AI will analyze the format and structure
   - Supports JSON, syslog, CEF, and custom formats
3. AI Parser Generation
   - The system automatically detects the log format
   - Generates VRL (Vector Remap Language) parser code
   - Extracts fields into UDM format
   - Validates syntax and tests against samples
4. Configure Metadata
   - Set feed name (e.g., `aws_cloudtrail`)
   - Choose category (Network, Cloud, Security, etc.)
   - Add vendor and product information
   - Select icon and color for visualization
5. Publish and Test
   - Publish the parser to create a versioned snapshot and deploy to Vector
   - Test with live data
   - Monitor ingestion metrics on the overview tab's event volume chart
Option 2: Existing Data Sampling
Best for: When you already have data in nano
1. Select Source Type
   - Choose "Sample from existing data"
   - Browse discovered source types from your data
   - Select the source type you want to create a parser for
2. Auto-Sample Generation
   - System fetches recent log samples automatically
   - No need to manually copy/paste logs
   - Ensures samples are representative of current data
3. Follow Standard Flow
   - Continue with AI parser generation
   - Configure metadata and publish
Option 3: Manual Configuration
Best for: Advanced users or custom requirements
1. Choose Source Type
   - Select "Manual setup"
   - Pick from available source types (HTTP, Kafka, etc.)
2. Write Custom Parser
   - Write VRL code manually
   - Full control over field extraction and transformation
   - Access to all Vector functions and capabilities
3. Configure Connection
   - Set up source-specific configuration
   - Add cloud credentials if needed
   - Configure routing and matching rules
Cloud Credentials Setup
For cloud-based ingestion (Kafka, GCP Pub/Sub, AWS S3), you'll need to configure credentials securely. For complete end-to-end guides including cloud provider infrastructure setup and IAM permissions, see AWS S3 via SQS and GCP Pub/Sub.
AWS S3/SQS Credentials
1. Navigate to Credentials
   - Go to Settings → Cloud Credentials
   - Click Add Credential
2. Configure AWS Access

   ```
   Provider: AWS S3
   Name: Production AWS Logs
   Region: us-east-1
   Access Key ID: AKIA...
   Secret Access Key: [your secret key]
   ```

3. IAM Permissions Required

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "sqs:ReceiveMessage",
           "sqs:DeleteMessage",
           "s3:GetObject"
         ],
         "Resource": [
           "arn:aws:sqs:*:*:your-log-queue",
           "arn:aws:s3:::your-log-bucket/*"
         ]
       }
     ]
   }
   ```
GCP Pub/Sub Credentials
1. Create Service Account
   - In Google Cloud Console, create a service account
   - Grant the Pub/Sub Subscriber role
   - Download the JSON key file
2. Add to nano

   ```
   Provider: GCP Pub/Sub
   Name: GCP Security Logs
   Service Account JSON: [paste full JSON content]
   ```

3. Required Permissions

   ```
   pubsub.subscriptions.consume
   pubsub.messages.ack
   ```
Kafka Credentials (Optional)
1. SASL Authentication

   ```
   Provider: Kafka
   Name: Production Kafka
   SASL Mechanism: SCRAM-SHA-256
   Username: [kafka username]
   Password: [kafka password]
   TLS Enabled: Yes
   ```

2. Unauthenticated Kafka
   - Leave SASL fields empty for open Kafka clusters
   - Only enable TLS if your cluster requires it
Publishing & Version Management
Once you've created a log source and written a parser, nano uses a publish workflow to manage changes safely. This gives you version history with rollback capability.
How It Works
Parser changes follow a draft → publish → deploy lifecycle:
- Edit — Changes are saved to a working draft. The draft does not affect live parsing.
- Publish — Creates an immutable versioned snapshot and immediately deploys to Vector. The previous version is preserved in history.
- Rollback — Revert to any previous version. This updates your working draft, which you can then review and publish.
Publish replaces Deploy: You no longer need to deploy parsers manually. Publishing handles both versioning and deployment in a single step.
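The draft → publish → rollback lifecycle can be sketched as a small state model (illustrative only; nano's actual data model and the VRL snippets here are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class ParserVersion:
    number: int
    vrl_code: str  # immutable snapshot of the parser at publish time

@dataclass
class LogSource:
    draft: str = ""
    versions: list = field(default_factory=list)
    active: int = 0  # version number currently deployed to Vector

    def publish(self):
        # Publish: snapshot the draft as a new version and deploy it.
        version = ParserVersion(len(self.versions) + 1, self.draft)
        self.versions.append(version)
        self.active = version.number
        return version

    def rollback(self, number):
        # Rollback: load an old version into the draft; nothing deploys
        # until the draft is reviewed and published again.
        self.draft = self.versions[number - 1].vrl_code

src = LogSource()
src.draft = ". = parse_json!(.message)"
src.publish()  # version 1 becomes active
src.draft = ". = parse_json!(string!(.message))"
src.publish()  # version 2 becomes active
src.rollback(1)  # draft holds v1 code, but v2 is still deployed
```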
Draft Changes Banner
When your working copy differs from the currently deployed version, an Unpublished changes banner appears at the top of the log source detail page. From here you can:
- Publish — Create a new version and deploy it
- Discard Draft — Reset your working copy to match the active deployed version
Version History
The Versions tab shows all published versions with:
- Version number and creation date
- Change type — Published, Reverted, or initial creation
- Active version indicator showing which version is currently deployed
- Expandable diff view — Click any version to see the full VRL code and a line-by-line diff against the previous version
- Revert — One-click to load a previous version's parser into your draft for review before re-publishing
Event Volume Chart
The Overview tab includes an Event Volume (24h) area chart showing hourly event counts for the log source. Use this to:
- Verify data is flowing after publishing a new parser
- Spot ingestion gaps or anomalies
- Confirm expected volume patterns
Event Deduplication
Deduplication is always on and requires no configuration. nano automatically removes duplicate events that can occur from:
- Retried HTTP POSTs (e.g., client timeout + server-side success)
- Redundant forwarders sending the same log
- Agent reconnect double-sends
Events are matched on message + source_type + timestamp — if all three are identical within a short window, the duplicate is dropped. This happens in the Vector pipeline between the normalize and prepare_output stages.
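That matching rule can be illustrated with a small sketch (the real deduplication happens inside the Vector pipeline; this version keys on the three fields but ignores the time-window bound for simplicity):

```python
def dedup_key(event):
    # Duplicates match on all three fields: message + source_type + timestamp.
    return (event["message"], event["source_type"], event["timestamp"])

def dedupe(events):
    seen, kept = set(), []
    for event in events:
        key = dedup_key(event)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept

events = [
    {"message": "User login", "source_type": "my_app", "timestamp": "2024-01-01T12:00:00Z"},
    {"message": "User login", "source_type": "my_app", "timestamp": "2024-01-01T12:00:00Z"},  # retried POST
    {"message": "User login", "source_type": "other_app", "timestamp": "2024-01-01T12:00:00Z"},
]
deduped = dedupe(events)  # the retried POST is dropped; the other_app event survives
```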
Event Sampling
Event sampling is a per-log-source setting that keeps only a percentage of events, reducing storage costs for high-volume, low-value sources.
When to use
- High-volume firewall allow logs where 100% coverage isn't needed for detection
- Health check and heartbeat logs
- Debug-level application logs
- Any source generating millions of events/day where statistical sampling is sufficient
Do NOT enable sampling on security-critical sources. Authentication logs, deny/block actions, endpoint detection events, and any source used in detection rules should always keep 100% of events. Sampling these sources creates blind spots that attackers can exploit.
How to configure
- Navigate to Feeds and select the log source
- Go to the Configuration tab and click Edit Configuration
- Scroll to the Event Sampling section
- Set the sampling percentage (e.g., 10 = keep 10% of events)
- Set an exclude condition (VRL expression) for events that should always be kept
- Save and Publish to apply changes
Example: Firewall allow logs
For a firewall source generating 50M events/day where most are routine allow traffic:
- Sampling percentage: 10% (keeps ~5M events/day)
- Exclude condition: `.action != "allow"` — all deny, drop, and reset actions are always kept at 100%
This gives you statistical visibility into allow traffic while preserving every security-relevant event.
How it works
When sampling is enabled for a log source, a Vector sample transform is inserted between the parser output and the combiner. The transform randomly keeps events at the configured ratio. Events matching the exclude condition bypass sampling entirely.
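The keep/exclude logic can be sketched in Python (the real transform is Vector's sample transform; the 10% rate and `.action` field mirror the firewall example above):

```python
import random

def should_keep(event, percentage, exclude):
    # Events matching the exclude condition bypass sampling entirely.
    if exclude(event):
        return True
    # Otherwise keep the event with the configured probability.
    return random.random() * 100 < percentage

random.seed(42)  # deterministic for the sake of the example
events = [{"action": "allow"} for _ in range(1000)] + [{"action": "deny"} for _ in range(5)]
kept = [e for e in events if should_keep(e, 10, lambda e: e["action"] != "allow")]

denies_kept = sum(1 for e in kept if e["action"] == "deny")
allows_kept = len(kept) - denies_kept
# Every deny survives; roughly 10% of the 1000 allows do.
```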
Step-by-Step Onboarding Examples
Example 1: AWS CloudTrail via S3
1. Set up AWS Infrastructure

   ```shell
   # Create SQS queue for S3 notifications
   aws sqs create-queue --queue-name cloudtrail-logs

   # Configure S3 bucket to send notifications to SQS
   aws s3api put-bucket-notification-configuration \
     --bucket your-cloudtrail-bucket \
     --notification-configuration file://notification.json
   ```

2. Add AWS Credentials in nano
   - Provider: AWS S3
   - Region: us-east-1
   - Access Key ID and Secret Key
3. Create Log Source
   - Use wizard with sample CloudTrail JSON
   - AI will detect AWS CloudTrail format
   - Configure source type as `aws_s3`
   - Set SQS queue URL in source configuration
4. Publish and Monitor
   - Publish the parser to create a version and deploy to Vector
   - Monitor ingestion on the overview tab's event volume chart
   - Verify data appears in Search
Example 2: Application Logs via HTTP
1. Get Authentication Token

   ```shell
   # Default token (change in production)
   export VECTOR_AUTH_TOKEN="your-secure-token"
   ```

2. Send Sample Logs

   ```shell
   curl -X POST http://your-nanosiem:8080/ \
     -H "Authorization: Bearer your-secure-token" \
     -H "X-Source-Type: my_app" \
     -H "Content-Type: application/json" \
     -d '{"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": "User login", "user": "john"}'
   ```

3. Create Feed via Wizard
   - Use "Sample from existing data" option
   - Select the `my_app` source type
   - AI generates parser for your JSON format
   - Publish and start receiving structured data
Example 3: Syslog from Network Devices
Syslog requires a Vector aggregator to set source_type before forwarding to nano. See the complete syslog setup guide.
1. Deploy Vector Aggregator On-Premises
   - Install Vector on a server in your network
   - Configure it to receive syslog and tag each event with `source_type`
   - Forward to nano via the Vector-to-Vector protocol
2. Point Devices to Vector Aggregator

   ```
   # Example Cisco configuration
   logging host 192.168.1.100   # Vector aggregator IP
   logging facility local0
   logging severity informational
   ```

3. Create Feed in nano
   - Once logs flow through with `source_type` set, create a feed for each type
   - Use sample data in wizard to build parsers
   - Publish and monitor
Ingestion Monitoring
Health Metrics
Monitor ingestion health in the Feeds dashboard:
- Event Count: Total events received
- Ingestion Rate: Events per second/minute
- Parse Errors: Failed parsing attempts
- Last Event: Timestamp of most recent log
- Health Status: Healthy, Stale, No Data, Error
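As an illustration, a status classifier along these lines could look like the following sketch (the 24-hour staleness threshold is an assumption rather than nano's documented cutoff, and the Error state from parse failures is omitted):

```python
from datetime import datetime, timedelta, timezone

def health_status(last_event, event_count, now):
    # No Data: the feed has never received an event.
    if event_count == 0:
        return "No Data"
    # Stale: events stopped arriving (assumed 24h threshold).
    if now - last_event > timedelta(hours=24):
        return "Stale"
    return "Healthy"

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
fresh = health_status(datetime(2024, 1, 1, 23, tzinfo=timezone.utc), 100, now)
stale = health_status(datetime(2023, 12, 30, tzinfo=timezone.utc), 100, now)
empty = health_status(None, 0, now)
```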
Troubleshooting Common Issues
No Data Appearing
1. Check Authentication

   ```shell
   # Test with curl
   curl -X POST http://your-nanosiem:8080/ \
     -H "Authorization: Bearer wrong-token" \
     -d "test"

   # Should return 401 if auth is working
   ```

2. Verify Source Type Routing
   - Ensure the `X-Source-Type` header matches the feed configuration
   - Check Vector logs for routing decisions
3. Check Parser Deployment
   - Go to Log Sources → [Your Source] → Overview tab
   - Check deployment status and event volume chart
   - If changes are pending, publish to deploy
Parse Errors
1. Review Error Logs
   - Check System → Ingestion Errors
   - Look for VRL syntax or field extraction issues
2. Test Parser
   - Use the Test function in the parser editor
   - Validate against actual log samples
   - Adjust VRL code as needed
Performance Issues
1. Check Rate Limits
   - Default: 1000 events/second per client
   - Increase in the Vector configuration if needed
2. Monitor Resource Usage
   - ClickHouse disk space and memory
   - Vector CPU and memory usage
   - Network bandwidth
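Per-client throttling of this kind is commonly implemented as a token bucket. A minimal sketch (the 1000 events/sec default comes from this guide; the bucket mechanics are an illustrative assumption, not Vector's actual implementation):

```python
import time

class TokenBucket:
    """Illustrative per-client throttle; burst capacity equals the rate."""

    def __init__(self, rate=1000):
        self.rate = rate
        self.tokens = float(rate)
        self.last = time.monotonic()

    def allow(self, n=1):
        # Refill tokens based on elapsed time, capped at the bucket size.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=5)
results = [bucket.allow() for _ in range(10)]  # burst of 10 against capacity 5
# The first 5 events pass; the rest are throttled until tokens refill.
```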
Best Practices
Security
- Use Strong Authentication Tokens: Generate random 32+ character tokens
- Enable TLS: Use HTTPS for production deployments
- Rotate Credentials: Regularly update cloud credentials
- Network Segmentation: Restrict ingestion endpoints to trusted networks
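A quick way to generate such a token with Python's standard secrets module (exporting it as VECTOR_AUTH_TOKEN mirrors the HTTP example earlier in this guide):

```python
import secrets

# token_urlsafe(32) draws 32 random bytes, yielding about 43 URL-safe
# characters, which clears the 32+ character guidance.
token = secrets.token_urlsafe(32)
print(f"export VECTOR_AUTH_TOKEN={token}")
```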
Performance
- Batch Ingestion: Send multiple logs in single requests when possible
- Use Appropriate Source Types: Choose the right ingestion method for your volume
- Monitor Parsing Efficiency: Complex VRL can impact performance
- Optimize Field Extraction: Only extract fields you need for detection and search
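For the batching tip, one common framing is newline-delimited JSON, so a single POST carries many events. A sketch of building such a body (whether nano's HTTP source accepts NDJSON is an assumption to verify against your deployment):

```python
import json

events = [
    {"timestamp": "2024-01-01T12:00:00Z", "level": "INFO", "message": f"event {i}"}
    for i in range(3)
]

# One JSON event per line; the whole batch goes in a single request body.
body = "\n".join(json.dumps(e) for e in events)
```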
Data Quality
- Validate Sample Data: Ensure samples represent all log variations
- Test Edge Cases: Include malformed or unusual log entries in testing
- Monitor Parse Success Rate: Aim for >95% successful parsing
- Regular Parser Updates: Update parsers when log formats change
- Use Version History: Review diffs between versions to track parser evolution and simplify rollbacks
Operational
- Document Feed Configurations: Maintain documentation for each data source
- Set Up Alerting: Monitor for ingestion failures or data gaps
- Regular Health Checks: Review feed health metrics weekly
- Capacity Planning: Monitor growth trends and plan for scaling
Next Steps
After setting up ingestion:
- Configure Enrichment: Add IP geolocation and threat intelligence
- Create Detections: Build rules to identify security events
- Set Up Dashboards: Visualize your ingested data
- Enable Alerting: Get notified of critical security events
For more advanced configuration options, see:
- UDM Fields - Complete UDM field documentation
- Enrichments - IP geolocation and threat intelligence
- Detections - Creating security detection rules