Monitoring and Observability
Arches deploys a monitoring stack built on Grafana, Loki, Prometheus, and Promtail. Built-in dashboards show system health and performance as soon as the stack is running.
Stack Overview
Core Components
- Grafana: Visualization and dashboarding platform with pre-configured dashboards
- Loki: Log aggregation system for centralized logging
- Prometheus: Metrics collection and alerting
- Promtail: Log shipping agent for Loki
Built-in Dashboards
Arches ships with four pre-configured Grafana dashboards:
1. Application Overview Dashboard
- Request rate and response times
- Error rates by endpoint
- Active connections and goroutines
- Memory and CPU usage
- Database connection pool metrics
2. Infrastructure Dashboard
- Node resource utilization
- Pod status and restarts
- Network traffic patterns
- Disk I/O and usage
- Container resource limits
3. Business Metrics Dashboard
- User registrations and logins
- API usage by endpoint
- Organization activity
- Workflow execution metrics
- Content processing statistics
4. Logs Dashboard (Loki)
- Real-time log streaming
- Log level distribution
- Error log aggregation
- Request tracing
- Structured query capabilities
Quick Start
Deploy Monitoring Stack
Code
Access Dashboards
Code
Configuration
Grafana Data Sources
Pre-configured data sources include:
Code
Loki Configuration
Code
Dashboard Features
Real-time Metrics
- Live-updating graphs with a 5-second refresh interval
- Customizable time ranges
- Drill-down capabilities
- Correlation between metrics
Log Analysis
- Full-text search across all logs
- LogQL query language support
- Context viewing for log entries
- Export capabilities
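LogQL queries can also be run outside Grafana through Loki's HTTP API. The following is a minimal sketch of building a `query_range` request URL. The base address `http://localhost:3100`, the `app="arches"` label, and the helper name `loki_query_range_url` are assumptions made for illustration, not values taken from the Arches deployment.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,  # Unix epoch in nanoseconds
        "limit": limit,  # maximum number of log lines to return
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical stream selector and line filter: lines containing "error"
# from streams labeled app="arches" (the label name is an assumption).
query = '{app="arches"} |= "error"'
url = loki_query_range_url(
    "http://localhost:3100",  # assumed Loki address
    query,
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
print(url)
```

Bounding the request with `start` and `end` keeps the query restricted to a five-minute window here, which is the same idea as choosing a narrow time range in the Logs dashboard.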
Alerting
- Pre-configured alert rules
- Multiple notification channels
- Alert history and silencing
- SLA tracking
Integration
Application Instrumentation
Arches automatically exports metrics and logs:
Code
Custom Metrics
Add custom metrics easily:
Code
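Whatever helper is used to register it, a custom metric reaches Prometheus as plain text in the exposition format. The sketch below renders a counter in that format so the shape of a scraped metric is visible. The metric name `arches_workflow_runs_total`, its `status` label, and the function `render_counter` are hypothetical and are not part of any Arches API; label-value escaping is omitted for brevity.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: workflow executions broken down by outcome.
samples = {
    (("status", "ok"),): 12,
    (("status", "failed"),): 3,
}
text = render_counter(
    "arches_workflow_runs_total",
    "Workflow executions by status.",
    samples,
)
print(text)
```

Each distinct label combination becomes its own line, and therefore its own time series, which is why label sets are best kept small.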
Alerts
Pre-configured alerts include:
- High error rate (greater than 5% of requests)
- High latency (p95 greater than 1s)
- Low disk space (less than 10% free)
- Pod restarts (more than 3 in 5 minutes)
- Database connection issues
- Memory pressure (greater than 80% usage)
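The numeric thresholds above can be read as simple comparisons against a snapshot of current measurements. The sketch below is only an illustration of that arithmetic, not the actual alert rule definitions; the `Snapshot` fields and alert names are hypothetical. The database connection alert is left out because the list gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    requests: int  # total requests in the evaluation window
    errors: int  # failed requests in the same window
    p95_latency_s: float  # 95th percentile latency in seconds
    disk_free_ratio: float  # free disk space as a fraction, 0.0 to 1.0
    pod_restarts_5m: int  # pod restarts over the last 5 minutes
    memory_used_ratio: float  # memory in use as a fraction, 0.0 to 1.0


def firing_alerts(s):
    """Return the names of the thresholds that the snapshot exceeds."""
    alerts = []
    if s.requests > 0 and s.errors / s.requests > 0.05:
        alerts.append("high_error_rate")
    if s.p95_latency_s > 1.0:
        alerts.append("high_latency")
    if s.disk_free_ratio < 0.10:
        alerts.append("low_disk_space")
    if s.pod_restarts_5m > 3:
        alerts.append("pod_restarts")
    if s.memory_used_ratio > 0.80:
        alerts.append("memory_pressure")
    return alerts


snapshot = Snapshot(
    requests=2000,
    errors=120,  # 6% error rate
    p95_latency_s=0.8,
    disk_free_ratio=0.25,
    pod_restarts_5m=4,
    memory_used_ratio=0.80,  # exactly at the limit, so not above it
)
print(firing_alerts(snapshot))  # ['high_error_rate', 'pod_restarts']
```

All comparisons are strict, matching the "greater than" and "less than" wording of the list, so a value sitting exactly on a threshold does not fire.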
Best Practices
- Retention: Logs retained for 30 days, metrics for 90 days
- Sampling: Automatic sampling for high-volume endpoints
- Cardinality: Labels kept minimal to prevent metric explosion
- Security: TLS enabled, authentication required
- Backup: Daily backups of Grafana dashboards and configurations
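The cardinality point can be made concrete with a little arithmetic: in the worst case, one metric produces as many time series as the product of the distinct values of each of its labels. The sketch below assumes hypothetical label names and counts purely to show how a single high-cardinality label changes the total.

```python
from math import prod


def series_count(distinct_values_per_label):
    """Upper bound on time series for one metric.

    The argument maps each label name to its number of distinct values.
    """
    return prod(distinct_values_per_label.values())


# Hypothetical label sets for a single request counter.
bounded = series_count({"method": 5, "status": 6, "endpoint": 40})
with_user_id = series_count(
    {"method": 5, "status": 6, "endpoint": 40, "user_id": 10_000}
)
print(bounded)  # 1200
print(with_user_id)  # 12000000
```

Adding one unbounded label such as a user ID multiplies the series count by the number of users, which is the "metric explosion" that minimal labels avoid.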
Troubleshooting
Common Issues
- No data in dashboards
  Code
- High memory usage
  - Adjust retention policies
  - Increase resource limits
  - Enable log sampling
- Slow queries
  - Add appropriate indexes
  - Optimize LogQL queries
  - Use time range filters
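When dashboards show no data, a reasonable first step is to confirm that each component answers on its health or readiness endpoint. The sketch below probes Grafana (`/api/health`), Loki (`/ready`), and Prometheus (`/-/ready`). The `localhost` addresses and default ports are assumptions about how the stack is exposed; substitute the addresses used in your deployment.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Assumed local addresses on each component's default port.
HEALTH_URLS = {
    "grafana": "http://localhost:3000/api/health",
    "loki": "http://localhost:3100/ready",
    "prometheus": "http://localhost:9090/-/ready",
}


def check(url, timeout=2.0):
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


def check_all(urls, probe=check):
    """Probe every endpoint and return a name -> healthy mapping."""
    return {name: probe(url) for name, url in urls.items()}


if __name__ == "__main__":
    for name, healthy in check_all(HEALTH_URLS).items():
        print(f"{name}: {'ok' if healthy else 'unreachable'}")
```

If every component reports healthy and the dashboards are still empty, the data source configuration in Grafana and the selected time range are the next things to inspect.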