yusyus 66c823107e revert: restore DOCKER_GUIDE.md and KUBERNETES_GUIDE.md

These files were incorrectly deleted — they have distinct content from
the *_DEPLOYMENT.md files (different structure, different focus, different
examples) and are not duplicates.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-18 22:24:34 +03:00

Kubernetes Deployment Guide

Complete guide for deploying Skill Seekers to Kubernetes using Helm charts.

Table of Contents

Prerequisites
Quick Start
Installation Methods
Configuration
Accessing Services
Scaling
Persistence
Vector Databases
Security
Monitoring
Troubleshooting
Production Best Practices

Prerequisites

Required

Kubernetes cluster (1.23+)
Helm 3.8+
kubectl configured for your cluster
20GB+ available storage (for persistence)

Cluster Resource Requirements

Minimum (Development):

2 CPU cores
8GB RAM
20GB storage

Recommended (Production):

8+ CPU cores
32GB+ RAM
200GB+ storage (persistent volumes)

Quick Start

1. Add Helm Repository (if published)

# Add Helm repo
helm repo add skill-seekers https://yourusername.github.io/skill-seekers
helm repo update

# Install with default values
helm install my-skill-seekers skill-seekers/skill-seekers \
  --create-namespace \
  --namespace skill-seekers

2. Install from Local Chart

# Clone repository
git clone https://github.com/yourusername/skill-seekers.git
cd skill-seekers

# Install chart
helm install my-skill-seekers ./helm/skill-seekers \
  --create-namespace \
  --namespace skill-seekers

3. Quick Test

# Port-forward MCP server
kubectl port-forward -n skill-seekers svc/my-skill-seekers-mcp 8765:8765

# Test health endpoint
curl http://localhost:8765/health

# Expected response: {"status": "ok"}
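A port-forward can take a moment to accept connections, so the first health check may fail even when the deployment is fine. The following is a minimal sketch of a retry wrapper, not part of the chart; the function name `retry` and its argument order are assumptions made for illustration.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs out.
# Returns 0 on the first success, 1 if every attempt failed.
# (Hypothetical helper; not shipped with Skill Seekers.)
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example, assuming the port-forward above is running in another terminal:
#   retry 10 2 curl -fsS http://localhost:8765/health
```

With `curl -f`, an HTTP error status counts as a failed attempt, so the loop keeps waiting until the endpoint actually returns success.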

Installation Methods

Method 1: Minimal Installation (Testing)

Smallest deployment for testing - no persistence, no vector databases.

helm install my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --create-namespace \
  --set persistence.enabled=false \
  --set vectorDatabases.weaviate.enabled=false \
  --set vectorDatabases.qdrant.enabled=false \
  --set vectorDatabases.chroma.enabled=false \
  --set mcpServer.replicaCount=1 \
  --set mcpServer.autoscaling.enabled=false

Method 2: Development Installation

Moderate resources with persistence for local development.

helm install my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --create-namespace \
  --set persistence.data.size=5Gi \
  --set persistence.output.size=10Gi \
  --set vectorDatabases.weaviate.persistence.size=20Gi \
  --set mcpServer.replicaCount=1 \
  --set secrets.anthropicApiKey="sk-ant-..."

Method 3: Production Installation

Full production deployment with autoscaling, persistence, and all vector databases.

helm install my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --create-namespace \
  --values production-values.yaml

production-values.yaml:

global:
  environment: production

mcpServer:
  enabled: true
  replicaCount: 3
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
  resources:
    limits:
      cpu: 2000m
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 1Gi

persistence:
  data:
    size: 20Gi
    storageClass: "fast-ssd"
  output:
    size: 50Gi
    storageClass: "fast-ssd"

vectorDatabases:
  weaviate:
    enabled: true
    persistence:
      size: 100Gi
      storageClass: "fast-ssd"
  qdrant:
    enabled: true
    persistence:
      size: 100Gi
      storageClass: "fast-ssd"
  chroma:
    enabled: true
    persistence:
      size: 50Gi
      storageClass: "fast-ssd"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  hosts:
    - host: skill-seekers.example.com
      paths:
        - path: /mcp
          pathType: Prefix
          backend:
            service:
              name: mcp
              port: 8765
  tls:
    - secretName: skill-seekers-tls
      hosts:
        - skill-seekers.example.com

secrets:
  anthropicApiKey: "sk-ant-..."
  googleApiKey: ""
  openaiApiKey: ""
  githubToken: ""

Method 4: Custom Values Installation

# Create custom values
cat > my-values.yaml <<EOF
mcpServer:
  replicaCount: 2
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
secrets:
  anthropicApiKey: "sk-ant-..."
EOF

# Install with custom values
helm install my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --create-namespace \
  --values my-values.yaml

Configuration

API Keys and Secrets

Option 1: Via Helm values (NOT recommended for production)

helm install my-skill-seekers ./helm/skill-seekers \
  --set secrets.anthropicApiKey="sk-ant-..." \
  --set secrets.githubToken="ghp_..."

Option 2: Create Secret first (Recommended)

# Create secret
kubectl create secret generic skill-seekers-secrets \
  --from-literal=ANTHROPIC_API_KEY="sk-ant-..." \
  --from-literal=GITHUB_TOKEN="ghp_..." \
  --namespace skill-seekers

# Reference in values
# (Chart already uses the secret name pattern)
helm install my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers

Option 3: External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: skill-seekers-secrets
  namespace: skill-seekers
spec:
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: skill-seekers-secrets
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: skill-seekers/anthropic-api-key

Environment Variables

Customize via ConfigMap values:

env:
  MCP_TRANSPORT: "http"
  MCP_PORT: "8765"
  PYTHONUNBUFFERED: "1"
  CUSTOM_VAR: "value"

Resource Limits

Development:

mcpServer:
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 250m
      memory: 512Mi

Production:

mcpServer:
  resources:
    limits:
      cpu: 4000m
      memory: 8Gi
    requests:
      cpu: 1000m
      memory: 2Gi

Accessing Services

Port Forwarding (Development)

# MCP Server
kubectl port-forward -n skill-seekers svc/my-skill-seekers-mcp 8765:8765

# Weaviate
kubectl port-forward -n skill-seekers svc/my-skill-seekers-weaviate 8080:8080

# Qdrant
kubectl port-forward -n skill-seekers svc/my-skill-seekers-qdrant 6333:6333

# Chroma
kubectl port-forward -n skill-seekers svc/my-skill-seekers-chroma 8000:8000

Via LoadBalancer

mcpServer:
  service:
    type: LoadBalancer

Get external IP:

kubectl get svc -n skill-seekers my-skill-seekers-mcp

Via Ingress (Production)

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: skill-seekers.example.com
      paths:
        - path: /mcp
          pathType: Prefix
          backend:
            service:
              name: mcp
              port: 8765

Access at: https://skill-seekers.example.com/mcp

Scaling

Manual Scaling

# Scale MCP server (if the HPA is enabled, it will override a manual replica count)
kubectl scale deployment -n skill-seekers my-skill-seekers-mcp --replicas=5

# Scale Weaviate
kubectl scale deployment -n skill-seekers my-skill-seekers-weaviate --replicas=3

Horizontal Pod Autoscaler

Enabled by default for MCP server:

mcpServer:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

Monitor HPA:

kubectl get hpa -n skill-seekers
kubectl describe hpa -n skill-seekers my-skill-seekers-mcp
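To reason about what the autoscaler will do with the targets above, it helps to work through the rule Kubernetes documents for the HPA: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to `minReplicas` and `maxReplicas`. The sketch below is a simplified illustration. The function name `hpa_desired` is hypothetical, and the sketch ignores the controller's tolerance band and stabilization window.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
# Prints the replica count from the basic HPA formula, clamped to MIN..MAX.
# Simplified sketch: tolerance and stabilization behaviour are not modelled.
hpa_desired() {
  # Integer ceiling of (current * currentUtil / targetUtil)
  hpa_raw=$(( ($1 * $2 + $3 - 1) / $3 ))
  if [ "$hpa_raw" -lt "$4" ]; then hpa_raw=$4; fi
  if [ "$hpa_raw" -gt "$5" ]; then hpa_raw=$5; fi
  echo "$hpa_raw"
}

hpa_desired 2 140 70 2 10   # 2 pods at 140% of requested CPU -> prints 4
hpa_desired 8 200 70 2 10   # raw result is 23, capped at maxReplicas -> prints 10
```

Utilization is measured against the pod's resource requests, not its limits. When both CPU and memory targets are set, the HPA evaluates each metric and uses the largest resulting replica count.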

Vertical Scaling

Update resource requests/limits:

helm upgrade my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --set mcpServer.resources.requests.cpu=2000m \
  --set mcpServer.resources.requests.memory=4Gi \
  --reuse-values

Persistence

Storage Classes

Specify storage class for different workloads:

persistence:
  data:
    storageClass: "fast-ssd"  # Frequently accessed
  output:
    storageClass: "standard"  # Archive storage
  configs:
    storageClass: "fast-ssd"  # Configuration files

PVC Management

# List PVCs
kubectl get pvc -n skill-seekers

# Expand PVC (if storage class supports it)
kubectl patch pvc my-skill-seekers-data \
  -n skill-seekers \
  -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

# View PVC details
kubectl describe pvc -n skill-seekers my-skill-seekers-data

Backup and Restore

Backup:

# Using Velero
velero backup create skill-seekers-backup \
  --include-namespaces skill-seekers

# Manual backup (example with data PVC)
# -C / stores paths relative to the root (data/...), which the restore relies on
kubectl exec -n skill-seekers deployment/my-skill-seekers-mcp -- \
  tar czf - -C / data > skill-seekers-data-backup.tar.gz

Restore:

# Using Velero
velero restore create --from-backup skill-seekers-backup

# Manual restore
# Archive paths start with data/, so extract at / (extracting at /data would create /data/data)
kubectl exec -i -n skill-seekers deployment/my-skill-seekers-mcp -- \
  tar xzf - -C / < skill-seekers-data-backup.tar.gz

Vector Databases

Weaviate

Access:

kubectl port-forward -n skill-seekers svc/my-skill-seekers-weaviate 8080:8080

Query:

curl http://localhost:8080/v1/schema

Qdrant

Access:

# HTTP API
kubectl port-forward -n skill-seekers svc/my-skill-seekers-qdrant 6333:6333

# gRPC
kubectl port-forward -n skill-seekers svc/my-skill-seekers-qdrant 6334:6334

Query:

curl http://localhost:6333/collections

Chroma

Access:

kubectl port-forward -n skill-seekers svc/my-skill-seekers-chroma 8000:8000

Query:

curl http://localhost:8000/api/v1/collections

Disable Vector Databases

To disable individual vector databases:

vectorDatabases:
  weaviate:
    enabled: false
  qdrant:
    enabled: false
  chroma:
    enabled: false

Security

Pod Security Context

Runs as non-root user (UID 1000):

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: false
  allowPrivilegeEscalation: false

Network Policies

Create network policies for isolation:

networkPolicy:
  enabled: true
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            name: ingress-nginx
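  egress:
    # Assuming the chart renders this into a NetworkPolicy as written, it allows
    # traffic to pods in any namespace only. Calls to external endpoints
    # (for example LLM APIs) need an additional rule such as an ipBlock.
    - to:
      - namespaceSelector: {}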

RBAC

Enable RBAC with minimal permissions:

rbac:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["configmaps", "secrets"]
      verbs: ["get", "list"]

Secrets Management

Best Practices:

Never commit secrets to git
Use external secret managers (AWS Secrets Manager, HashiCorp Vault)
Enable encryption at rest in Kubernetes
Rotate secrets regularly

Example with Sealed Secrets:

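# Create sealed secret
# Set the namespace here: by default a sealed secret only decrypts in the namespace it was sealed for
kubectl create secret generic skill-seekers-secrets \
  --namespace skill-seekers \
  --from-literal=ANTHROPIC_API_KEY="sk-ant-..." \
  --dry-run=client -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml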

# Apply sealed secret
kubectl apply -f sealed-secret.yaml -n skill-seekers

Monitoring

Pod Metrics

# View pod status
kubectl get pods -n skill-seekers

# View pod metrics (requires metrics-server)
kubectl top pods -n skill-seekers

# View pod logs
kubectl logs -n skill-seekers -l app.kubernetes.io/component=mcp-server --tail=100 -f

Prometheus Integration

Enable ServiceMonitor (requires Prometheus Operator):

serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
  labels:
    prometheus: kube-prometheus

Grafana Dashboards

Import dashboard JSON from helm/skill-seekers/dashboards/.

Health Checks

MCP server has built-in health checks:

livenessProbe:
  httpGet:
    path: /health
    port: 8765
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8765
  initialDelaySeconds: 10
  periodSeconds: 5

Test manually:

kubectl exec -n skill-seekers deployment/my-skill-seekers-mcp -- \
  curl http://localhost:8765/health

Troubleshooting

Pods Not Starting

# Check pod status
kubectl get pods -n skill-seekers

# View events
kubectl get events -n skill-seekers --sort-by='.lastTimestamp'

# Describe pod
kubectl describe pod -n skill-seekers <pod-name>

# Check logs
kubectl logs -n skill-seekers <pod-name>

Common Issues

Issue: ImagePullBackOff

# Check image pull secrets
kubectl get secrets -n skill-seekers

# Verify image exists
docker pull <image-name>

Issue: CrashLoopBackOff

# View recent logs
kubectl logs -n skill-seekers <pod-name> --previous

# Check environment variables
kubectl exec -n skill-seekers <pod-name> -- env

Issue: PVC Pending

# Check storage class
kubectl get storageclass

# View PVC events
kubectl describe pvc -n skill-seekers <pvc-name>

# Check if provisioner is running
kubectl get pods -n kube-system | grep provisioner

Issue: API Key Not Working

# Verify secret exists
kubectl get secret -n skill-seekers my-skill-seekers

# Check secret contents (base64 encoded)
kubectl get secret -n skill-seekers my-skill-seekers -o yaml

# Test API key manually
kubectl exec -n skill-seekers deployment/my-skill-seekers-mcp -- \
  env | grep ANTHROPIC
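Printing the Secret as YAML shows base64 text, and decoding it puts the full key on screen and in shell history. The sketch below is one way to confirm that the right key is stored without revealing it: it prints only a short prefix and the length. The helper name `mask_secret` is hypothetical, and the Secret name and key in the usage comment are assumptions taken from the examples in this guide, so adjust them to match your release.

```shell
# mask_secret: read a secret value on stdin, print its first 7 characters
# and its total length instead of the full value.
# (Hypothetical helper; not part of the chart.)
mask_secret() {
  mask_value=$(cat)
  mask_prefix=$(printf '%s' "$mask_value" | cut -c1-7)
  printf '%s... (%d chars)\n' "$mask_prefix" "${#mask_value}"
}

# Usage against the cluster (Secret name and key are assumptions):
#   kubectl get secret -n skill-seekers skill-seekers-secrets \
#     -o jsonpath='{.data.ANTHROPIC_API_KEY}' | base64 -d | mask_secret
```

An unexpected prefix or a length that differs from the key you issued usually points to a wrong value or a truncated paste. Trailing newlines are stripped by the helper, so it will not reveal a stray newline in the stored value.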

Debug Container

Run debug container in same namespace:

kubectl run debug -n skill-seekers --rm -it \
  --image=nicolaka/netshoot \
  --restart=Never -- bash

# Inside debug container:
# Test MCP server connectivity
curl http://my-skill-seekers-mcp:8765/health

# Test vector database connectivity
curl http://my-skill-seekers-weaviate:8080/v1/.well-known/ready

Production Best Practices

1. Resource Planning

Capacity Planning:

MCP Server: 500m CPU + 1Gi RAM per 10 concurrent requests
Vector DBs: 2GB RAM + 10GB storage per 100K documents
Reserve 30% overhead for spikes
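The rules of thumb above can be turned into a quick calculation. The sketch below makes two assumptions that the guide does not state: load is rounded up to the next full block (10 requests or 100K documents), and "30% overhead" is read as multiplying the result by 1.3. The function names are hypothetical.

```shell
# ceil_div A B: integer ceiling of A / B
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# mcp_capacity CONCURRENT_REQUESTS -> "<cpu>m <memory>Mi" (total, incl. 30% overhead)
# Rule used: 500m CPU + 1Gi RAM per 10 concurrent requests.
mcp_capacity() {
  cap_units=$(ceil_div "$1" 10)
  cap_cpu=$(ceil_div $((cap_units * 500 * 130)) 100)
  cap_mem=$(ceil_div $((cap_units * 1024 * 130)) 100)
  echo "${cap_cpu}m ${cap_mem}Mi"
}

# vectordb_capacity DOCUMENTS -> "<ram>GB <storage>GB" (total, incl. 30% overhead)
# Rule used: 2GB RAM + 10GB storage per 100K documents.
vectordb_capacity() {
  cap_units=$(ceil_div "$1" 100000)
  cap_ram=$(ceil_div $((cap_units * 2 * 130)) 100)
  cap_disk=$(ceil_div $((cap_units * 10 * 130)) 100)
  echo "${cap_ram}GB ${cap_disk}GB"
}

mcp_capacity 50            # prints: 3250m 6656Mi
vectordb_capacity 250000   # prints: 8GB 39GB
```

These are totals across all replicas. Divide by the replica count to get per-pod requests, and treat the output as a starting point to validate against observed usage.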

Example Production Setup:

mcpServer:
  replicaCount: 5  # By the rule above, each replica as sized below covers ~50 concurrent requests
  resources:
    requests:
      cpu: 2500m
      memory: 5Gi
  autoscaling:
    minReplicas: 5
    maxReplicas: 20

2. High Availability

Anti-Affinity Rules:

mcpServer:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - mcp-server
        topologyKey: kubernetes.io/hostname

Multiple Replicas:

MCP Server: 3+ replicas across different nodes
Vector DBs: 2+ replicas with replication

3. Monitoring and Alerting

Key Metrics to Monitor:

Pod restart count (> 5 per hour = critical)
Memory usage (> 90% = warning)
CPU throttling (> 50% = investigate)
Request latency (p95 > 1s = warning)
Error rate (> 1% = critical)

Prometheus Alerts:

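- alert: HighPodRestarts
  # increase() counts restarts over the window; rate() is per second, so "> 0.1" would mean about 360 restarts per hour
  expr: increase(kube_pod_container_status_restarts_total{namespace="skill-seekers"}[1h]) > 5
  for: 5m
  labels:
    severity: critical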

4. Backup Strategy

Automated Backups:

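# CronJob for daily backups
# The PVC name skill-seekers-backup is a placeholder; create it or substitute your own.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: skill-seekers-backup
  namespace: skill-seekers
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure  # Required for Job pods (OnFailure or Never)
          containers:
          - name: backup
            image: skill-seekers:latest
            command:
            - /bin/sh
            - -c
            - tar czf /backup/data-$(date +%Y%m%d).tar.gz /data
            volumeMounts:
            - name: data
              mountPath: /data
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: my-skill-seekers-data
          - name: backup
            persistentVolumeClaim:
              claimName: skill-seekers-backup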

5. Security Hardening

Security Checklist:

Enable Pod Security Standards
Use Network Policies
Enable RBAC with least privilege
Rotate secrets every 90 days
Scan images for vulnerabilities
Enable audit logging
Use private container registry
Enable encryption at rest

6. Cost Optimization

Strategies:

Use spot/preemptible instances for non-critical workloads
Enable cluster autoscaler
Right-size resource requests
Use storage tiering (hot/warm/cold)
Schedule downscaling during off-hours

Example Cost Optimization:

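# Development environment: downscale at night
# Create CronJob to scale down replicas
# Assumes the scaler ServiceAccount is allowed to scale deployments in this namespace.
# If the HPA is enabled it restores minReplicas, so disable it or lower minReplicas in dev.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: downscale-dev
  namespace: skill-seekers
spec:
  schedule: "0 20 * * *"  # 8 PM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          restartPolicy: OnFailure  # Required for Job pods (OnFailure or Never)
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment
            - my-skill-seekers-mcp
            - --namespace=skill-seekers
            - --replicas=1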

7. Update Strategy

Rolling Updates:

mcpServer:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
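The two rolling-update settings bound how many pods exist during a rollout: at least `replicas - maxUnavailable` stay available, and at most `replicas + maxSurge` run at once. A small worked sketch, assuming absolute counts rather than percentages (the function name `rollout_bounds` is hypothetical):

```shell
# rollout_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
# Prints "<min_available> <max_total>" for a rolling update.
# Handles absolute counts only; percentage values are not modelled.
rollout_bounds() {
  echo "$(( $1 - $3 )) $(( $1 + $2 ))"
}

rollout_bounds 3 1 0   # prints: 3 4
```

With the values above and three replicas, capacity never drops below three pods, but the cluster needs headroom for one extra pod while the rollout is in progress.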

Update Process:

# 1. Test in staging
helm upgrade my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers-staging \
  --values staging-values.yaml

# 2. Run smoke tests
./scripts/smoke-test.sh

# 3. Deploy to production
helm upgrade my-skill-seekers ./helm/skill-seekers \
  --namespace skill-seekers \
  --values production-values.yaml

# 4. Monitor for 15 minutes
kubectl rollout status deployment -n skill-seekers my-skill-seekers-mcp

# 5. Rollback if issues
helm rollback my-skill-seekers -n skill-seekers

Upgrade Guide

Minor Version Upgrade

# Fetch latest chart
helm repo update

# Upgrade with existing values
helm upgrade my-skill-seekers skill-seekers/skill-seekers \
  --namespace skill-seekers \
  --reuse-values

Major Version Upgrade

# Backup current values
helm get values my-skill-seekers -n skill-seekers > backup-values.yaml

# Review CHANGELOG for breaking changes
curl https://raw.githubusercontent.com/yourusername/skill-seekers/main/CHANGELOG.md

# Upgrade with migration steps
helm upgrade my-skill-seekers skill-seekers/skill-seekers \
  --namespace skill-seekers \
  --values backup-values.yaml \
  --force  # Replaces resources instead of patching them; use only if the upgrade fails without it

Uninstallation

Full Cleanup

# Delete Helm release
helm uninstall my-skill-seekers -n skill-seekers

# Delete PVCs (if you want to remove data)
kubectl delete pvc -n skill-seekers --all

# Delete namespace
kubectl delete namespace skill-seekers

Keep Data

# Delete release but keep PVCs
helm uninstall my-skill-seekers -n skill-seekers

# PVCs remain for later use
kubectl get pvc -n skill-seekers

Additional Resources

Need Help?

GitHub Issues: https://github.com/yourusername/skill-seekers/issues
Documentation: https://skillseekersweb.com
Community: [Link to Discord/Slack]
