diff --git a/skills/grpc-golang/SKILL.md b/skills/grpc-golang/SKILL.md
new file mode 100644
index 00000000..66d17712
--- /dev/null
+++ b/skills/grpc-golang/SKILL.md
@@ -0,0 +1,103 @@
+---
+name: grpc-golang
+description: "Build production-ready gRPC services in Go with mTLS, streaming, and observability. Use when designing Protobuf contracts with Buf or implementing secure service-to-service transport."
+risk: safe
+source: self
+---
+
+# gRPC Golang (gRPC-Go)
+
+## Overview
+
+Comprehensive guide for designing and implementing production-grade gRPC services in Go. Covers contract standardization with Buf, transport layer security via mTLS, and observability with OpenTelemetry instrumentation.
+
+## Use this skill when
+
+- Designing microservices communication with gRPC in Go.
+- Building high-performance internal APIs using Protobuf.
+- Implementing streaming workloads (unidirectional or bidirectional).
+- Standardizing API contracts using Protobuf and Buf.
+- Configuring mTLS for service-to-service authentication.
+
+## Do not use this skill when
+
+- Building pure REST/HTTP public APIs without gRPC requirements.
+- Modifying legacy `.proto` files without the ability to introduce a new API version (e.g., `api.v2`) or ensure backward compatibility.
+- Managing service mesh traffic routing (e.g., Istio/Linkerd), which is outside the application code scope.
+
+## Step-by-Step Guide
+
+1. **Confirm Technical Context**: Identify Go version, gRPC-Go version, and whether the project uses Buf or raw protoc.
+2. **Confirm Requirements**: Identify mTLS needs, load patterns (unary/streaming), SLOs, and message size limits.
+3. **Plan Schema**: Define package versioning (e.g., `api.v1`), resource types, and error mapping.
+4. **Security Design**: Implement mTLS for service-to-service authentication.
+5. **Observability**: Configure interceptors or stats handlers for tracing, metrics, and structured logging.
+6. **Verification**: Always run `buf lint` and `buf breaking` before finalizing code generation.
+
+Refer to `resources/implementation-playbook.md` for detailed patterns, code examples, and anti-patterns.
+
+## Examples
+
+### Example 1: Defining a Service & Message (v1 API)
+
+```proto
+syntax = "proto3";
+package api.v1;
+option go_package = "github.com/org/repo/gen/api/v1;apiv1";
+
+service UserService {
+  rpc GetUser(GetUserRequest) returns (GetUserResponse);
+}
+
+message User {
+  string id = 1;
+  string name = 2;
+}
+
+message GetUserRequest {
+  string id = 1;
+}
+
+message GetUserResponse {
+  User user = 1;
+}
+```
+
+## Best Practices
+
+- ✅ **Do:** Use Buf to standardize your toolchain and linting with `buf.yaml` and `buf.gen.yaml`.
+- ✅ **Do:** Always include a major-version suffix in Protobuf package names (e.g., `package api.v1`).
+- ✅ **Do:** Enforce mTLS for all internal service-to-service communication.
+- ✅ **Do:** Handle `ctx.Done()` in all streaming handlers to prevent resource leaks.
+- ✅ **Do:** Map domain errors to standard gRPC status codes (e.g., `codes.NotFound`).
+- ❌ **Don't:** Return raw internal error strings or stack traces to gRPC clients.
+- ❌ **Don't:** Create a new `grpc.ClientConn` per request; always reuse connections.
+
+## Troubleshooting
+
+- **Error: Inconsistent Gen**: If the generated code does not match the schema, re-run `buf generate` and verify the `go_package` option.
+- **Error: Context Deadline**: Check client timeouts and ensure the server is not blocking indefinitely in streaming handlers.
+- **Error: mTLS Handshake**: Ensure the CA certificate is correctly added to the `x509.CertPool` on both client and server sides.
+
+## Limitations
+
+- Does not cover service mesh traffic routing (Istio/Linkerd configuration).
+- Does not cover gRPC-Web or browser-based gRPC integration.
+- Assumes Go 1.21+ and gRPC-Go v1.60+; client construction differs across versions (`grpc.NewClient` requires v1.63+, earlier versions use `grpc.Dial`).
+- Does not cover L7 gRPC-aware load balancer configuration (e.g., Envoy, NGINX).
+- Does not address Protobuf schema registry or large-scale schema governance beyond Buf lint.
+
+## Resources
+
+- `resources/implementation-playbook.md` for detailed patterns, code examples, and anti-patterns.
+- [Google API Design Guide](https://cloud.google.com/apis/design)
+- [Buf Docs](https://buf.build/docs)
+- [gRPC-Go Docs](https://grpc.io/docs/languages/go/)
+- [OpenTelemetry Go Instrumentation](https://opentelemetry.io/docs/instrumentation/go/)
+
+## Related Skills
+
+- @golang-pro - General Go patterns and performance optimization outside the gRPC layer.
+- @go-concurrency-patterns - Advanced goroutine lifecycle management for streaming handlers.
+- @api-design-principles - Resource naming and versioning strategy before writing `.proto` files.
+- @docker-expert - Containerizing gRPC services and configuring TLS cert injection via Docker secrets.
diff --git a/skills/grpc-golang/resources/implementation-playbook.md b/skills/grpc-golang/resources/implementation-playbook.md
new file mode 100644
index 00000000..e6eece49
--- /dev/null
+++ b/skills/grpc-golang/resources/implementation-playbook.md
@@ -0,0 +1,548 @@
+# gRPC Golang Implementation Playbook
+
+This file contains detailed patterns, checklists, and code samples referenced by the skill.
+
+## Schema Design Standards
+
+### Protobuf Definition
+
+- **Syntax**: Use proto3 only.
+- **Versioning**: Use package versioning (e.g., `api.v1`).
+- **Pagination**: Use `page_token` and `page_size` for list operations.
+- **Timezone**: Always use `google.protobuf.Timestamp`, which is timezone-independent, and keep server-side time values in UTC.
+- **Idempotency**: Use idempotency keys or design side-effect-free methods to allow safe retries.
+- **Validation**: Adopt a schema-level validation approach (e.g., Buf validation rules or `protoc-gen-validate`) and ensure the generated validators are invoked server-side.
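The pagination bullet above leaves the token format open. A minimal sketch of one possible server-side approach follows, using only the standard library. The helper names (`clampPageSize`, `encodePageToken`, `decodePageToken`), the default of 50, the cap of 500, and the offset-based token are all assumptions made for illustration; they are not part of the playbook or of any gRPC API.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
)

const (
	defaultPageSize = 50  // assumed default when page_size is unset
	maxPageSize     = 500 // assumed server-side cap
)

// clampPageSize applies the default and the cap to a client-supplied page_size.
// This sketch treats non-positive values as "unset"; a stricter server could
// reject negative values instead.
func clampPageSize(n int32) int32 {
	switch {
	case n <= 0:
		return defaultPageSize
	case n > maxPageSize:
		return maxPageSize
	default:
		return n
	}
}

// encodePageToken wraps an offset in an opaque, URL-safe token.
func encodePageToken(offset int) string {
	return base64.RawURLEncoding.EncodeToString([]byte(strconv.Itoa(offset)))
}

// decodePageToken reverses encodePageToken; an empty token means "first page".
func decodePageToken(token string) (int, error) {
	if token == "" {
		return 0, nil
	}
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return 0, errors.New("invalid page_token")
	}
	offset, err := strconv.Atoi(string(raw))
	if err != nil || offset < 0 {
		return 0, errors.New("invalid page_token")
	}
	return offset, nil
}

func main() {
	token := encodePageToken(40)
	offset, err := decodePageToken(token)
	fmt.Println(clampPageSize(0), clampPageSize(1000), offset, err)
}
```

In a handler, a decode failure would typically be mapped to `codes.InvalidArgument`. Keeping the token opaque lets the server later switch from offsets to keyset cursors without changing the `.proto` contract.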
+
+```proto
+syntax = "proto3";
+package api.v1;
+option go_package = "github.com/org/repo/gen/api/v1;apiv1";
+
+import "google/protobuf/timestamp.proto";
+
+service UserService {
+  rpc GetUser(GetUserRequest) returns (GetUserResponse);
+  rpc ListUsers(ListUsersRequest) returns (ListUsersResponse);
+  rpc WatchUsers(WatchUsersRequest) returns (stream UserEvent);
+}
+
+message User {
+  string id = 1;
+  string name = 2;
+  string email = 3;
+  google.protobuf.Timestamp created_at = 4;
+}
+
+message GetUserRequest {
+  string id = 1;
+}
+
+message GetUserResponse {
+  User user = 1;
+}
+
+message ListUsersRequest {
+  int32 page_size = 1;
+  string page_token = 2;
+}
+
+message ListUsersResponse {
+  repeated User users = 1;
+  string next_page_token = 2;
+}
+
+message WatchUsersRequest {
+  // Empty; streams all user events from the current point.
+}
+
+message UserEvent {
+  enum EventType {
+    EVENT_TYPE_UNSPECIFIED = 0;
+    EVENT_TYPE_CREATED = 1;
+    EVENT_TYPE_UPDATED = 2;
+    EVENT_TYPE_DELETED = 3;
+  }
+  EventType type = 1;
+  User user = 2;
+  google.protobuf.Timestamp occurred_at = 3;
+}
+```
+
+## Code Generation
+
+- **Toolchain**: Use `google.golang.org/protobuf/cmd/protoc-gen-go` and `google.golang.org/grpc/cmd/protoc-gen-go-grpc`.
+- **Management**: Use `buf.gen.yaml` to manage plugin versions and generation parameters.
+- **Compatibility**: Ensure plugins use Protobuf Go v2 API (`google.golang.org/protobuf`). Do not mix with the deprecated v1 API (`github.com/golang/protobuf`).
+
+### buf.gen.yaml Example
+
+```yaml
+version: v2
+plugins:
+  - remote: buf.build/protocolbuffers/go
+    out: gen
+    opt: paths=source_relative
+  - remote: buf.build/grpc/go
+    out: gen
+    opt: paths=source_relative
+```
+
+## Server Implementation
+
+### Full Server Setup with Graceful Shutdown
+
+```go
+package main
+
+import (
+	"context"
+	"log"
+	"net"
+	"os"
+	"os/signal"
+	"syscall"
+	"time"
+
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/health"
+	healthpb "google.golang.org/grpc/health/grpc_health_v1"
+	"google.golang.org/grpc/keepalive"
+
+	apiv1 "github.com/org/repo/gen/api/v1"
+)
+
+func main() {
+	srv := grpc.NewServer(
+		grpc.ChainUnaryInterceptor( // interceptors are assumed to be defined elsewhere in this package
+			recoveryInterceptor,
+			loggingInterceptor,
+			otelUnaryInterceptor,
+		),
+		grpc.KeepaliveParams(keepalive.ServerParameters{
+			MaxConnectionIdle: 5 * time.Minute,
+			Time:              1 * time.Minute,
+			Timeout:           20 * time.Second,
+		}),
+		grpc.MaxRecvMsgSize(4<<20), // 4 MB
+		grpc.MaxSendMsgSize(4<<20), // 4 MB
+	)
+
+	// Register application services (newUserService is assumed to be defined elsewhere).
+	apiv1.RegisterUserServiceServer(srv, newUserService())
+
+	// Register health check with fully-qualified service name.
+	healthSrv := health.NewServer()
+	healthpb.RegisterHealthServer(srv, healthSrv)
+	healthSrv.SetServingStatus(
+		"api.v1.UserService",
+		healthpb.HealthCheckResponse_SERVING,
+	)
+
+	lis, err := net.Listen("tcp", ":50051")
+	if err != nil {
+		log.Fatalf("listen: %v", err)
+	}
+
+	// Graceful shutdown: GracefulStop with a fallback timeout to Stop.
+	go func() {
+		sigCh := make(chan os.Signal, 1)
+		signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
+		<-sigCh
+
+		log.Println("shutting down gRPC server...")
+		healthSrv.SetServingStatus(
+			"api.v1.UserService",
+			healthpb.HealthCheckResponse_NOT_SERVING,
+		)
+
+		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
+		defer cancel()
+
+		stopped := make(chan struct{})
+		go func() {
+			srv.GracefulStop()
+			close(stopped)
+		}()
+
+		select {
+		case <-stopped:
+			log.Println("server stopped gracefully")
+		case <-ctx.Done():
+			log.Println("graceful stop timed out, forcing stop")
+			srv.Stop()
+		}
+	}()
+
+	log.Printf("gRPC server listening on %s", lis.Addr())
+	if err := srv.Serve(lis); err != nil {
+		log.Fatalf("serve: %v", err)
+	}
+}
+```
+
+## mTLS Setup
+
+```go
+package main
+
+import (
+	"crypto/tls"
+	"crypto/x509"
+	"fmt"
+	"log"
+	"os"
+
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/credentials"
+)
+
+// loadServerTLS configures mTLS for the server side.
+func loadServerTLS() grpc.ServerOption {
+	tlsCert, err := tls.LoadX509KeyPair("server.crt", "server.key")
+	if err != nil {
+		log.Fatalf("load server cert: %v", err)
+	}
+
+	caCert, err := os.ReadFile("ca.crt")
+	if err != nil {
+		log.Fatalf("read CA cert: %v", err)
+	}
+
+	caPool := x509.NewCertPool()
+	if !caPool.AppendCertsFromPEM(caCert) {
+		log.Fatal("failed to append CA cert")
+	}
+
+	tlsCfg := &tls.Config{
+		Certificates: []tls.Certificate{tlsCert},
+		ClientCAs:    caPool,
+		ClientAuth:   tls.RequireAndVerifyClientCert,
+		MinVersion:   tls.VersionTLS13,
+	}
+	return grpc.Creds(credentials.NewTLS(tlsCfg))
+}
+
+// dialWithMTLS creates a client connection using mTLS.
+func dialWithMTLS(target string) (*grpc.ClientConn, error) {
+	clientCert, err := tls.LoadX509KeyPair("client.crt", "client.key")
+	if err != nil {
+		return nil, fmt.Errorf("load client cert: %w", err)
+	}
+
+	caCert, err := os.ReadFile("ca.crt")
+	if err != nil {
+		return nil, fmt.Errorf("read CA cert: %w", err)
+	}
+
+	caPool := x509.NewCertPool()
+	if !caPool.AppendCertsFromPEM(caCert) {
+		return nil, fmt.Errorf("failed to append CA cert")
+	}
+
+	creds := credentials.NewTLS(&tls.Config{
+		Certificates: []tls.Certificate{clientCert},
+		RootCAs:      caPool,
+		MinVersion:   tls.VersionTLS13,
+	})
+
+	// Note: for gRPC-Go v1.63+, grpc.NewClient is the recommended replacement.
+	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(creds))
+	if err != nil {
+		return nil, fmt.Errorf("dial %s: %w", target, err)
+	}
+	return conn, nil
+}
+```
+
+## Client Best Practices
+
+### Connection Reuse
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"log"
+	"os"
+	"time"
+
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/credentials"
+
+	apiv1 "github.com/org/repo/gen/api/v1"
+)
+
+// Initialize once at startup; reuse across the application lifetime.
+var userConn *grpc.ClientConn
+
+func initClients(creds credentials.TransportCredentials) {
+	var err error
+	// Note: for gRPC-Go v1.63+, use grpc.NewClient instead.
+	userConn, err = grpc.Dial(
+		os.Getenv("USER_SVC_ADDR"),
+		grpc.WithTransportCredentials(creds),
+	)
+	if err != nil {
+		log.Fatalf("dial user-svc: %v", err)
+	}
+}
+
+func callListUsers(ctx context.Context) (*apiv1.ListUsersResponse, error) {
+	// Always set a deadline per call, not per connection.
+	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
+	defer cancel()
+
+	client := apiv1.NewUserServiceClient(userConn)
+	resp, err := client.ListUsers(ctx, &apiv1.ListUsersRequest{PageSize: 20})
+	if err != nil {
+		return nil, fmt.Errorf("list users: %w", err)
+	}
+	return resp, nil
+}
+```
+
+### Retry Policy
+
+Enable retries only for idempotent methods. The policy below uses exponential backoff; `maxAttempts` counts the original call, so `3` allows at most two retries.
+
+```go
+import "google.golang.org/grpc"
+
+// Service config with retry policy for idempotent methods.
+const retryPolicy = `{
+  "methodConfig": [{
+    "name": [{"service": "api.v1.UserService", "method": "GetUser"}],
+    "retryPolicy": {
+      "maxAttempts": 3,
+      "initialBackoff": "0.1s",
+      "maxBackoff": "1s",
+      "backoffMultiplier": 2,
+      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
+    }
+  }]
+}`
+
+// Note: for gRPC-Go v1.63+, use grpc.NewClient instead of grpc.Dial.
+conn, err := grpc.Dial(
+	target,
+	grpc.WithTransportCredentials(creds),
+	grpc.WithDefaultServiceConfig(retryPolicy),
+)
+```
+
+## Observability
+
+### Interceptor Labels
+
+- **Logging**: Include `grpc.method`, `grpc.service`, `grpc.code`, `latency_ms`, and `trace_id`.
+- **Metrics**: Export request count, latency histogram, and in-flight stream count.
+
+### OpenTelemetry Integration
+
+```go
+import (
+	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
+	"google.golang.org/grpc"
+)
+
+srv := grpc.NewServer(
+	grpc.StatsHandler(otelgrpc.NewServerHandler()),
+)
+
+// Note: for gRPC-Go v1.63+, use grpc.NewClient instead of grpc.Dial.
+conn, err := grpc.Dial(
+	target,
+	grpc.WithStatsHandler(otelgrpc.NewClientHandler()), // transport credentials omitted here for brevity
+)
+```
+
+## Testing
+
+### bufconn In-Process Test
+
+```go
+package service_test
+
+import (
+	"context"
+	"net"
+	"testing"
+	"time"
+
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/credentials/insecure"
+	"google.golang.org/grpc/status"
+	"google.golang.org/grpc/test/bufconn"
+
+	apiv1 "github.com/org/repo/gen/api/v1"
+)
+
+func TestListUsers(t *testing.T) {
+	lis := bufconn.Listen(1 << 20)
+	srv := grpc.NewServer()
+	apiv1.RegisterUserServiceServer(srv, &fakeUserSvc{})
+	go func() {
+		if err := srv.Serve(lis); err != nil {
+			t.Logf("server exited: %v", err)
+		}
+	}()
+	t.Cleanup(srv.GracefulStop)
+
+	// Note: for gRPC-Go v1.63+, use grpc.NewClient with target "passthrough:///bufnet" (NewClient defaults to the dns resolver).
+	conn, err := grpc.DialContext(context.Background(),
+		"bufnet",
+		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
+			return lis.DialContext(ctx)
+		}),
+		grpc.WithTransportCredentials(insecure.NewCredentials()),
+	)
+	if err != nil {
+		t.Fatalf("dial bufnet: %v", err)
+	}
+	t.Cleanup(func() { conn.Close() })
+
+	client := apiv1.NewUserServiceClient(conn)
+	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
+	defer cancel()
+
+	resp, err := client.ListUsers(ctx, &apiv1.ListUsersRequest{PageSize: 10})
+	if code := status.Code(err); code != codes.OK {
+		t.Fatalf("expected OK, got %v: %v", code, err)
+	}
+	if resp == nil {
+		t.Fatal("expected non-nil response")
+	}
+}
+```
+
+## Streaming Handler Pattern
+
+Always check `ctx.Done()` in streaming loops. Never expose raw internal errors to clients.
+
+```go
+func (s *userService) WatchUsers(
+	req *apiv1.WatchUsersRequest,
+	stream apiv1.UserService_WatchUsersServer,
+) error {
+	ctx := stream.Context()
+
+	events := s.subscribeUserEvents()
+	defer s.unsubscribe(events)
+
+	for {
+		select {
+		case <-ctx.Done():
+			// Client disconnected or deadline exceeded; report the matching status code.
+			return status.FromContextError(ctx.Err()).Err()
+
+		case event, ok := <-events:
+			if !ok {
+				// Channel closed; server is shutting down.
+				return status.Error(codes.Unavailable, "service shutting down")
+			}
+
+			if err := stream.Send(event); err != nil {
+				// Log the raw error server-side for diagnostics.
+				log.Printf("stream send failed: %v", err)
+				// Return a generic message to the client; never leak raw err.
+				return status.Error(codes.Internal, "failed to send event")
+			}
+		}
+	}
+}
+```
+
+## Error Mapping
+
+Map domain errors to gRPC status codes consistently.
+
+Only return `err.Error()` to clients when it is a safe, user-facing domain message (not an internal error string).
+
+```go
+package service
+
+import (
+	"errors"
+
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
+)
+
+var (
+	ErrNotFound      = errors.New("resource not found")
+	ErrAlreadyExists = errors.New("resource already exists")
+	ErrInvalidInput  = errors.New("invalid input")
+	ErrPermission    = errors.New("permission denied")
+)
+
+// toGRPCError maps a domain error to a gRPC status error.
+func toGRPCError(err error) error {
+	if err == nil {
+		return nil
+	}
+	switch {
+	case errors.Is(err, ErrNotFound):
+		return status.Error(codes.NotFound, err.Error())
+	case errors.Is(err, ErrAlreadyExists):
+		return status.Error(codes.AlreadyExists, err.Error())
+	case errors.Is(err, ErrInvalidInput):
+		return status.Error(codes.InvalidArgument, err.Error())
+	case errors.Is(err, ErrPermission):
+		return status.Error(codes.PermissionDenied, err.Error())
+	default:
+		return status.Error(codes.Internal, "internal error")
+	}
+}
+```
+
+## Project Layout
+
+```
+project/
+  buf.gen.yaml
+  buf.yaml
+  proto/
+    api/
+      v1/
+        user_service.proto
+  gen/                  # Generated code (committed or gitignored)
+    api/
+      v1/
+        user_service.pb.go
+        user_service_grpc.pb.go
+  internal/
+    service/
+      user.go           # Service implementation
+      user_test.go      # bufconn tests
+    domain/
+      errors.go         # Domain error definitions
+  cmd/
+    server/
+      main.go           # Server entrypoint with graceful shutdown
+  config/
+    config.go           # Env-based config (timeouts, TLS paths, limits)
+```
+
+## Safety Checklist
+
+- Default to TLS/mTLS for all production traffic.
+- Enforce limits (`MaxRecvMsgSize`, `MaxSendMsgSize`, and metadata size via `MaxHeaderListSize`) to reduce resource exhaustion.
+- Treat client-sent metadata as untrusted; validate and allowlist keys used for auth or tenant routing.
+- Disable gRPC reflection in production to avoid exposing internal service schemas.
+- Check `ctx.Done()` in every iteration of a streaming handler to prevent goroutine leaks.
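The checklist item on untrusted metadata could look roughly like the following sketch. It uses a plain `map[string][]string`, which is the underlying type of grpc-go's `metadata.MD`, so it runs with the standard library alone. The key names (`x-tenant-id`, `x-request-id`), the tenant ID pattern, and the helpers `filterMetadata` and `tenantID` are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// allowedKeys is a hypothetical allowlist of metadata keys the service reads;
// every other client-sent key is dropped before any routing decision.
var allowedKeys = map[string]bool{
	"x-tenant-id":  true,
	"x-request-id": true,
}

// tenantIDPattern is an assumed format for tenant identifiers.
var tenantIDPattern = regexp.MustCompile(`^[a-z0-9-]{1,64}$`)

// filterMetadata returns a copy of md containing only allowlisted keys.
// md has the same shape as metadata.MD (map[string][]string).
func filterMetadata(md map[string][]string) map[string][]string {
	out := make(map[string][]string)
	for k, vals := range md {
		key := strings.ToLower(k)
		if allowedKeys[key] {
			out[key] = append(out[key], vals...)
		}
	}
	return out
}

// tenantID extracts and validates exactly one tenant ID from filtered metadata.
func tenantID(md map[string][]string) (string, error) {
	vals := md["x-tenant-id"]
	if len(vals) != 1 {
		return "", fmt.Errorf("expected exactly one x-tenant-id, got %d", len(vals))
	}
	if !tenantIDPattern.MatchString(vals[0]) {
		return "", errors.New("malformed x-tenant-id")
	}
	return vals[0], nil
}

func main() {
	incoming := map[string][]string{
		"x-tenant-id":   {"acme-prod"},
		"x-internal-op": {"true"}, // not allowlisted; dropped
	}
	safe := filterMetadata(incoming)
	id, err := tenantID(safe)
	fmt.Println(len(safe), id, err)
}
```

In a real service this logic would likely sit in a unary or stream interceptor that reads `metadata.FromIncomingContext(ctx)` and rejects invalid values with `codes.InvalidArgument` or `codes.Unauthenticated`, so that handlers only ever see validated values.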
+
+## Anti-Patterns
+
+| Anti-Pattern | Why It Hurts | Fix |
+| --- | --- | --- |
+| Create new `grpc.ClientConn` per request | Exhausts OS sockets and disables HTTP/2 multiplexing, causing high latency and resource leaks | Initialize once, reuse globally |
+| Mix Protobuf v1 and v2 libraries | Can cause subtle marshaling bugs; the v1 and v2 `proto` packages define different `Message` interfaces and are not drop-in interchangeable | Pin to `google.golang.org/protobuf` (v2) throughout |
+| Expose raw internal error strings to clients | Leaks stack traces and internal service names; a security and UX risk | Map errors with `status.Errorf` using appropriate gRPC codes |
+| Ignore `ctx.Done()` in streaming handlers | Goroutine and connection leak when client disconnects | Select on `ctx.Done()` in every iteration of a streaming loop |
+| Skip error handling with `_ =` | Hides failures silently; production outages become undiagnosable | Always check and handle errors explicitly |
+| Use `grpc.Dial` without health checks | `grpc.Dial` is non-blocking by default, so connection failures surface only on the first RPC | Use health checks and monitor connection state |
+
+> **Migration note**: `grpc.NewClient` was added in gRPC-Go v1.63 (April 2024) and is the API the gRPC-Go project recommends for new code. With older versions (or when following existing codebases and official grpc.io examples), `grpc.Dial` / `grpc.DialContext` is still common. Unlike `grpc.Dial`, `grpc.NewClient` defaults to the `dns` resolver, so custom-dialer targets such as the bufconn test need a `passthrough:///` prefix.