···

 **Sync Architecture**: The operator implements **unidirectional sync from HSM to Kubernetes Secrets only**. HSM is the authoritative source of truth. K8s Secrets are read-only replicas that get updated when HSM data changes. There is no K8s → HSM sync functionality.
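
To make the one-way direction concrete, here is a minimal sketch of an HSM-to-replica sync over in-memory maps. The function `syncFromHSM` and the map-based model are hypothetical and not taken from the operator. Deleting replica keys that are absent from the HSM is also an assumption about the intended behavior.

```go
package main

import (
	"bytes"
	"fmt"
)

// syncFromHSM (hypothetical) makes replica match hsmData and reports whether
// the replica changed. It only reads hsmData; nothing is written back.
func syncFromHSM(hsmData, replica map[string][]byte) bool {
	changed := false
	for key, value := range hsmData {
		current, ok := replica[key]
		if !ok || !bytes.Equal(current, value) {
			// Copy the value so the replica never aliases HSM-side memory.
			replica[key] = append([]byte{}, value...)
			changed = true
		}
	}
	// Assumption: keys that no longer exist in the HSM are dropped.
	for key := range replica {
		if _, ok := hsmData[key]; !ok {
			delete(replica, key)
			changed = true
		}
	}
	return changed
}

func main() {
	hsmData := map[string][]byte{"token": []byte("v2")}
	replica := map[string][]byte{"token": []byte("v1")}
	fmt.Println(syncFromHSM(hsmData, replica), string(replica["token"])) // true v2
	fmt.Println(syncFromHSM(hsmData, replica))                           // false
}
```

A second call with unchanged HSM data is a no-op, which is the property a reconciler would rely on to avoid needless Secret updates.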

+**GitOps Deployment**: This Kubernetes deployment is GitOps-based. You are not able to push new images to the Kubernetes cluster from this machine. Code changes require updating the deployment through the GitOps pipeline.
+
+**Code Modernization**: Avoid keeping legacy code whenever possible. Replace and improve functions as necessary rather than maintaining backward compatibility with outdated patterns.
+
 ## Project Overview

 A Kubernetes operator that bridges Hardware Security Module (HSM) data storage with Kubernetes Secrets, providing true secret portability through hardware-based security. The operator implements a controller pattern that synchronizes HSM binary data files to Kubernetes Secret objects using a unified binary architecture with gRPC communication, automatic USB device discovery, and dynamic agent deployment.
···
 **Race-Free Coordination:**
 - HSMDevice CRDs contain readonly specifications only (no status field)
 - Discovery pods report via their own pod annotations
-- HSMPool CRDs aggregate all discovery reports from multiple nodes
+- **HSMPool CRDs are the source of truth** for agent discovery and multi-device operations
+- HSMPool aggregates all discovery reports from multiple nodes
 - Owner references ensure automatic cleanup when resources are deleted
 - 5-minute grace periods prevent agent churn during outages

+**Multi-Device Agent Architecture:**
+- **HSMPool-based Agent Discovery**: API and controllers query HSMPool to find all agent instances for a device type
+- **Multiple Agent Instances**: Each physical device gets its own agent pod (e.g., `hsm-agent-pico-hsm-0`, `hsm-agent-pico-hsm-1`)
+- **Multi-Agent Operations**: API operations (list, write, delete) work across all agents when mirroring is enabled
+- **Automatic Synchronization**: HSMSyncReconciler handles conflict detection and resolution between devices
+
 **gRPC Communication Architecture:**
 - Protocol definition in `api/proto/hsm/v1/hsm.proto` with 10 HSM operations
 - Manager ↔ Agent: gRPC for efficient, type-safe HSM operations
 - Discovery → Manager: Pod annotations for race-free device reporting
-- External → Manager: REST API proxy routing to appropriate agents
+- **External → Manager**: REST API proxy routes to ALL agents for multi-device operations
 - Generated code: `api/proto/hsm/v1/hsm.pb.go` and `hsm_grpc.pb.go`

 **Controller Hierarchy:**
···
 ├── HSMSecretReconciler - HSM to K8s Secret sync
 ├── HSMPoolReconciler - Aggregates discovery reports from pod annotations
 ├── HSMPoolAgentReconciler - Deploys agents when pools are ready
+├── HSMSyncReconciler - Multi-device HSM synchronization and conflict resolution
 └── DiscoveryDaemonSetReconciler - Manages discovery DaemonSet lifecycle

 Discovery Controllers:
···
 # Monitor sync status
 kubectl get hsmsecret my-secret -o jsonpath='{.status.syncStatus}'

-# Check discovered devices
+# Check discovered devices in HSMPool (source of truth for agents)
 kubectl get hsmpool -o jsonpath='{.status.aggregatedDevices[*].devicePath}'
+
+# Check HSMPool readiness for agent deployment
+kubectl get hsmpool -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DEVICES:.status.totalDevices
+
+# View all agent pods for multi-device setup
+kubectl get pods -l app.kubernetes.io/name=hsm-agent

 # View discovery pod reports
 kubectl get pods -l app.kubernetes.io/component=discovery \
···
 2. `HSMPoolReconciler` aggregates device discovery reports from pod annotations (race-free)
 3. `HSMPoolAgentReconciler` deploys agents dynamically when devices are ready
 4. `HSMSyncReconciler` handles multi-device HSM synchronization (HSM ↔ HSM only)
+
+**Agent Discovery Architecture:**
+- **HSMPool as Source of Truth**: API and controllers query HSMPool.Status.AggregatedDevices instead of individual HSMDevice resources
+- **Multi-Instance Agent Tracking**: Agent manager tracks agents by keys like `pico-hsm-0`, `pico-hsm-1` for multiple physical devices
+- **Pool-Based Cleanup**: Agent cleanup based on HSMSecret existence rather than device-specific references
+- **API Multi-Device Operations**:
+  - `findAvailableAgent()` queries HSMPools to find any available agent for a device type
+  - `getAllAvailableAgents()` returns all agents across all pools for mirroring operations
+  - Operations like delete/write with `mirror=true` target ALL agents simultaneously
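
The difference between picking one agent and targeting all of them can be sketched as below. The `agent` type and `selectAgents` function are hypothetical stand-ins. They are not the operator's `findAvailableAgent()` or `getAllAvailableAgents()`, whose signatures do not appear in this diff.

```go
package main

import (
	"fmt"
	"sort"
)

// agent is a hypothetical, simplified stand-in for an agent record.
type agent struct {
	Key   string // e.g. "pico-hsm-0"
	Ready bool
}

// selectAgents (hypothetical) returns the agent keys an operation should
// target: every ready agent when mirror is true, otherwise only one.
func selectAgents(agents []agent, mirror bool) []string {
	var keys []string
	for _, a := range agents {
		if a.Ready {
			keys = append(keys, a.Key)
		}
	}
	sort.Strings(keys) // deterministic choice in the single-agent case
	if !mirror && len(keys) > 1 {
		keys = keys[:1]
	}
	return keys
}

func main() {
	agents := []agent{{"pico-hsm-1", true}, {"pico-hsm-0", true}, {"pico-hsm-2", false}}
	fmt.Println(selectAgents(agents, false)) // [pico-hsm-0]
	fmt.Println(selectAgents(agents, true))  // [pico-hsm-0 pico-hsm-1]
}
```

Sorting before truncating is a design choice of this sketch: it makes the non-mirrored path pick the same agent on every call, which is easier to debug than depending on slice or map order.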

 **PKCS#11 Client Implementation:**
 - Production: `internal/hsm/pkcs11_client.go` with CGO
+31-78
internal/agent/deployment.go
···
 		return fmt.Errorf("failed to list HSMSecrets: %w", err)
 	}

-	// Count references to this device
-	references := 0
-	for _, secret := range hsmSecretList.Items {
-		if m.secretReferencesDevice(&secret, hsmDevice) {
-			references++
-		}
-	}
-
-	// If there are still references, don't cleanup
-	if references > 0 {
+	// In the HSMPool architecture, cleanup should be based on device availability in pool
+	// rather than individual secret references, since all secrets can use any available device
+	// Check if there are any active HSMSecrets - if so, keep the agents running
+	if len(hsmSecretList.Items) > 0 {
 		return nil
 	}

···
 	}

 	return m.Create(ctx, deployment)
-}
-
-// secretReferencesDevice checks if an HSMSecret references the given device
-func (m *Manager) secretReferencesDevice(hsmSecret *hsmv1alpha1.HSMSecret, hsmDevice *hsmv1alpha1.HSMDevice) bool {
-	// This is a simplified check - in practice, you might want more sophisticated logic
-	// to determine which device an HSMSecret should use based on path, device type, etc.
-	_ = hsmSecret // TODO: Use for device preference checks
-	_ = hsmDevice // TODO: Use for device type compatibility
-
-	// For now, assume any HSMSecret could use any available device of the right type
-	// A more sophisticated implementation might check:
-	// - HSMSecret annotations for device preferences
-	// - Path-based device mapping
-	// - Device type compatibility
-
-	return true // Simplified for initial implementation
 }

 // buildAgentEnv builds environment variables for the HSM agent
···
 	return runningPods > 0
 }

-// GetAgentPodIPs returns the pod IPs for a device (for direct gRPC connections)
-func (m *Manager) GetAgentPodIPs(deviceName string) ([]string, error) {
+// GetAgentPodIPs returns all agent pod IPs for a device type from HSMPool
+func (m *Manager) GetAgentPodIPs(ctx context.Context, deviceName, namespace string) ([]string, error) {
+	// Get HSMPool for this device
+	poolName := deviceName + "-pool"
+	var hsmPool hsmv1alpha1.HSMPool
+	if err := m.Get(ctx, types.NamespacedName{
+		Name:      poolName,
+		Namespace: namespace,
+	}, &hsmPool); err != nil {
+		return nil, fmt.Errorf("failed to get HSMPool %s: %w", poolName, err)
+	}
+
 	m.mu.RLock()
 	defer m.mu.RUnlock()

-	agentInfo, exists := m.activeAgents[deviceName]
-	if !exists {
-		return nil, fmt.Errorf("no active agents found for device %s", deviceName)
+	var allPodIPs []string
+
+	// Collect pod IPs from all agent instances for this device
+	for i := range hsmPool.Status.AggregatedDevices {
+		agentKey := fmt.Sprintf("%s-%d", deviceName, i)
+		if agentInfo, exists := m.activeAgents[agentKey]; exists && len(agentInfo.PodIPs) > 0 {
+			allPodIPs = append(allPodIPs, agentInfo.PodIPs...)
+		}
 	}

-	if len(agentInfo.PodIPs) == 0 {
-		return nil, fmt.Errorf("no pod IPs available for device %s", deviceName)
+	if len(allPodIPs) == 0 {
+		return nil, fmt.Errorf("no active agents found for device %s in pool %s", deviceName, poolName)
 	}

-	return agentInfo.PodIPs, nil
+	return allPodIPs, nil
 }
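
The index-keyed lookup in the new `GetAgentPodIPs` can be exercised in isolation with a small sketch. `collectPodIPs` below is a hypothetical reduction of that loop to plain maps, with no Kubernetes client or locking. It assumes, as the diff does, that agent keys follow the `<deviceName>-<index>` pattern.

```go
package main

import "fmt"

// collectPodIPs (hypothetical) gathers pod IPs for agents keyed
// "<deviceName>-<index>", one key per device reported by the pool.
// Indexes with no registered agent are skipped rather than treated as errors.
func collectPodIPs(active map[string][]string, deviceName string, deviceCount int) []string {
	var ips []string
	for i := 0; i < deviceCount; i++ {
		key := fmt.Sprintf("%s-%d", deviceName, i)
		ips = append(ips, active[key]...) // a missing key yields a nil slice
	}
	return ips
}

func main() {
	active := map[string][]string{
		"pico-hsm-0": {"10.0.0.5"},
		"pico-hsm-2": {"10.0.0.7", "10.0.0.8"},
	}
	fmt.Println(collectPodIPs(active, "pico-hsm", 3)) // [10.0.0.5 10.0.0.7 10.0.0.8]
}
```

Note that the result depends on the pool's device count matching the highest agent index in use. An agent registered under an index beyond that count would be silently ignored.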

 // GetGRPCEndpoints returns gRPC endpoints for all agent pods of a device
-func (m *Manager) GetGRPCEndpoints(deviceName string) ([]string, error) {
-	podIPs, err := m.GetAgentPodIPs(deviceName)
+func (m *Manager) GetGRPCEndpoints(ctx context.Context, deviceName, namespace string) ([]string, error) {
+	podIPs, err := m.GetAgentPodIPs(ctx, deviceName, namespace)
 	if err != nil {
 		return nil, err
 	}
···
 	return endpoints, nil
 }

-// CreateGRPCClients creates gRPC clients for all agent pods of a device
-func (m *Manager) CreateGRPCClients(ctx context.Context, deviceName string, logger logr.Logger) ([]hsm.Client, error) {
-	endpoints, err := m.GetGRPCEndpoints(deviceName)
-	if err != nil {
-		return nil, err
-	}
-
-	clients := make([]hsm.Client, 0, len(endpoints))
-	for _, endpoint := range endpoints {
-		grpcClient, err := NewGRPCClient(endpoint, deviceName, logger)
-		if err != nil {
-			// Clean up any successful connections
-			for _, c := range clients {
-				if err := c.Close(); err != nil {
-					logger.Error(err, "Failed to close gRPC connection during cleanup")
-				}
-			}
-			return nil, fmt.Errorf("failed to create gRPC client for %s: %w", endpoint, err)
-		}
-
-		// Test the connection
-		if err := grpcClient.Initialize(ctx, hsm.Config{}); err != nil {
-			// Clean up any successful connections
-			for _, c := range clients {
-				if err := c.Close(); err != nil {
-					logger.Error(err, "Failed to close gRPC connection during cleanup")
-				}
-			}
-			if err := grpcClient.Close(); err != nil {
-				logger.Error(err, "Failed to close gRPC client during cleanup")
-			}
-			return nil, fmt.Errorf("failed to initialize gRPC client for %s: %w", endpoint, err)
-		}
-
-		clients = append(clients, grpcClient)
-	}
-
-	return clients, nil
-}
-
 // CreateSingleGRPCClient creates a gRPC client for the first available agent pod of a device
-func (m *Manager) CreateSingleGRPCClient(ctx context.Context, deviceName string, logger logr.Logger) (hsm.Client, error) {
-	endpoints, err := m.GetGRPCEndpoints(deviceName)
+func (m *Manager) CreateSingleGRPCClient(ctx context.Context, deviceName, namespace string, logger logr.Logger) (hsm.Client, error) {
+	endpoints, err := m.GetGRPCEndpoints(ctx, deviceName, namespace)
 	if err != nil {
 		return nil, err
 	}