Specification: Locking

Spec ID: tower-fleet/lock/v1
Status: Normative
Version: 1.0.0
Created: 2025-12-18

Abstract

This specification defines the locking mechanism for tower-fleet intent execution. It establishes lock identity, file format, acquisition/release semantics, heartbeat protocol, and stale lock recovery.

Conformance

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.


1. Goals

  1. Prevent concurrent mutation - Two executions MUST NOT mutate the same resource simultaneously
  2. Survive crashes - Locks MUST be recoverable after executor failure
  3. Be diagnosable - Lock state MUST be inspectable and understandable
  4. Support timeouts - Stale locks MUST be detectable and recoverable

2. Lock Identity

2.1 Lock Name

The lock name is derived from the intent's controls.lock field after template resolution.

spec:
  controls:
    lock: "${params.app}-${params.environment}"

With params.app = "money-tracker" and params.environment = "production", the resolved lock name is money-tracker-production.

2.2 Lock Name Constraints

Lock names MUST:

  - Contain only: [a-z0-9_-]
  - Be between 1 and 128 characters
  - Be filesystem-safe (no path separators)

Lock names MUST NOT:

  - Contain uppercase letters
  - Contain spaces or special characters
  - Start or end with a hyphen or underscore
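
The constraints above can be checked with a single anchored pattern. This is a minimal sketch, assuming bash; validate_lock_name is a hypothetical helper name, not something this spec defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the name satisfies section 2.2, 1 otherwise.
validate_lock_name() {
  local name="$1"
  # Force the C locale so [a-z] cannot match uppercase letters under
  # locale-dependent collation rules.
  local LC_ALL=C
  # First and last character must be alphanumeric; the middle may also
  # contain '-' and '_'. Total length is 1 to 128 characters.
  local pattern='^[a-z0-9]([a-z0-9_-]{0,126}[a-z0-9])?$'
  [[ "$name" =~ $pattern ]]
}
```

Because a path separator is not in the allowed character set, a name that passes this check is also safe to embed in the lock file path.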

2.3 Lock File Path

Lock files are stored at:

/root/tower-fleet/logs/locks/<lock_name>.lock

Example: /root/tower-fleet/logs/locks/money-tracker-production.lock


3. Lock File Format

3.1 Schema

Lock files MUST contain valid JSON with the following structure:

{
  "lock_version": "v1",
  "lock_name": "money-tracker-production",
  "request_id": "req_abc123def456",
  "actor": "claude-code",
  "intent": "deploy-app",
  "intent_version": "1.2.0",
  "host_id": "tower-01",
  "pid": 12345,
  "created_at": "2025-12-18T10:30:10Z",
  "last_heartbeat_at": "2025-12-18T10:30:40Z",
  "ttl_seconds": 900,
  "metadata": {}
}

3.2 Field Definitions

Field              Type     Required  Description
lock_version       string   MUST      Always "v1" for this spec
lock_name          string   MUST      The resolved lock name
request_id         string   MUST      Unique identifier for this execution
actor              string   MUST      Who initiated the execution
intent             string   MUST      Intent name being executed
intent_version     string   MUST      Version of the intent definition
host_id            string   MUST      Identifier of the executing host
pid                integer  MUST      Process ID of the executor
created_at         string   MUST      ISO-8601 UTC timestamp of lock creation
last_heartbeat_at  string   MUST      ISO-8601 UTC timestamp of last heartbeat
ttl_seconds        integer  MUST      Maximum seconds between heartbeats before stale
metadata           object   MAY       Additional context (optional)
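
A reader of a lock file can verify these fields before trusting its contents. The following is a sketch only, assuming jq is installed; validate_lock_file is a hypothetical name, and the requirement that ttl_seconds be positive is an assumption rather than something the table states.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the file parses as JSON and every
# required v1 field is present with the expected type.
validate_lock_file() {
  local lock_path="$1"
  jq -e '
    type == "object"
    and .lock_version == "v1"
    and ([.lock_name, .request_id, .actor, .intent, .intent_version,
          .host_id, .created_at, .last_heartbeat_at]
         | all(type == "string"))
    and (.pid | type == "number" and . == floor)
    # Assumption: a TTL of zero or less is treated as invalid.
    and (.ttl_seconds | type == "number" and . == floor and . > 0)
    and ((has("metadata") | not) or (.metadata | type == "object"))
  ' "$lock_path" >/dev/null 2>&1
}
```

The function returns non-zero for unparseable input as well as for missing or mistyped fields, so callers can treat any failure as "do not trust this lock file".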

3.3 Default TTL

If not specified in the intent, the default TTL is 900 seconds (15 minutes).

Intents MAY override via:

spec:
  timeouts:
    lock_ttl: 1800  # 30 minutes for long-running operations


4. Acquisition Semantics

4.1 Acquisition Flow

┌─────────────────┐
│  Check if lock  │
│  file exists    │
└────────┬────────┘
    ┌────┴────┐
    │ Exists? │
    └────┬────┘
    ┌────┴────┐        ┌──────────────┐
    │   Yes   │───────►│ Check stale  │
    └─────────┘        └──────┬───────┘
         │                    │
         │              ┌─────┴─────┐
         │              │  Stale?   │
         │              └─────┬─────┘
         │                    │
         │         ┌──────────┴──────────┐
         │         │                     │
         │    ┌────┴────┐          ┌─────┴─────┐
         │    │   Yes   │          │    No     │
         │    └────┬────┘          └─────┬─────┘
         │         │                     │
         │         ▼                     ▼
         │    ┌─────────┐          ┌───────────┐
         │    │--force? │          │  BLOCKED  │
         │    └────┬────┘          └───────────┘
         │         │
         │    ┌────┴────┐
         │    │   Yes   │──────► Steal lock
         │    └─────────┘
         │         │
         │    ┌────┴────┐
         │    │   No    │──────► BLOCKED (suggest --force)
         │    └─────────┘
    ┌────┴────┐
    │   No    │
    └────┬────┘
┌─────────────────┐
│ Create lock     │
│ atomically      │
└────────┬────────┘
┌─────────────────┐
│ Emit audit:     │
│ lock_acquired   │
└─────────────────┘

4.2 Atomic Creation

Lock creation MUST be atomic to prevent race conditions.

Method 1: Exclusive file creation (recommended)

# Fails if file exists
set -o noclobber
echo "$lock_json" > "$lock_path"
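
A minimal sketch of Method 1 wrapped in a function, assuming bash; create_lock_exclusive is a hypothetical name. Running the redirect in a subshell keeps noclobber from staying enabled in the calling shell.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around Method 1: returns 0 only if this call
# created the lock file.
create_lock_exclusive() {
  local lock_path="$1" lock_json="$2"
  # With noclobber set, the redirect fails if the file already exists.
  # The subshell confines the option to this one command.
  ( set -o noclobber; printf '%s\n' "$lock_json" > "$lock_path" ) 2>/dev/null
}
```

A caller would treat a non-zero return as "another executor holds the lock" and go on to the stale check rather than retrying the write.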

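Method 2: Hard link from temp

temp_path="${lock_path}.${request_id}.tmp"
echo "$lock_json" > "$temp_path"
# ln fails if the target exists; some mv -n versions exit 0 when they skip it.
ln "$temp_path" "$lock_path" || { rm -f "$temp_path"; exit 1; }
rm -f "$temp_path"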

4.3 Acquisition Audit Event

On successful acquisition, emit:

{
  "event": "lock_acquired",
  "request_id": "req_abc123",
  "timestamp": "2025-12-18T10:30:10Z",
  "lock_name": "money-tracker-production",
  "lock_path": "/root/tower-fleet/logs/locks/money-tracker-production.lock",
  "ttl_seconds": 900
}

4.4 Blocked Response

When blocked by an active lock, the executor MUST:

  1. NOT proceed with execution
  2. Return error with lock holder information:
{
  "error": "lock_blocked",
  "lock_name": "money-tracker-production",
  "held_by": {
    "request_id": "req_xyz789",
    "actor": "claude-code",
    "intent": "deploy-app",
    "created_at": "2025-12-18T10:25:00Z",
    "last_heartbeat_at": "2025-12-18T10:29:45Z"
  },
  "suggestion": "Wait for completion or use --force-lock if stale"
}
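
The blocked response can be derived directly from the existing lock file. A sketch, assuming jq is available; blocked_response is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a lock_blocked error for the lock at $1.
blocked_response() {
  local lock_path="$1"
  # {request_id, actor, ...} copies those fields from the lock file as-is.
  jq '{
    error: "lock_blocked",
    lock_name: .lock_name,
    held_by: {request_id, actor, intent, created_at, last_heartbeat_at},
    suggestion: "Wait for completion or use --force-lock if stale"
  }' "$lock_path"
}
```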

5. Heartbeat Protocol

5.1 Purpose

Heartbeats allow detection of crashed executors by updating last_heartbeat_at periodically.

5.2 Heartbeat Interval

The executor MUST update the heartbeat at least once every:

heartbeat_interval = ttl_seconds / 3

For the default TTL of 900s, this means a heartbeat at least every 300s (5 minutes).

RECOMMENDED: Use a shorter interval of 15-30 seconds so that several missed heartbeats can be tolerated before the lock appears stale.
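
One way to combine the required bound with the recommended range is to take ttl_seconds / 3 and cap it. This is a sketch under that assumption; heartbeat_interval_for and its 30-second default cap are illustrative choices, not part of the spec.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the heartbeat interval in seconds for a TTL.
# $1 is ttl_seconds; $2 is an optional cap (default 30, the top of the
# recommended 15-30 second range).
heartbeat_interval_for() {
  local ttl="$1" cap="${2:-30}"
  local interval=$(( ttl / 3 ))
  # Never go below one second, even for very small TTLs.
  if (( interval < 1 )); then
    interval=1
  fi
  if (( interval > cap )); then
    interval=$cap
  fi
  echo "$interval"
}
```

With these assumptions, `heartbeat_interval_for 900` prints 30, and `heartbeat_interval_for 60` prints 20 because ttl_seconds / 3 is already below the cap.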

5.3 Heartbeat Update

Heartbeat updates MUST:

  1. Be atomic (write temp then rename, or in-place with flock)
  2. Update only the last_heartbeat_at field
  3. Preserve all other fields

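update_heartbeat() {
  local lock_path="$1"
  local temp_path="${lock_path}.heartbeat.tmp"

  # Replace the lock file only if jq succeeded; otherwise a failed write
  # would overwrite the lock with an empty or partial file.
  if jq --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        '.last_heartbeat_at = $ts' \
        "$lock_path" > "$temp_path"; then
    mv "$temp_path" "$lock_path"
  else
    rm -f "$temp_path"
    return 1
  fi
}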

5.4 Heartbeat Failure

If heartbeat update fails:

  1. Log warning but continue execution
  2. Retry on next interval
  3. After 3 consecutive failures, emit heartbeat_failed audit event
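
The failure policy amounts to a counter that resets on success. A minimal sketch, assuming bash; HEARTBEAT_FAILURES, record_heartbeat_result and emit_audit_event are hypothetical names, and the audit sink here is a stand-in that prints to stdout.

```shell
#!/usr/bin/env bash
# Hypothetical state: number of heartbeat failures in a row.
HEARTBEAT_FAILURES=0

# Stand-in for the executor's audit sink: prints the event name to stdout.
emit_audit_event() {
  echo "audit: $1"
}

# Call after each heartbeat attempt with that attempt's exit status.
record_heartbeat_result() {
  local status="$1"
  if [[ "$status" -eq 0 ]]; then
    HEARTBEAT_FAILURES=0
    return 0
  fi
  HEARTBEAT_FAILURES=$(( HEARTBEAT_FAILURES + 1 ))
  echo "WARN: heartbeat update failed (${HEARTBEAT_FAILURES} in a row)" >&2
  # Emit once, when the third consecutive failure is reached.
  if [[ "$HEARTBEAT_FAILURES" -eq 3 ]]; then
    emit_audit_event "heartbeat_failed"
  fi
  # Always return 0 so the caller continues execution.
  return 0
}
```

A heartbeat loop would call the update function, pass its exit status to record_heartbeat_result, and then sleep until the next interval.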


6. Stale Detection

6.1 Definition

A lock is stale if:

current_time - last_heartbeat_at > ttl_seconds

6.2 Stale Check Implementation

is_lock_stale() {
  local lock_path="$1"

  local last_heartbeat
  last_heartbeat=$(jq -r '.last_heartbeat_at' "$lock_path")

  local ttl
  ttl=$(jq -r '.ttl_seconds' "$lock_path")

  local last_epoch
  last_epoch=$(date -d "$last_heartbeat" +%s)

  local now_epoch
  now_epoch=$(date +%s)

  local age=$((now_epoch - last_epoch))

  if [[ $age -gt $ttl ]]; then
    echo "stale"
    return 0
  else
    echo "active"
    return 1
  fi
}

6.3 Stale Lock Handling

When a stale lock is detected:

Default behavior (no --force-lock):

{
  "error": "lock_stale",
  "lock_name": "money-tracker-production",
  "stale_since": "2025-12-18T10:15:00Z",
  "age_seconds": 1800,
  "ttl_seconds": 900,
  "held_by": {
    "request_id": "req_old123",
    "actor": "claude-code",
    "host_id": "tower-01",
    "pid": 12345
  },
  "suggestion": "Use --force-lock to acquire (stale lock will be logged)"
}

With --force-lock:

  1. Read existing lock metadata
  2. Delete existing lock file
  3. Create new lock
  4. Emit lock_stolen audit event


7. Lock Stealing

7.1 When Allowed

Lock stealing is ONLY allowed when:

  1. Lock is stale (heartbeat expired), AND
  2. --force-lock flag is provided

Lock stealing is NEVER allowed for active (non-stale) locks.

7.2 Stolen Lock Audit Event

{
  "event": "lock_stolen",
  "request_id": "req_new456",
  "timestamp": "2025-12-18T10:45:00Z",
  "lock_name": "money-tracker-production",
  "previous_lock": {
    "request_id": "req_old123",
    "actor": "claude-code",
    "intent": "deploy-app",
    "created_at": "2025-12-18T10:00:00Z",
    "last_heartbeat_at": "2025-12-18T10:15:00Z",
    "host_id": "tower-01",
    "pid": 12345
  },
  "previous_lock_hash": "sha256:abc123...",
  "reason": "stale_lock_forced"
}
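
The reference script in section 10.2 calls steal_lock without defining it. One possible shape is sketched below, assuming jq and sha256sum are available. Treating previous_lock_hash as the SHA-256 of the lock file's bytes is an assumption, since the spec does not say what is hashed.

```shell
#!/usr/bin/env bash
# Hypothetical steal_lock: prints a lock_stolen event for the lock at $1,
# then removes the file so the caller can create its own lock.
# Assumes the caller has already confirmed the lock is stale and that
# --force-lock was given.
steal_lock() {
  local lock_path="$1" request_id="$2"
  local hash now
  # Assumption: previous_lock_hash is the SHA-256 of the lock file's bytes.
  hash="sha256:$(sha256sum "$lock_path" | cut -d' ' -f1)"
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)

  # Fields inside previous_lock are copied from the old lock file;
  # $request_id is the new executor's id.
  jq --arg request_id "$request_id" --arg ts "$now" --arg hash "$hash" '{
    event: "lock_stolen",
    request_id: $request_id,
    timestamp: $ts,
    lock_name: .lock_name,
    previous_lock: {request_id, actor, intent, created_at,
                    last_heartbeat_at, host_id, pid},
    previous_lock_hash: $hash,
    reason: "stale_lock_forced"
  }' "$lock_path" || return 1

  rm -f "$lock_path"
}
```

This sketch does not close the window in which two --force-lock callers both observe the same stale lock and one deletes the other's freshly created file. An implementation that needs to rule that out would have to serialize the steal itself, for example with flock.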

8. Release Semantics

8.1 Normal Release

On successful completion or controlled abort:

  1. Remove lock file
  2. Emit lock_released audit event
{
  "event": "lock_released",
  "request_id": "req_abc123",
  "timestamp": "2025-12-18T10:45:00Z",
  "lock_name": "money-tracker-production",
  "held_duration_seconds": 900,
  "result": "success"
}

8.2 Release on Failure

If execution fails:

  1. Attempt rollback (if defined)
  2. Release lock regardless of rollback outcome
  3. Emit lock_released with failure info
{
  "event": "lock_released",
  "request_id": "req_abc123",
  "timestamp": "2025-12-18T10:45:00Z",
  "lock_name": "money-tracker-production",
  "held_duration_seconds": 600,
  "result": "failure",
  "failure_step": "deploy_manifests"
}
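
Both release events can be built from the lock file before it is removed. A sketch, assuming jq is available; released_event is a hypothetical name, and computing held_duration_seconds as the time elapsed since created_at is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a lock_released event.
# Arguments: lock file path, result ("success" or "failure"), and an
# optional failing step name. Call it before removing the lock file.
released_event() {
  local lock_path="$1" result="$2" failure_step="${3:-}"
  jq --arg result "$result" --arg step "$failure_step" '
    {
      event: "lock_released",
      request_id: .request_id,
      timestamp: (now | floor | todateiso8601),
      lock_name: .lock_name,
      # Assumption: duration is measured from created_at to now.
      held_duration_seconds: ((now | floor) - (.created_at | fromdateiso8601)),
      result: $result
    }
    # failure_step is only present when a step name was passed.
    + (if $step != "" then {failure_step: $step} else {} end)
  ' "$lock_path"
}
```

For example, `released_event "$lock_path" failure deploy_manifests` yields an event with the same fields as the failure case in 8.2.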

8.3 Release Failure

If lock file cannot be removed:

  1. Log error with errno/message
  2. Emit lock_release_failed audit event
  3. Continue with execution result reporting
{
  "event": "lock_release_failed",
  "request_id": "req_abc123",
  "timestamp": "2025-12-18T10:45:00Z",
  "lock_name": "money-tracker-production",
  "lock_path": "/root/tower-fleet/logs/locks/money-tracker-production.lock",
  "error": "EACCES: permission denied",
  "action": "manual_cleanup_required"
}
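
The three steps above can be sketched as a wrapper around rm that captures the error text. This assumes jq is available; remove_lock_file is a hypothetical name, and the event is printed to stdout as a stand-in for the real audit sink.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tries to remove the lock file. On failure it logs
# the error, prints a lock_release_failed event, and still returns 0 so
# that result reporting can continue.
remove_lock_file() {
  local lock_path="$1" request_id="$2" lock_name="$3"
  local err
  # Capture rm's error message so it can go into the audit event.
  if err=$(rm -- "$lock_path" 2>&1); then
    return 0
  fi
  echo "ERROR: could not remove ${lock_path}: ${err}" >&2
  jq -n --arg request_id "$request_id" --arg lock_name "$lock_name" \
        --arg lock_path "$lock_path" --arg error "$err" \
        --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" '{
    event: "lock_release_failed",
    request_id: $request_id,
    timestamp: $ts,
    lock_name: $lock_name,
    lock_path: $lock_path,
    error: $error,
    action: "manual_cleanup_required"
  }'
  return 0
}
```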

9. Mode-Based Locking Requirements

9.1 Locking by Intent Mode

Mode       Lock Required                          Rationale
observe    MUST NOT lock                          Read-only operations are safe to parallelize
mutate     MUST lock if controls.lock is defined  Mutations need serialization
reconcile  SHOULD lock                            Prevents conflicting reconciliation

9.2 Observe Mode Bypass

Intents with mode: observe:

  - MUST NOT acquire locks
  - MUST NOT be blocked by existing locks
  - MAY read lock status for reporting

9.3 Missing Lock Configuration

For mode: mutate intents without controls.lock:

  - Executor SHOULD warn: "Mutate intent without lock - concurrent execution possible"
  - Execution MAY proceed (operator's choice)
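
The mode rules in this section reduce to a small decision function. A sketch, assuming bash; lock_decision is a hypothetical name, and locking reconcile intents only when a lock name is configured is one reading of "SHOULD lock".

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "acquire" or "skip" for an intent.
# $1 is the intent mode, $2 is the resolved controls.lock value (may be empty).
lock_decision() {
  local mode="$1" lock_name="${2:-}"
  case "$mode" in
    observe)
      # Observe intents never lock and are never blocked.
      echo "skip" ;;
    mutate)
      if [[ -n "$lock_name" ]]; then
        echo "acquire"
      else
        echo "WARN: Mutate intent without lock - concurrent execution possible" >&2
        echo "skip"
      fi ;;
    reconcile)
      # SHOULD lock: this sketch locks whenever a lock name is configured.
      if [[ -n "$lock_name" ]]; then echo "acquire"; else echo "skip"; fi ;;
    *)
      echo "ERROR: unknown mode: $mode" >&2
      return 1 ;;
  esac
}
```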


10. Implementation Reference

10.1 Directory Setup

mkdir -p /root/tower-fleet/logs/locks
chmod 700 /root/tower-fleet/logs/locks

10.2 Lock Manager Script

#!/bin/bash
# /root/tower-fleet/scripts/lock-manager.sh

set -euo pipefail

LOCK_DIR="/root/tower-fleet/logs/locks"
DEFAULT_TTL=900
HEARTBEAT_INTERVAL=30

acquire_lock() {
  local lock_name="$1"
  local request_id="$2"
  local intent="$3"
  local intent_version="$4"
  local force="${5:-false}"

  local lock_path="${LOCK_DIR}/${lock_name}.lock"

  # Check existing lock
  if [[ -f "$lock_path" ]]; then
    if is_lock_stale "$lock_path"; then
      if [[ "$force" == "true" ]]; then
        steal_lock "$lock_path" "$request_id"
      else
        echo "ERROR: Lock is stale. Use --force-lock to acquire." >&2
        cat "$lock_path" >&2
        return 1
      fi
    else
      echo "ERROR: Lock is held by active process." >&2
      cat "$lock_path" >&2
      return 1
    fi
  fi

  # Create lock atomically
  local lock_json
  lock_json=$(create_lock_json "$lock_name" "$request_id" "$intent" "$intent_version")

  local temp_path="${lock_path}.${request_id}.tmp"
  echo "$lock_json" > "$temp_path"

  # ln fails atomically if the lock exists; mv -n can exit 0 without moving.
  if ! ln "$temp_path" "$lock_path" 2>/dev/null; then
    rm -f "$temp_path"
    echo "ERROR: Failed to acquire lock (race condition)" >&2
    return 1
  fi
  rm -f "$temp_path"

  echo "Lock acquired: $lock_name"
  return 0
}

release_lock() {
  local lock_name="$1"
  local request_id="$2"

  local lock_path="${LOCK_DIR}/${lock_name}.lock"

  if [[ ! -f "$lock_path" ]]; then
    echo "WARN: Lock file not found: $lock_path" >&2
    return 0
  fi

  # Verify we own the lock
  local held_by
  held_by=$(jq -r '.request_id' "$lock_path")

  if [[ "$held_by" != "$request_id" ]]; then
    echo "ERROR: Lock held by different request: $held_by" >&2
    return 1
  fi

  rm -f "$lock_path"
  echo "Lock released: $lock_name"
  return 0
}

create_lock_json() {
  local lock_name="$1"
  local request_id="$2"
  local intent="$3"
  local intent_version="$4"

  local now
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)

  jq -n \
    --arg lock_version "v1" \
    --arg lock_name "$lock_name" \
    --arg request_id "$request_id" \
    --arg actor "${ACTOR:-unknown}" \
    --arg intent "$intent" \
    --arg intent_version "$intent_version" \
    --arg host_id "$(hostname)" \
    --argjson pid "$$" \
    --arg created_at "$now" \
    --arg last_heartbeat_at "$now" \
    --argjson ttl_seconds "$DEFAULT_TTL" \
    '{
      lock_version: $lock_version,
      lock_name: $lock_name,
      request_id: $request_id,
      actor: $actor,
      intent: $intent,
      intent_version: $intent_version,
      host_id: $host_id,
      pid: $pid,
      created_at: $created_at,
      last_heartbeat_at: $last_heartbeat_at,
      ttl_seconds: $ttl_seconds,
      metadata: {}
    }'
}

is_lock_stale() {
  local lock_path="$1"

  local last_heartbeat ttl last_epoch now_epoch age
  last_heartbeat=$(jq -r '.last_heartbeat_at' "$lock_path")
  ttl=$(jq -r '.ttl_seconds' "$lock_path")
  last_epoch=$(date -d "$last_heartbeat" +%s 2>/dev/null || echo 0)
  now_epoch=$(date +%s)
  age=$((now_epoch - last_epoch))

  [[ $age -gt $ttl ]]
}

# Export functions for use in executor (acquire_lock calls create_lock_json)
export -f acquire_lock release_lock create_lock_json is_lock_stale

11. Operational Commands

11.1 List Active Locks

# List all locks
ls -la /root/tower-fleet/logs/locks/*.lock 2>/dev/null

# Show lock details
for lock in /root/tower-fleet/logs/locks/*.lock; do
  echo "=== $(basename "$lock") ==="
  jq . "$lock"
done

11.2 Check Specific Lock

# Check if lock is stale
./lock-manager.sh check money-tracker-production
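
The reference script in section 10.2 does not show how a check subcommand is handled. As a sketch of what such a check might do, assuming jq is available, the staleness rule from section 6.1 can be evaluated with jq's own date functions; check_lock is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check: prints "stale" or "active" for a lock file, using
# jq's date functions instead of GNU date -d.
check_lock() {
  local lock_path="$1"
  jq -r '
    if (now - (.last_heartbeat_at | fromdateiso8601)) > .ttl_seconds
    then "stale" else "active" end
  ' "$lock_path"
}
```

It could be invoked as `check_lock /root/tower-fleet/logs/locks/money-tracker-production.lock`.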

11.3 Force Release (Emergency)

# Manual force release (audit this!)
rm /root/tower-fleet/logs/locks/money-tracker-production.lock

# Prefer: use executor with --force-lock for proper audit trail

12. Security Considerations

12.1 Lock Directory Permissions

The lock directory SHOULD:

  - Be owned by the executor user
  - Have mode 700 (rwx------)
  - Not be world-writable

12.2 Lock File Integrity

Lock files:

  - SHOULD be validated on read (JSON parse + schema check)
  - MUST NOT be trusted blindly (validate request_id format, timestamps)

12.3 PID Verification

The pid field is informational only. Do NOT use it for:

  - Process existence checks (PIDs can be recycled)
  - Kill signals (dangerous across hosts)

Use heartbeat expiry as the authority for staleness.