claudify CLI should auto-reattach on gateway restart #2


Open

opened 2026-04-24 15:20:12 +00:00 by qwolff · 0 comments

qwolff commented

2026-04-24 15:20:12 +00:00

Owner

When the gateway pod is rolled (helm upgrade, crash, etc.), in-flight SSH attach connections die. The CLI currently prints the ssh error and exits. For any real use, the CLI should:

1. Detect that the close was unclean (not a user-initiated ~.).
2. Print a short 'connection dropped, retrying...' line to stderr.
3. Poll GET /sessions/<name> until phase == Running and podName is set, with a bounded backoff.
4. Re-launch ssh with the same session name.
5. Give up after N retries or a user-configurable timeout (--retry-timeout 60s flag?).
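The bounded backoff and overall timeout in steps 3 and 5 can be sketched as a pure function that precomputes the sleep intervals. This is only an illustration: `backoff_schedule` is a hypothetical helper, not something in the codebase, and the 1s initial delay in the usage note is an assumption. The 30s cap and 60s timeout are the values floated in this issue.

```rust
use std::time::Duration;

/// Hypothetical helper: the delays to sleep between polls. Each delay doubles
/// from `initial` up to `cap`, and the schedule stops as soon as the next
/// delay would push the total sleep time past `timeout`.
fn backoff_schedule(initial: Duration, cap: Duration, timeout: Duration) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut delay = initial.min(cap);
    let mut elapsed = Duration::ZERO;
    // A zero delay would never advance `elapsed`, so treat it as "no retries".
    while !delay.is_zero() && elapsed + delay <= timeout {
        delays.push(delay);
        elapsed += delay;
        delay = (delay * 2).min(cap);
    }
    delays
}
```

With a 1s initial delay, a 30s cap, and a 60s timeout, this yields delays of 1, 2, 4, 8, and 16 seconds (31s in total); the next delay would be 30s, which no longer fits. Deriving the schedule from the timeout means "N retries" and "--retry-timeout" collapse into a single knob, which may or may not be the desired behaviour.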

Tricky bits:

- exec() on Linux replaces the process, so a retry loop must either spawn+wait instead of exec, or use a wrapper shell loop. Spawning is probably cleaner: we lose the 'ssh fully owns the PTY' property but gain the ability to restart.
- The bearer token may have expired in the interval. The CLI already handles this via the 401 retry logic in client/mod.rs, but the attach path doesn't use that and needs similar treatment.
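To make the exec versus spawn+wait point concrete, here is a minimal sketch using only the standard library. `run_and_wait` is a hypothetical name, and the `sh -c "exit 3"` child in the usage note stands in for the real ssh invocation.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Hypothetical helper: runs the child to completion and hands back its exit
/// status. Unlike `std::os::unix::process::CommandExt::exec`, which replaces
/// the current process on success, control returns to the caller here, so
/// the caller can inspect the status and decide whether to retry.
fn run_and_wait(cmd: &mut Command) -> io::Result<ExitStatus> {
    // `Command::status` inherits stdin/stdout/stderr from the parent by
    // default, so an interactive child still talks to the user's terminal.
    cmd.status()
}
```

For example, `run_and_wait(Command::new("sh").arg("-c").arg("exit 3"))` returns a status whose `code()` is `Some(3)`, and the calling process is still alive to act on it.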

Required for

Not a v0.1.0 blocker. Users can manually run claudify attach <same name> after a gateway roll, which works because the SSH subsystem+session resolution is idempotent.

Implementation sketch

Replace the current cmd.exec() in crates/cli/src/cmd/attach.rs with a retry loop:

// Initial delay is an assumption; the cap of 30s comes from the line below.
let mut delay = Duration::from_secs(1);
loop {
    let status = cmd.status()?;        // spawn+wait, not exec
    if status.success() { return Ok(()); }
    if !should_retry(&status) { return Err(...); }
    eprintln!("connection dropped (code {:?}), retrying in {}s...", status.code(), delay.as_secs());
    tokio::time::sleep(delay).await;
    // double the delay for the next round, capped at 30s
    delay = (delay * 2).min(Duration::from_secs(30));
    // refresh session status before retrying
    wait_for_session_running(&client, &session_name).await?;
}
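The sketch above leaves should_retry undefined. One possible starting point, assuming OpenSSH's documented behaviour that `ssh` exits with the remote command's status or with 255 if an error occurred, is to retry only on 255. The name `should_retry_code` and its `Option<i32>` parameter (the value of `ExitStatus::code()`) are hypothetical and chosen to keep the example self-contained.

```rust
/// Exit code OpenSSH's `ssh` uses for its own failures (see ssh(1)).
const SSH_ERROR_EXIT_CODE: i32 = 255;

/// Hypothetical classifier over `ExitStatus::code()`.
/// Returns true only when ssh itself reported an error.
fn should_retry_code(exit_code: Option<i32>) -> bool {
    match exit_code {
        // ssh-level failure, e.g. the connection was dropped.
        Some(SSH_ERROR_EXIT_CODE) => true,
        // Any other code is the remote command's own exit status.
        Some(_) => false,
        // No code means the local ssh process was killed by a signal.
        None => false,
    }
}
```

This is deliberately coarse. A remote command that itself exits with 255 would be misread as a dropped connection, and the sketch makes no attempt to tell a user-initiated ~. apart from a gateway roll, so step 1 would still need an additional signal.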


qwolff referenced this issue from a commit

2026-04-25 09:35:39 +00:00

docs(setup): close gaps found during e2e

Reference

qwolff/claudify#2
