
fix(cli-proxy): surface HTTP status in DIFC probe diagnostics #5616

Open
lpcox wants to merge 2 commits into main from lpcox/cli-proxy-probe-diagnostics-5615

Conversation

@lpcox

@lpcox lpcox commented Jun 27, 2026

Collaborator

What & why

Fixes #5615.

The awf-cli-proxy sidecar probes the external DIFC proxy with gh api rate_limit before serving agent traffic. The classifier in containers/cli-proxy/entrypoint.sh only recognized connection-refused, timeout, and DNS failures; anything else became an opaque diagnosis=unknown.

On GitHub Enterprise Cloud data-residency (*.ghe.com) tenants the proxy comes up healthy and is reachable, but the forwarded gh api call returns an HTTP error (the proxy targets the wrong API host). That HTTP error fell into the unknown bucket, and the gh response body was discarded — so the firewall failed fast with no actionable signal. This is the firewall-side diagnostics ask (#3) from github/gh-aw#41225.

Changes

  • Capture gh stdout (response body) and stderr separately instead of redirecting stdout to /dev/null.
  • New classification bucket reachable-but-api-error (HTTP NNN) that extracts the HTTP status from the gh output.
  • Surface the captured gh error on every failed attempt (not just the last), and print the response body on final failure.
  • Targeted hint when GITHUB_SERVER_URL is a *.ghe.com host, pointing at the DIFC-proxy enterprise-host root cause.
  • Refactor the classifier into a pure classify_probe_failure() function and add tests/cli-proxy-probe-classify.test.sh (wired into the build.yml "Run shell unit tests" step).

No change to the fail-fast / health-gate semantics — this is purely better diagnostics.
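
For readers without the diff open, here is a minimal sketch of what a pure classifier of this shape could look like. It is not the code in containers/cli-proxy/entrypoint.sh: the argument order, the matched error strings, and the first three bucket names (connection-refused, timeout, dns) are assumptions for illustration. Only `unknown` and `reachable-but-api-error (HTTP NNN)` are taken from this description.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch; the real classify_probe_failure() may take different
# arguments and match different strings.
# Args: $1 = captured gh stderr, $2 = captured gh stdout (response body)
classify_probe_failure() {
  local stderr_text="${1:-}" body_text="${2:-}"
  local combined status
  combined="${stderr_text}
${body_text}"

  # Transport-level failures first (match strings are assumptions).
  case "$combined" in
    *"onnection refused"*) echo "connection-refused"; return 0 ;;
    *"timeout"*|*"timed out"*) echo "timeout"; return 0 ;;
    *"no such host"*|*"ould not resolve"*) echo "dns"; return 0 ;;
  esac

  # First "HTTP NNN" token from stderr or body. grep exits 1 on no match,
  # so `|| true` keeps set -e / pipefail from aborting the caller.
  status="$(printf '%s\n' "$combined" | grep -oE 'HTTP [0-9]{3}' | head -n 1 || true)"
  if [ -n "$status" ]; then
    echo "reachable-but-api-error (${status})"
  else
    echo "unknown"
  fi
}
```

Because the function only reads its arguments and writes one line to stdout, it can be sourced and exercised from a test script without starting the proxy.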

Scope note

This PR is the firewall-side piece only. The underlying root cause, the DIFC proxy not being enterprise-host-aware on *.ghe.com, is tracked in the companion issues github/gh-aw-mcpg#8202 and github/gh-aw#41911.

Testing

$ bash tests/cli-proxy-probe-classify.test.sh
Results: 8 passed, 0 failed
$ bash tests/setup-iptables-port-spec.test.sh
Results: 53 passed, 0 failed
$ bash -n containers/cli-proxy/entrypoint.sh   # syntax OK

The new test covers each bucket, including HTTP status in stderr vs body, and confirms the unknown fallback does not trip set -e when no HTTP status matches.
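
The set -e point can be illustrated with a small self-contained test script. This is a sketch in the spirit of tests/cli-proxy-probe-classify.test.sh, not a copy of it: `extract_http_status` and `assert_eq` are hypothetical helper names, and the real test may be structured differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper under test: prints the first "HTTP NNN" token, or nothing.
extract_http_status() {
  # grep exits 1 when nothing matches; with pipefail that status propagates,
  # so `|| true` is what keeps set -e from killing the script here.
  printf '%s\n' "$1" | grep -oE 'HTTP [0-9]{3}' | head -n 1 || true
}

passed=0
failed=0

# assert_eq NAME EXPECTED ACTUAL
assert_eq() {
  if [ "$2" = "$3" ]; then
    passed=$((passed + 1))
  else
    failed=$((failed + 1))
    echo "FAIL: $1 (expected '$2', got '$3')"
  fi
}

assert_eq "status in stderr" "HTTP 404" "$(extract_http_status 'gh: Not Found (HTTP 404)')"
assert_eq "no status present" "" "$(extract_http_status 'some unrecognized error')"

echo "Results: ${passed} passed, ${failed} failed"
[ "$failed" -eq 0 ]
```

The counters use `passed=$((passed + 1))` rather than `((passed++))`, since the latter returns a non-zero status when the pre-increment value is 0 and would itself trip set -e.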

Example log (before → after)

Before:

[cli-proxy] DIFC proxy probe failed (attempt 1/10, diagnosis=unknown) ...

After (on a *.ghe.com tenant):

[cli-proxy] ERROR: DIFC proxy liveness probe failed ... diagnosis=reachable-but-api-error (HTTP 404)
[cli-proxy] gh api stderr: gh: Not Found (HTTP 404)
[cli-proxy] gh api response body: {"message":"Not Found", ...}
[cli-proxy] HINT: GITHUB_SERVER_URL=... looks like a GitHub Enterprise data-residency tenant ...
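
One way the *.ghe.com check behind the HINT line could be written is sketched below. The function names are hypothetical, the hint text is abbreviated to the part shown in the log above, and the real entrypoint may parse GITHUB_SERVER_URL differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds when the URL's host ends in .ghe.com.
is_ghe_data_residency_host() {
  local url="${1:-}" host
  host="${url#*://}"   # drop the scheme, if any
  host="${host%%/*}"   # drop any path
  host="${host%%:*}"   # drop any port
  case "$host" in
    *.ghe.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the hint only when GITHUB_SERVER_URL points at a *.ghe.com host.
maybe_print_ghe_hint() {
  local url="${GITHUB_SERVER_URL:-}"
  if is_ghe_data_residency_host "$url"; then
    echo "[cli-proxy] HINT: GITHUB_SERVER_URL=${url} looks like a GitHub Enterprise data-residency tenant"
  fi
}
```

Matching on the parsed host rather than the whole URL avoids a false positive for something like https://example.org/foo.ghe.com.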

The awf-cli-proxy liveness probe ran 'gh api rate_limit' through the
external DIFC proxy and bucketed any failure that was not conn-refused,
timeout, or DNS into an opaque 'diagnosis=unknown'. On GHEC data-residency
(*.ghe.com) tenants the proxy is reachable but the forwarded call returns an
HTTP error, so the real cause was hidden.

- Capture gh stdout (response body) and stderr separately instead of
  discarding stdout.
- Add a 'reachable-but-api-error (HTTP NNN)' classification bucket.
- Print the gh stderr on every failed attempt and the body on final failure.
- Emit a targeted hint when GITHUB_SERVER_URL is a *.ghe.com host.
- Extract classify_probe_failure() as a pure function and unit-test it.

Refs #5615. Root cause tracked in github/gh-aw-mcpg#8202 and github/gh-aw#41911.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 27, 2026 17:47
@github-actions

Contributor

✅ Coverage Check Passed

Overall Coverage

| Metric | Base | PR | Delta |
|---|---|---|---|
| Lines | 98.25% | 98.28% | 📈 +0.03% |
| Statements | 98.17% | 98.21% | 📈 +0.04% |
| Functions | 99.53% | 99.53% | ➡️ +0.00% |
| Branches | 94.00% | 94.00% | ➡️ +0.00% |

📁 Per-file Coverage Changes (1 file)

| File | Lines (Before → After) | Statements (Before → After) |
|---|---|---|
| src/workdir-setup.ts | 92.7% → 94.5% (+1.82%) | 92.7% → 94.5% (+1.82%) |

Coverage comparison generated by scripts/ci/compare-coverage.ts

Copilot AI left a comment

Contributor

Pull request overview

Improves awf-cli-proxy DIFC liveness probe diagnostics by capturing gh api rate_limit stdout/stderr separately, classifying HTTP API failures explicitly (including HTTP status), and surfacing actionable logs/hints (notably for *.ghe.com tenants) while keeping existing fail-fast gating semantics.

Changes:

  • Refactors probe failure classification into a pure classify_probe_failure() function and adds an HTTP-status-based bucket (reachable-but-api-error (HTTP NNN)).
  • Captures and surfaces gh stderr on every failed attempt, and prints stderr + response body on final failure (plus a targeted *.ghe.com hint).
  • Adds a focused shell unit test for the classifier and wires it into the build workflow.
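
The separate stdout/stderr capture described in this overview can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: `run_probe`, the `PROBE_*` variables, and `fake_gh` (a stand-in for `gh api rate_limit`, so the sketch runs without gh or network access) are all hypothetical names.

```bash
#!/usr/bin/env bash
set -euo pipefail

# run_probe CMD [ARGS...]
# Runs CMD, keeping stdout and stderr apart instead of sending stdout to /dev/null.
# Sets PROBE_RC, PROBE_STDOUT, PROBE_STDERR (hypothetical names).
run_probe() {
  local out_file err_file
  out_file="$(mktemp)"
  err_file="$(mktemp)"
  PROBE_RC=0
  # `|| PROBE_RC=$?` records the failure without tripping set -e.
  "$@" >"$out_file" 2>"$err_file" || PROBE_RC=$?
  PROBE_STDOUT="$(cat "$out_file")"
  PROBE_STDERR="$(cat "$err_file")"
  rm -f "$out_file" "$err_file"
}

# Stand-in for `gh api rate_limit` returning an HTTP error through the proxy.
fake_gh() {
  echo '{"message":"Not Found"}'
  echo 'gh: Not Found (HTTP 404)' >&2
  return 1
}

run_probe fake_gh
if [ "$PROBE_RC" -ne 0 ]; then
  echo "[cli-proxy] gh api stderr: ${PROBE_STDERR}"
  echo "[cli-proxy] gh api response body: ${PROBE_STDOUT}"
fi
```

Temporary files are used here because a single command substitution can only capture one stream; the two captured strings can then be passed to a classifier.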
Summary per file

| File | Description |
|---|---|
| containers/cli-proxy/entrypoint.sh | Adds a testable classifier, captures stdout/stderr separately, and emits improved retry/final-failure diagnostics including HTTP status and a *.ghe.com hint. |
| tests/cli-proxy-probe-classify.test.sh | New shell unit test to validate each probe failure classification bucket, including HTTP status extraction paths. |
| .github/workflows/build.yml | Runs the new shell unit test in CI alongside existing shell tests. |

Review details

  • Files reviewed: 3/3 changed files
  • Comments generated: 1
  • Review effort level: Low

Comment thread tests/cli-proxy-probe-classify.test.sh
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

Contributor

✅ Copilot review passed with no inline comments.

@lpcox Add the ready-for-aw label to this PR to trigger agentic CI smoke tests.

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Smoke Gemini completed. All facets verified. 💎

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Smoke Claude passed

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Security Guard failed. Please review the logs for details.

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Build Test Suite completed successfully!

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

🔌 Smoke Services — All services reachable! ✅

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓

Posted smoke test results and added label smoke-copilot-byok-aoai-entra

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

Contribution Check completed successfully!

Contribution check complete: PR #5616 follows the applicable CONTRIBUTING.md guidelines. It has a clear description with related issue references, includes focused tests wired into CI, and places changes in appropriate container/test/workflow locations.

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

@github-actions

github-actions Bot commented Jun 27, 2026

Contributor

🔑 Smoke Copilot PAT: PAT auth validated. All systems operational. ✅

@github-actions

Contributor

Smoke Test: Claude Engine

  • API check: ✅ PASS
  • gh check: ✅ PASS
  • File check: ✅ PASS

Overall result: PASS

Generated by Smoke Claude for issue #5616 · 61.3 AIC · ⊞ 3.3K ·

@github-actions

Contributor

Smoke Test: Copilot BYOK (Direct) Mode — ✅ PASS

✅ GitHub MCP connectivity verified (2 merged PRs fetched)
✅ GitHub.com connectivity: HTTP 200
✅ File write/read test: artifact file exists
✅ BYOK inference path working (api-proxy → api.githubcopilot.com)

Mode: Direct BYOK via COPILOT_PROVIDER_API_KEY with api-proxy sidecar
Status: All smoke tests passed

🔑 BYOK report filed by Smoke Copilot BYOK

@github-actions

Contributor

🔍 Smoke Test Results

Test Status
GitHub MCP connectivity
GitHub.com HTTP connectivity ✅ HTTP 200
File write/read ❌ Pre-step outputs not substituted

PR: fix(cli-proxy): surface HTTP status in DIFC probe diagnostics
Author: @lpcox

Overall: FAIL — pre-step template variables (${{ steps.smoke-data.outputs.* }}) were not expanded; file test could not be verified.

📰 BREAKING: Report filed by Smoke Copilot

@github-actions github-actions Bot mentioned this pull request Jun 27, 2026
@github-actions

Contributor

Smoke Test

  • Apply safe dependency updates for June 2026 security refresh
  • Update transitive linkify-it to 5.0.1 in lockfile ✅
  • GitHub title check
  • Temp file write/read
  • Discussion lookup/comment
  • npm ci && npm run build
  • Overall: PASS

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

@github-actions

Contributor

📡 OTEL Tracing Smoke Test Results

Scenario Result Notes
1. Module Loading otel.js loaded; isEnabled()=true; exports startRequestSpan, setTokenAttributes, setBudgetAttributes, endSpan, endSpanError, shutdown + internals
2. Test Suite 59 passed, 0 failed across 2 suites (otel.test.js, otel-fanout.test.js)
3. Env Var Forwarding ⚠️ pending src/services/api-proxy-service.ts does not yet forward OTEL_EXPORTER_OTLP_ENDPOINT / GITHUB_AW_OTEL_TRACE_ID — expected during development
4. Token Tracker Integration onUsage callback present in token-tracker-http.js
5. OTEL Diagnostics ⚠️ n/a No otel.jsonl — expected; file-fallback spans only emit when api-proxy container runs

Summary: Core OTEL implementation is healthy (module + 59 tests passing + onUsage hook in place). Env-var forwarding from the AWF CLI layer to the api-proxy container is the one remaining open item.

📡 OTel tracing validated by Smoke OTel Tracing

@github-actions

Contributor

🔬 Smoke Test: Copilot PAT Auth — PASS

Test Result
GitHub MCP connectivity
GitHub.com HTTP status ✅ 200
File write/read

Overall: PASS | Auth mode: PAT (COPILOT_GITHUB_TOKEN)
cc @lpcox

🔑 PAT report filed by Smoke Copilot PAT

@github-actions

Contributor

🧪 Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
|---|---|---|---|
| Python | Python 3.12.13 | Python 3.12.3 | |
| Node.js | v24.17.0 | v22.23.0 | |
| Go | go1.22.12 | go1.22.12 | |

Overall: ❌ FAILED — Python and Node.js versions differ between host and chroot environment.

Tested by Smoke Chroot

@github-actions

Contributor

@lpcox
fix(cli-proxy): surface HTTP status in DIFC probe diagnostics ✅
fix: correctly recover runner tool on PATH (after sudo w/ secure_path). remove incorrect reading from GITHUB_PATH ✅
GitHub.com connectivity ✅
File write/read test ✅
Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) ✅

Overall: PASS

🔑 BYOK (AOAI api-key) report filed by Smoke Copilot BYOK AOAI (api-key)

@github-actions

Contributor

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color 1/1 passed ✅ PASS
Go env 1/1 passed ✅ PASS
Go uuid 1/1 passed ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx all passed ✅ PASS
Node.js execa all passed ✅ PASS
Node.js p-limit all passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #5616 · 37.3 AIC · ⊞ 7.8K ·

@github-actions

Contributor

@lpcox

fix(cli-proxy): surface HTTP status in DIFC probe diagnostics ✅
fix: correctly recover runner tool on PATH (after sudo w/ secure_path). remove incorrect reading from GITHUB_PATH ✅
GitHub.com connectivity ✅
File write/read ✅
Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) authenticated via Microsoft Entra ✅

Overall: PASS

🪪 BYOK (AOAI Entra) report filed by Smoke Copilot BYOK AOAI (Entra)

@lpcox lpcox deployed to aoai-model June 27, 2026 17:57 — with GitHub Actions Active
@github-actions

Contributor

Gemini Engine Validation Results

  • GitHub MCP Testing: ✅
  • GitHub.com Connectivity: ✅
  • File Writing Testing: ✅
  • Bash Tool Testing: ✅

Overall status: PASS

Last 2 merged PRs:

  1. Apply safe dependency updates for June 2026 security refresh (#5609)
  2. Update transitive linkify-it to 5.0.1 in lockfile (#5608)

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

@github-actions

Contributor

Smoke Test: GitHub Actions Services Connectivity

Check Result
Redis PING ❌ No response (timeout)
PostgreSQL pg_isready no response
PostgreSQL SELECT 1 ❌ No response (timeout)

Overall: FAIL

host.docker.internal resolves to 172.17.0.1 but neither Redis (6379) nor PostgreSQL (5432) are reachable from this agent environment. Service containers do not appear to be running or accessible in the current workflow context.

🔌 Service connectivity validated by Smoke Services


Development

Successfully merging this pull request may close these issues.

Improve awf-cli-proxy DIFC probe diagnostics (replace opaque diagnosis=unknown)

2 participants