fix(cli-proxy): surface HTTP status in DIFC probe diagnostics#5616
fix(cli-proxy): surface HTTP status in DIFC probe diagnostics#5616lpcox wants to merge 2 commits into
Conversation
The awf-cli-proxy liveness probe ran 'gh api rate_limit' through the external DIFC proxy and bucketed any failure that was not conn-refused, timeout, or DNS into an opaque 'diagnosis=unknown'. On GHEC data-residency (*.ghe.com) tenants the proxy is reachable but the forwarded call returns an HTTP error, so the real cause was hidden. - Capture gh stdout (response body) and stderr separately instead of discarding stdout. - Add a 'reachable-but-api-error (HTTP NNN)' classification bucket. - Print the gh stderr on every failed attempt and the body on final failure. - Emit a targeted hint when GITHUB_SERVER_URL is a *.ghe.com host. - Extract classify_probe_failure() as a pure function and unit-test it. Refs #5615. Root cause tracked in github/gh-aw-mcpg#8202 and github/gh-aw#41911. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ Coverage Check PassedOverall Coverage
📁 Per-file Coverage Changes (1 files)
Coverage comparison generated by |
There was a problem hiding this comment.
Pull request overview
Improves awf-cli-proxy DIFC liveness probe diagnostics by capturing gh api rate_limit stdout/stderr separately, classifying HTTP API failures explicitly (including HTTP status), and surfacing actionable logs/hints (notably for *.ghe.com tenants) while keeping existing fail-fast gating semantics.
Changes:
- Refactors probe failure classification into a pure
classify_probe_failure()function and adds an HTTP-status-based bucket (reachable-but-api-error (HTTP NNN)). - Captures and surfaces
ghstderr on every failed attempt, and prints stderr + response body on final failure (plus a targeted*.ghe.comhint). - Adds a focused shell unit test for the classifier and wires it into the build workflow.
Show a summary per file
| File | Description |
|---|---|
containers/cli-proxy/entrypoint.sh |
Adds a testable classifier, captures stdout/stderr separately, and emits improved retry/final-failure diagnostics including HTTP status and a *.ghe.com hint. |
tests/cli-proxy-probe-classify.test.sh |
New shell unit test to validate each probe failure classification bucket, including HTTP status extraction paths. |
.github/workflows/build.yml |
Runs the new shell unit test in CI alongside existing shell tests. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 3/3 changed files
- Comments generated: 1
- Review effort level: Low
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
✅ Copilot review passed with no inline comments. @lpcox Add the |
|
✅ Smoke Gemini completed. All facets verified. 💎 |
|
✅ Smoke Claude passed |
|
Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded. |
|
✅ Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓 |
|
❌ Security Guard failed. Please review the logs for details. |
|
✅ Build Test Suite completed successfully! |
|
🔌 Smoke Services — All services reachable! ✅ |
|
✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟 |
|
✅ Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓 Posted smoke test results and added label smoke-copilot-byok-aoai-entra |
|
📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤 |
|
✅ Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓 |
|
✅ Contribution Check completed successfully! Contribution check complete: PR #5616 follows the applicable CONTRIBUTING.md guidelines. It has a clear description with related issue references, includes focused tests wired into CI, and places changes in appropriate container/test/workflow locations. |
|
📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅ |
|
🔑 Smoke Copilot PAT PAT auth validated. All systems operational. ✅ |
Smoke Test: Claude Engine
Overall result: PASS
|
Smoke Test: Copilot BYOK (Direct) Mode — ✅ PASS✅ GitHub MCP connectivity verified (2 merged PRs fetched) Mode: Direct BYOK via
|
🔍 Smoke Test Results
PR: fix(cli-proxy): surface HTTP status in DIFC probe diagnostics Overall: FAIL — pre-step template variables (
|
Smoke Test
Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "registry.npmjs.org"See Network Configuration for more information.
|
📡 OTEL Tracing Smoke Test Results
Summary: Core OTEL implementation is healthy (module + 59 tests passing +
|
🔬 Smoke Test: Copilot PAT Auth — PASS
Overall: PASS | Auth mode: PAT (COPILOT_GITHUB_TOKEN)
|
🧪 Chroot Version Comparison Results
Overall: ❌ FAILED — Python and Node.js versions differ between host and chroot environment.
|
|
@lpcox Overall: PASS
|
🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed — ✅ PASS
|
|
fix(cli-proxy): surface HTTP status in DIFC probe diagnostics ✅ Overall: PASS
|
Gemini Engine Validation Results
Overall status: PASS Last 2 merged PRs:
Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "localhost"See Network Configuration for more information.
|
Smoke Test: GitHub Actions Services Connectivity
Overall: FAIL
|
What & why
Fixes #5615.
The
awf-cli-proxysidecar probes the external DIFC proxy withgh api rate_limitbefore serving agent traffic. The classifier incontainers/cli-proxy/entrypoint.shonly recognized connection-refused, timeout, and DNS failures; anything else became an opaquediagnosis=unknown.On GitHub Enterprise Cloud data-residency (
*.ghe.com) tenants the proxy comes up healthy and is reachable, but the forwardedgh apicall returns an HTTP error (the proxy targets the wrong API host). That HTTP error fell into theunknownbucket, and theghresponse body was discarded — so the firewall failed fast with no actionable signal. This is the firewall-side diagnostics ask (#3) from github/gh-aw#41225.Changes
ghstdout (response body) and stderr separately instead of redirecting stdout to/dev/null.reachable-but-api-error (HTTP NNN)that extracts the HTTP status from the gh output.gherror on every failed attempt (not just the last), and print the response body on final failure.GITHUB_SERVER_URLis a*.ghe.comhost, pointing at the DIFC-proxy enterprise-host root cause.classify_probe_failure()function and addtests/cli-proxy-probe-classify.test.sh(wired into thebuild.yml"Run shell unit tests" step).No change to the fail-fast / health-gate semantics — this is purely better diagnostics.
Scope note
This PR is the firewall-side piece only. The underlying root cause — the DIFC proxy not being enterprise-host-aware on
*.ghe.com— is tracked in the companion issues:Testing
The new test covers each bucket, including HTTP status in stderr vs body, and confirms the
unknownfallback does not tripset -ewhen no HTTP status matches.Example log (before → after)
Before:
After (on a
*.ghe.comtenant):