Summary
When the awf-cli-proxy sidecar's DIFC-proxy liveness probe fails because the proxy is reachable but the forwarded gh api call returns an HTTP error (e.g. on GitHub Enterprise Cloud data-residency *.ghe.com tenants), the probe reports the opaque diagnosis=unknown and discards the actual gh api response. This makes the failure in github/gh-aw#41225 much harder to diagnose than it needs to be.
This issue tracks the firewall-side diagnostics improvement (ask #3 from github/gh-aw#41225). The underlying enterprise-host root cause is tracked in github/gh-aw-mcpg and github/gh-aw.
Current behavior
containers/cli-proxy/entrypoint.sh runs:
PROBE_ERR="$(timeout "${LIVENESS_TIMEOUT_SECONDS}" gh api rate_limit 2>&1 >/dev/null)"
and classifies the failure into four buckets:
- connection refused / ECONNREFUSED → not-yet-ready
- exit 124 / timeout / deadline → unreachable (timeout)
- EAI_AGAIN / ENOTFOUND / getaddrinfo → dns-not-yet-ready
- everything else → unknown
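The four buckets above could be expressed roughly as follows. This is a minimal sketch, assuming the classifier keys on the probe's exit code and the captured stderr text; `classify_probe_failure` is a hypothetical name, and the real entrypoint.sh may be structured differently.

```shell
# Hypothetical helper (not taken from entrypoint.sh).
# $1 = probe exit code, $2 = captured stderr text.
classify_probe_failure() {
  if printf '%s' "$2" | grep -qiE 'connection refused|ECONNREFUSED'; then
    echo "not-yet-ready"
  elif [ "$1" -eq 124 ] || printf '%s' "$2" | grep -qiE 'timeout|deadline'; then
    # 124 is the exit code timeout(1) uses when the command timed out.
    echo "unreachable (timeout)"
  elif printf '%s' "$2" | grep -qE 'EAI_AGAIN|ENOTFOUND|getaddrinfo'; then
    echo "dns-not-yet-ready"
  else
    # Everything else, including HTTP errors from the forwarded call.
    echo "unknown"
  fi
}
```

Under this sketch, a stderr line such as `gh: Not Found (HTTP 404)` matches none of the first three patterns and lands in the final branch.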
Gaps:
- An HTTP error from the forwarded call (proxy up, but API host wrong / auth failed) is none of the three known classes, so it silently falls into unknown.
- The gh response body is discarded (>/dev/null); only stderr is captured.
- PROBE_ERR is printed only on the final attempt, so intermediate retries show nothing but diagnosis=unknown.
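The second gap follows from the redirection order in the probe command. The sketch below reproduces it with a stand-in function; `fake_gh` is hypothetical and only assumes that a failing call writes the response body to stdout and a one-line summary to stderr.

```shell
# Hypothetical stand-in for a failing `gh api` call:
# response body on stdout, error summary on stderr, non-zero exit.
fake_gh() {
  echo '{"message":"Not Found"}'
  echo 'gh: Not Found (HTTP 404)' >&2
  return 1
}

# Same redirection as the probe. `2>&1` first points stderr at the
# command-substitution pipe, then `>/dev/null` sends stdout away,
# so only the stderr line is captured and the body is lost.
CAPTURED="$(fake_gh 2>&1 >/dev/null)" || true

echo "captured: ${CAPTURED}"
```

With this stand-in, `CAPTURED` holds only the `gh:` summary line; the JSON body never reaches the log.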
Proposed improvement (diagnostics only — no change to the gate behavior)
- Add a fifth classification bucket: grep PROBE_ERR for HTTP [0-9]{3} / gh: and report e.g. reachable-but-api-error (HTTP 404) instead of unknown.
- Capture stdout (the response body) as well, and print it on final failure so the actual status/body is visible.
- Surface the captured gh api error on each failed attempt (or an inline snippet), not just the last one.
- When the bucket is the HTTP-error case and GITHUB_SERVER_URL is a *.ghe.com host, emit a targeted hint pointing at the DIFC-proxy enterprise-host gap (cross-link the companion issues).
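One possible shape for these changes is sketched below. It is illustrative only: `run_probe`, `classify_http_error`, `ghe_hint`, and `PROBE_OUT` / `PROBE_RC` are hypothetical names, the hint wording is a placeholder, and the sketch assumes temp files are an acceptable way to keep stdout and stderr apart.

```shell
# Hypothetical helpers; names and hint text are assumptions, not entrypoint.sh code.

# Run a command, keeping stdout, stderr, and the exit code separately.
run_probe() {
  _out_file="$(mktemp)"
  _err_file="$(mktemp)"
  PROBE_RC=0
  "$@" >"$_out_file" 2>"$_err_file" || PROBE_RC=$?
  PROBE_OUT="$(cat "$_out_file")"
  PROBE_ERR="$(cat "$_err_file")"
  rm -f "$_out_file" "$_err_file"
}

# Fifth bucket: report the HTTP status when the stderr text contains one.
classify_http_error() {
  _http_status="$(printf '%s\n' "$1" | grep -oE 'HTTP [0-9]{3}' | head -n 1 || true)"
  if [ -n "$_http_status" ]; then
    echo "reachable-but-api-error ($_http_status)"
  else
    echo "unknown"
  fi
}

# Print a targeted hint only when the server URL is a *.ghe.com host.
ghe_hint() {
  case "$1" in
    https://*.ghe.com | https://*.ghe.com/*)
      echo "hint: GITHUB_SERVER_URL is a *.ghe.com host; see the companion issues for the DIFC-proxy enterprise-host gap"
      ;;
  esac
}
```

In the probe loop this might be wired as `run_probe timeout "${LIVENESS_TIMEOUT_SECONDS}" gh api rate_limit`, with `classify_http_error "$PROBE_ERR"` consulted only after the three existing buckets fail to match, and `PROBE_ERR` (plus `PROBE_OUT` on the final attempt) logged on every failed attempt. The gate decision itself stays untouched.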
Acceptance
- A *.ghe.com probe failure reports the HTTP status/body and a meaningful diagnosis, not diagnosis=unknown.
- No change to fail-fast/gate semantics; purely better logging.
Companion issues
This is tracked across three repositories:
Original report: github/gh-aw#41225
Related runner-doctor failure modes: C2 (#1315), C4 (#1452, #1460, #1492, #1499), B5 (#5543, #5542); #1300