Skip to content

Improve awf-cli-proxy DIFC probe diagnostics (replace opaque diagnosis=unknown) #5615

Description

@lpcox

Summary

When the awf-cli-proxy sidecar's DIFC-proxy liveness probe fails because the proxy is reachable but the forwarded gh api call returns an HTTP error (e.g. on GitHub Enterprise Cloud data-residency *.ghe.com tenants), the probe reports the opaque diagnosis=unknown and discards the actual gh api response. This makes the failure in github/gh-aw#41225 much harder to diagnose than it needs to be.

This issue tracks the firewall-side diagnostics improvement (ask #3 from github/gh-aw#41225). The underlying enterprise-host root cause is tracked in github/gh-aw-mcpg and github/gh-aw.

Current behavior

containers/cli-proxy/entrypoint.sh runs:

PROBE_ERR="$(timeout "${LIVENESS_TIMEOUT_SECONDS}" gh api rate_limit 2>&1 >/dev/null)"

and classifies the failure into four buckets:

  • connection refused / ECONNREFUSEDnot-yet-ready
  • exit 124 / timeout / deadlineunreachable (timeout)
  • EAI_AGAIN / ENOTFOUND / getaddrinfodns-not-yet-ready
  • everything else → unknown

Gaps:

  1. An HTTP error from the forwarded call (proxy up, but API host wrong / auth failed) is none of the three known classes, so it silently falls into unknown.
  2. The gh response body is discarded (>/dev/null); only stderr is captured.
  3. PROBE_ERR is printed only on the final attempt, so intermediate retries show nothing but diagnosis=unknown.

Proposed improvement (diagnostics only — no change to the gate behavior)

  1. Add a fifth classification bucket: grep PROBE_ERR for HTTP [0-9]{3} / gh: and report e.g. reachable-but-api-error (HTTP 404) instead of unknown.
  2. Capture stdout (the response body) as well, and print it on final failure so the actual status/body is visible.
  3. Surface the captured gh api error on each failed attempt (or an inline snippet), not just the last one.
  4. When the bucket is the HTTP-error case and GITHUB_SERVER_URL is a *.ghe.com host, emit a targeted hint pointing at the DIFC-proxy enterprise-host gap (cross-link the companion issues).

Acceptance

  • A *.ghe.com probe failure reports the HTTP status/body and a meaningful diagnosis, not diagnosis=unknown.
  • No change to fail-fast/gate semantics; purely better logging.

Companion issues

This is tracked across three repositories:

Original report: github/gh-aw#41225
Related runner-doctor failure modes: C2 (#1315), C4 (#1452, #1460, #1492, #1499), B5 (#5543, #5542); #1300

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions