oci downloader fails with ACR (token header mistakenly added on redirect) #7189

Open
bluebrown opened this issue Nov 23, 2024 · 19 comments · May be fixed by #7196
Labels: bug, inactive, investigating

Comments

@bluebrown

bluebrown commented Nov 23, 2024

Short description

When I run OPA locally from the binary release, I am able to download OCI
bundles. When I use the same version of OPA from the official Docker Hub
image, the download fails with a 403.

Sometimes it fails only after the first metadata requests have succeeded, on
the subsequent requests that fetch the blobs.

I have experienced the same issue on a deployed OPA instance,
in Kubernetes.

Steps to reproduce

Files

.env:

ACR_URL=<http url>
ACR_TOKEN=<acrname>:<admintoken>
ACR_BUNDLE=<image-fqdn>

opa config:

decision_logs:
  console: true
services:
  acr:
    url: ${ACR_URL}
    type: oci
    credentials:
      bearer:
        scheme: Basic
        token: ${ACR_TOKEN}
bundles:
  main:
    service: acr
    resource: ${ACR_BUNDLE}
    persist: false

Local

$ opa version
Version: 0.70.0
Build Commit: 2ea031ea04e6a8afbc5dd22f656131dc3cfc5a7d
Build Timestamp: 2024-10-31T19:39:52Z
Build Hostname: 799a5774bce7
Go Version: go1.23.1
Platform: linux/amd64
WebAssembly: available

$ source .env && opa run -c opa.yaml
{"level":"info","msg":"Starting bundle loader.","name":"main","plugin":"bundle","time":"2024-11-23T09:17:37+01:00"}
{"level":"info","msg":"Starting decision logger.","plugin":"decision_logs","time":"2024-11-23T09:17:37+01:00"}
OPA 0.70.0 (commit 2ea031ea04e6a8afbc5dd22f656131dc3cfc5a7d, built at 2024-10-31T19:39:52Z)

Run 'help' to see a list of commands and check for updates.

> {"level":"info","msg":"Bundle loaded and activated successfully. Etag updated to 2c2def83472ef874c864e9a9af91e57cffef3076537e34919cfb5b2d459d6b7d.","name":"main","plugin":"bundle","time":"2024-11-23T09:17:37+01:00"}

Docker

$ docker run --rm --env-file .env -v "$PWD/opa.yaml:/opa.yaml" docker.io/openpolicyagent/opa:0.70.0 version
Version: 0.70.0
Build Commit: 2ea031ea04e6a8afbc5dd22f656131dc3cfc5a7d
Build Timestamp: 2024-10-31T19:39:52Z
Build Hostname: fc72eb07593c
Go Version: go1.23.1
Platform: linux/amd64
WebAssembly: available

$ docker run --rm --env-file .env -v "$PWD/opa.yaml:/opa.yaml" docker.io/openpolicyagent/opa:0.70.0 run -c /opa.yaml
{"level":"info","msg":"Starting bundle loader.","name":"main","plugin":"bundle","time":"2024-11-23T08:24:38Z"}
{"level":"info","msg":"Starting decision logger.","plugin":"decision_logs","time":"2024-11-23T08:24:38Z"}
OPA 0.70.0 (commit 2ea031ea04e6a8afbc5dd22f656131dc3cfc5a7d, built at 2024-10-31T19:39:52Z)

Run 'help' to see a list of commands and check for updates.

>
Do you want to exit ([y]/n)? {"level":"info","msg":"Stopping bundle loader.","name":"main","plugin":"bundle","time":"2024-11-23T08:24:38Z"}
{"error":"failed to do request: Head \"https://org.azurecr.io/v2/acme/policy/manifests/latest\": context canceled","host":"org.azurecr.io","level":"info","msg":"trying next host","time":"2024-11-23T08:24:38Z"}
{"level":"error","msg":"Bundle load failed: failed to pull org.azurecr.io/acme/policy:latest: download for 'org.azurecr.io/acme/policy:latest' failed: failed to resolve org.azurecr.io/acme/policy:latest: failed to do request: Head \"https://org.azurecr.io/v2/acme/policy/manifests/latest\": context canceled","name":"main","plugin":"bundle","time":"2024-11-23T08:24:38Z"}
{"level":"info","msg":"Stopping decision logger.","plugin":"decision_logs","time":"2024-11-23T08:24:38Z"}

Expected behavior

I would expect that OPA in the container image has the same capabilities
as the binary release.

@bluebrown bluebrown added the bug label Nov 23, 2024
@bluebrown
Author

I tried building a custom Docker image that fetches the exact same binary I have locally, where it works. It still fails in the same way. Could it be that the containerd libraries in use get confused by running in a container?

FROM docker.io/library/debian:stable-slim
ADD https://github.com/open-policy-agent/opa/releases/download/v0.70.0/opa_linux_amd64 .
RUN echo '2879c01f1e5762f28e27c9f81b4035bd5f532753f18c2c6dcbc2943347cc6ea5 opa_linux_amd64' | sha256sum -c
RUN install opa_linux_amd64 /usr/local/bin/opa
ENTRYPOINT ["/usr/local/bin/opa"]

@bluebrown
Author

bluebrown commented Nov 23, 2024

This is the error for the blob requests. It only appears when
running OPA in server mode.

This one is the strangest, because OPA has already successfully requested
and read the metadata. Otherwise, it could not know which blobs to
fetch.

Bundle load failed: failed to pull org.azurecr.io/acme/policy:latest:
download for 'org.azurecr.io/acme/policy:latest' failed: failed to
ingest: copy failed: httpReadSeeker: failed open: unexpected status code
https://org.azurecr.io/v2/acme/policy/blobs/sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356:
403 Server failed to authenticate the request. Make sure the value of
Authorization header is formed correctly including the signature.

@bluebrown
Author

bluebrown commented Nov 23, 2024

I created a repo here with this reproduction:
https://github.com/bluebrown/opa-issue-containerized-oci-download.

It also contains Go code that tries to replicate OPA's behavior by
using the same libraries. I can't reproduce the issue outside
of running OPA in a container. The code downloads the blobs as
expected, both locally and in a container.
https://github.com/bluebrown/opa-issue-containerized-oci-download/blob/main/gist/main.go

@srenatus
Contributor

You can check the environment of the running OPA server by querying for x = opa.runtime().env. You can do that with its minimal web UI on "/", for example. Do you find all env vars that you expected to find?

@bluebrown
Author

bluebrown commented Nov 23, 2024

Hi, I ran the query in the web UI of the official OPA Docker image.
The result looks as expected.

{
  "result": [
    {
      "x": {
        "ACR_BUNDLE": "org.azurecr.io/acme/policy:latest",
        "ACR_TOKEN": "org:<redacted admin token>",
        "ACR_URL": "https://org.azurecr.io",
        "HOME": "/",
        "HOSTNAME": "e2ebb7897e97",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/",
        "SSL_CERT_FILE": "/etc/ssl/certs/ca-certificates.crt"
      }
    }
  ]
}

When I turn on debug logs, we can see that it authenticates successfully
on the first metadata requests, but then fails on the blob requests.
So I don't think it's an issue with the variables. If it were, the first
request should already fail, or OPA should report that it's trying to use
anonymous authentication.

[DEBUG] OCI - Download starting.

[DEBUG] OCIDownloader: using auth plugin: *rest.bearerAuthPlugin

[DEBUG] resolving
  host = "org.azurecr.io"

[DEBUG] do request
  request.header.user-agent = "containerd/1.7.23+unknown"
  host = "org.azurecr.io"
  url = "https://org.azurecr.io/v2/acme/policy/manifests/latest"
  request.header.accept = "application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*"
  request.method = "HEAD"

[DEBUG] fetch response received
  response.header.etag = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.header.access-control-expose-headers.2 = "Link"
  response.header.x-content-type-options = "nosniff"
  response.header.docker-content-digest = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.header.date = "Sat, 23 Nov 2024 17:05:02 GMT"
  response.header.access-control-expose-headers.3 = "X-Ms-Correlation-Request-Id"
  response.header.x-ms-client-request-id = ""
  response.header.content-type = "application/vnd.oci.image.manifest.v1+json"
  response.header.strict-transport-security.1 = "max-age=31536000; includeSubDomains"
  host = "org.azurecr.io"
  response.header.x-ms-correlation-request-id = "bfc4bff4-b174-4772-a11d-20b3e4aca8c2"
  response.header.strict-transport-security = "max-age=31536000; includeSubDomains"
  response.header.connection = "keep-alive"
  response.header.docker-distribution-api-version = "registry/2.0"
  response.header.x-ms-request-id = "313dff1f-6ee8-44dd-9f09-f1f3f816a929"
  response.status = "200 OK"
  response.header.access-control-expose-headers.1 = "WWW-Authenticate"
  response.header.server = "AzureContainerRegistry"
  response.header.access-control-expose-headers = "Docker-Content-Digest"
  response.header.content-length = 549
  url = "https://org.azurecr.io/v2/acme/policy/manifests/latest"

[DEBUG] resolved
  host = "org.azurecr.io"
  desc.digest = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"

[DEBUG] do request
  digest = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  url = "https://org.azurecr.io/v2/acme/policy/manifests/sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  request.method = "GET"
  request.header.accept = "application/vnd.oci.image.manifest.v1+json, */*"
  request.header.user-agent = "containerd/1.7.23+unknown"

[DEBUG] fetch response received
  response.header.access-control-expose-headers = "Docker-Content-Digest"
  response.header.docker-distribution-api-version = "registry/2.0"
  response.header.date = "Sat, 23 Nov 2024 17:05:02 GMT"
  response.header.access-control-expose-headers.3 = "X-Ms-Correlation-Request-Id"
  response.header.x-ms-correlation-request-id = "61e072b2-1176-44ae-a63e-a7507a596260"
  response.header.content-length = 549
  response.header.connection = "keep-alive"
  response.header.x-content-type-options = "nosniff"
  response.header.access-control-expose-headers.2 = "Link"
  response.header.content-type = "application/vnd.oci.image.manifest.v1+json"
  digest = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.header.etag = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.header.strict-transport-security.1 = "max-age=31536000; includeSubDomains"
  response.header.access-control-expose-headers.1 = "WWW-Authenticate"
  response.header.server = "AzureContainerRegistry"
  url = "https://org.azurecr.io/v2/acme/policy/manifests/sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.header.strict-transport-security = "max-age=31536000; includeSubDomains"
  response.header.x-ms-request-id = "f55af762-73c8-4ab4-a50c-9527adbbe0da"
  response.header.x-ms-client-request-id = ""
  response.header.docker-content-digest = "sha256:a17a80e88aa467b8101eae2c90c209ca07006483d5663c02a4bc8593b87c1320"
  response.status = "200 OK"

[DEBUG] do request
  request.method = "GET"
  request.header.accept = "application/vnd.oci.image.layer.v1.tar+gzip, */*"
  request.header.user-agent = "containerd/1.7.23+unknown"
  url = "https://org.azurecr.io/v2/acme/policy/blobs/sha256:2c2def83472ef874c864e9a9af91e57cffef3076537e34919cfb5b2d459d6b7d"
  digest = "sha256:2c2def83472ef874c864e9a9af91e57cffef3076537e34919cfb5b2d459d6b7d"

[DEBUG] do request
  digest = "sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356"
  url = "https://org.azurecr.io/v2/acme/policy/blobs/sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356"
  request.header.accept = "application/vnd.oci.image.config.v1+json, */*"
  request.method = "GET"
  request.header.user-agent = "containerd/1.7.23+unknown"

[DEBUG] fetch response received
  response.header.date = "Sat, 23 Nov 2024 17:05:02 GMT"
  response.header.content-length = 321
  response.status = "403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature."
  response.header.content-type = "application/xml"
  response.header.server = "Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0"
  response.header.x-ms-request-id = "2ec3e60f-f01e-006f-08c9-3d83bc000000"
  digest = "sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356"
  url = "https://org.azurecr.io/v2/acme/policy/blobs/sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356"

[ERROR] Bundle load failed: failed to pull org.azurecr.io/acme/policy:latest: download for 'org.azurecr.io/acme/policy:latest' failed: failed to ingest: copy failed: httpReadSeeker: failed open: unexpected status code https://org.azurecr.io/v2/acme/policy/blobs/sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356: 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
  plugin = "bundle"
  name = "main"

@srenatus
Contributor

Can you try the edge image tag? Perhaps your binary was built from main and the docker image is the latest release.

@srenatus
Contributor

Never mind, you've already ruled that out. Hmm.

@srenatus
Contributor

srenatus commented Nov 23, 2024

OK, it's a bit of a long shot, but could this be time or timezone related? Is, for some reason, the time in your container host off? I'm wondering because request signatures usually have something time-related to them, so clock drift could give you an invalid signature.

I could rule that out -- since it's unlikely that my clock is adrift in the same way -- but I don't have any ACR setup to try this with.

@bluebrown
Author

Yeah, I don't really know what to do anymore. I guess this is one of those
cases where you have to use some Linux tracing tools. I'm not very good
with those.

I could create an ACR on our infrastructure and share access credentials,
if that would be useful for you or someone else who wants to give it a shot.

I will try to look into the time. But I want to point out that the Go code
snippet uses the same libraries and also runs on the same Debian
base image. So I would expect it to suffer from container-specific
time issues as well.

@srenatus
Contributor

I could create an ACR on our infrastructure and share access credentials, if that would be useful for you or someone wanting to give it a shot.

You can do that, but I'm unlikely to get around to try it until Monday. You could send them to me on the OPA slack.

and its also run the same debian base image.

Have you tried putting your custom go code into the exact same OPA image (via FROM openpolicyagent/opa:...) and see if it works the same there? This would let us rule out (I believe) that there's something specifically odd about our base image.

@srenatus srenatus added the investigating Issues being actively investigated label Nov 23, 2024
@bluebrown
Author

The ACR I shared with you yesterday now also works locally for me
with OPA. It exhibits the same behaviors:

  • works locally with opa
  • does NOT work in container with opa (official image)
  • does NOT work in container with opa (custom image)
  • works locally, with custom go code
  • works in container with custom go code

@ashutosh-narkar
Member

@carabasdaniel @DerGut have y'all seen this before?

@srenatus
Contributor

I have noticed that the local binary case doesn't do as many requests to ACR as the container. The container/non-container difference seems to be that ORAS is caching. And indeed, oci_download.go sets up something like /tmp/opa/oci as cache directory.

If we hinder that, by something like

func NewOCI(config Config, client rest.Client, path, storePath string) *OCIDownloader {
+	t := filepath.Join(os.TempDir(), time.Now().String())
+	localstore, err := oci.New(t)
-	localstore, err := oci.New(storePath)

we see the same behaviour: it fails repeatedly instead of pulling the bundle, even outside of the container. The issue thus has to do with pulling from ACR, and nothing to do with running in a container. The local cache only hid the problem; a fresh Docker container starts without that cache, so the problem surfaced there.

@srenatus
Contributor

So, I've been swarming on the problem together with @bluebrown for a bit this morning and we've found that at some point, OPA's HTTP auth and OCI code sets an Authorization header where there shouldn't be one.

  1. A request with the Authorization header receives a "307 Temporary Redirect".
  2. The request to the redirect URL must not contain an Authorization header to succeed.
  3. The header is injected anyway, and the request fails because the backend rejects it.

@bluebrown
Author

bluebrown commented Nov 25, 2024

To add more context, the redirect points to Azure Blob Storage, which ACR uses as its backend.
On redirect, the query parameters contain a SAS (shared access signature), which serves
as the auth method for the request.

[ERROR] Bundle load failed: failed to pull squirrel.azurecr.io/acme/policy:latest: download for 'squirrel.azurecr.io/acme/policy:latest' failed: failed to ingest: copy failed: httpReadSeeker: failed open: failed to do request: 
Get "https://weumanaged241.blob.core.windows.net/267c910ae8a4480eb1d0cdb0b248317f-g9t56skrew//docker/registry/v2/blobs/sha256/ca/ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356/data
    ?se=2024-11-25T10%3A39%3A51Z
    &sig=<redacted>&sp=r&spr=https&sr=b&sv=2018-03-28&regid=267c910ae8a4480eb1d0cdb0b248317f": 
context canceled
  name = "main"
  plugin = "bundle"

Therefore, adding an Authorization header on this request leads to ambiguity, since two
auth methods have been supplied.

https://learn.microsoft.com/en-us/azure/ai-services/translator/document-translation/how-to-guides/create-sas-tokens?tabs=Containers
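
To make the distinction concrete, here is a small sketch that tells the two kinds of URL apart by looking for the `sig` query parameter seen in the log above. Treating the presence of `sig` as the marker for a signed URL is an assumption made for illustration, and `hasSignedQuery` is a hypothetical helper; the example URLs are shortened versions of the ones in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// hasSignedQuery reports whether rawURL carries a non-empty "sig" query
// parameter, which is where the shared access signature appears in the
// redirect URL logged above.
func hasSignedQuery(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return u.Query().Get("sig") != "", nil
}

func main() {
	urls := []string{
		// Registry request: authenticated via the Authorization header.
		"https://org.azurecr.io/v2/acme/policy/blobs/sha256:ca3d163b",
		// Redirect target: authenticated via the signature in the query string.
		"https://weumanaged241.blob.core.windows.net/container/data?se=2024-11-25T10%3A39%3A51Z&sig=REDACTED&sp=r&spr=https&sr=b&sv=2018-03-28",
	}
	for _, raw := range urls {
		signed, err := hasSignedQuery(raw)
		fmt.Printf("signed=%t err=%v url=%s\n", signed, err, raw)
	}
}
```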

@bluebrown
Author

So, I am doing some research into whether the HTTP spec says
if or when auth headers should be forwarded on redirect.

I haven't found anything specific so far, but I found this issue:
axios/axios#2855.

Maybe we can do something similar and add another field to the
bearer auth plugin config, so that the user can specify whether
the HTTP client returned by the plugin config should drop the
headers on redirect.

What do you think about that @srenatus ?

@srenatus
Contributor

Hmm. I suppose it would work, yeah. What confused me a bit is that the resolver specifically re-authorizes redirects -- i.e., it does the steps to add the headers again here. And yet, when you use the gist you've created, that is not a problem. The header is not there. And yet, when the OPA OCI code is used, it is added again? So it seems like something is wrong, and that workaround would be duct tape applied to the problem 😉

On that axios link, they end up having follow-redirects drop the authorization header when the redirect goes to another host. That sounds reasonable, but I still wonder if there isn't some simple thing in the auth plugin <--> oci interaction we're doing that we shouldn't do. In a DM, you pointed at this code, maybe that is the simple thing we're doing wrong? 🤔

@bluebrown
Author

bluebrown commented Nov 26, 2024

As I understand it, the config returns an HTTP client that is passed to docker.Authorizer:

Client: client,

The client passed in is the one from the rest plugin, which calls the bearer auth plugin's prepare on each request.

err = c.config.authPrepare(req, c.authPluginLookup)

I think this is wrong. I think the docker resolver should not get a client that does any auth,
so perhaps we could pass in the defaultAuthPlugin when we detect the oci/bearer
combo, so that it won't do any auth.

func (m *Manager) AuthPlugin(name string) rest.HTTPAuthPlugin {

We don't need to call prepare; we can get the token from the config.Bearer.Token field
instead. The config is available via d.client.Config().

resolver, err := dockerResolver(plugin, d.client.Config(), d.logger)
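
As a rough illustration of reading the token from the config instead of calling prepare, the sketch below splits a `<name>:<secret>` token (the shape of `ACR_TOKEN` above) and exposes it through a per-host callback. To my understanding, containerd's docker authorizer accepts a callback of this shape via `docker.WithAuthCreds`; whether wiring it in like this fits OPA's downloader is an assumption, and `splitToken` and `credsFor` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitToken splits a "<name>:<secret>" token into its two parts.
func splitToken(token string) (string, string, error) {
	name, secret, ok := strings.Cut(token, ":")
	if !ok || name == "" || secret == "" {
		return "", "", errors.New("token must have the form <name>:<secret>")
	}
	return name, secret, nil
}

// credsFor returns a callback that hands out the credentials for the registry
// host only and returns empty credentials for every other host.
func credsFor(registryHost, token string) func(host string) (string, string, error) {
	return func(host string) (string, string, error) {
		if host != registryHost {
			return "", "", nil
		}
		return splitToken(token)
	}
}

func main() {
	// Placeholder token; a real one would come from the bearer config.
	creds := credsFor("org.azurecr.io", "org:example-secret")
	for _, host := range []string{"org.azurecr.io", "weumanaged241.blob.core.windows.net"} {
		user, secret, err := creds(host)
		fmt.Printf("%s: user=%q hasSecret=%t err=%v\n", host, user, secret != "", err)
	}
}
```

With credentials resolved per host, the HTTP client handed to the resolver would no longer need to add any header itself.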

@srenatus srenatus changed the title oci downloader fails in opa docker image oci downloader fails with ACR (token header mistakenly added on redirect) Nov 29, 2024

stale bot commented Dec 30, 2024

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.
