Output filer mangles bucket name when using the s3:// scheme and the bucket name contains the characters "s3". #45

Open
JossWhittle opened this issue Nov 23, 2023 · 10 comments

Comments

@JossWhittle

Assume that TESK is deployed with a config file that sets the output endpoint to some s3 instance in http or https format...

[default]
endpoint_url=http://some.endpoint.com

Then, in the job JSON, the "url" for outputs is set to the following:

# This works! 
"url": "s3://output",

The s3:// scheme means "output" gets treated as the bucket name.

# These all fail!  
"url": "s3://outputs3",
"url": "s3://s3output",

# Here is a less contrived example showing how this can happen when you don't even intentionally use "s3" to mean "s3"
"url": "s3://shoulders3486output",

The s3 scheme is detected, but because the bucket name also contains "s3" it falsely triggers this regex:

match = re.search('^([^.]+).s3', self.netloc)

This mangles the bucket name, leading to a "bucket not found" error.
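For reference, the mangling is easy to reproduce in a Python shell (a minimal sketch; the netloc values below are just the bucket names from the examples above, which is what you get when splitting an s3:// URL):

import re

# Reproduces the reported behaviour: for s3:// URLs the netloc is the bucket
# name, and the regex chops it off at the first "<any char>s3" it finds.
for netloc in ['output', 'outputs3', 'shoulders3486output']:
    match = re.search('^([^.]+).s3', netloc)
    print(netloc, '->', match.group(1) if match else 'no match')

# output              -> no match
# outputs3            -> outpu
# shoulders3486output -> shoulde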

But we can trick it...

# This works! 
"url": "http://s3.foo.bar.baz/shoulders3486output",

HTTP is detected as the scheme, but the netloc part of the URL contains "s3", so it is treated as S3 due to this logic:

if 's3' in netloc:
    return S3Transput

The bucket name is now part of the URL "path" not the URL "netloc", so it doesn't get mangled.

The netloc part (s3.foo.bar.baz) is never actually used for anything other than detecting whether it's an S3 transfer or an HTTP transfer.
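To make that concrete, here is roughly how the two URL styles split apart (a small sketch assuming the filer splits URLs with urllib.parse, which is where the scheme/netloc/path terminology above comes from):

from urllib.parse import urlsplit

parts = urlsplit('http://s3.foo.bar.baz/shoulders3486output')
# parts.scheme == 'http', parts.netloc == 's3.foo.bar.baz', parts.path == '/shoulders3486output'
# "s3" appears in the netloc, so S3Transput is chosen, but the bucket name sits in the path and survives.

parts = urlsplit('s3://shoulders3486output')
# parts.netloc == 'shoulders3486output'  -> this is what the regex above mangles.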

@uniqueg
Member

uniqueg commented Nov 30, 2023

Thanks for reporting. @lvarin: do you have time to take a look?

@lvarin
Contributor

lvarin commented Dec 1, 2023

Sure, I will check this.

@lvarin
Contributor

lvarin commented Dec 1, 2023

After a first look, the lines reported are indeed the key here.

The easiest fix would be to only support S3 when s3:// is used as the scheme. Then, if https:// is used, TESK would use plain HTTP(S) to get the files, but not upload them. I do not like this option because it feels too much like a "spherical cow" solution.

Another simple fix could be to do these changes:

diff --git a/src/tesk_core/filer.py b/src/tesk_core/filer.py
index 4c20679..b646aab 100755
--- a/src/tesk_core/filer.py
+++ b/src/tesk_core/filer.py
@@ -413,7 +413,7 @@ def newTransput(scheme, netloc):
     elif scheme == 'file':
         return fileTransputIfEnabled()
     elif scheme in ['http', 'https']:
-        if 's3' in netloc:
+        if re.match('^(([^\.]+)\.)?s3\.', netloc):
             return S3Transput
         return HTTPTransput
     elif scheme == 's3':
diff --git a/src/tesk_core/filer_s3.py b/src/tesk_core/filer_s3.py
index 80354c3..1c813b8 100644
--- a/src/tesk_core/filer_s3.py
+++ b/src/tesk_core/filer_s3.py
@@ -42,9 +42,9 @@ class S3Transput(Transput):
         If the s3 url are of following formats
         1. File type = FILE
             * http://mybucket.s3.amazonaws.com/file.txt
-            * http://mybucket.s3-aws-region.amazonaws.com/file.txt
+            * http://mybucket.s3.aws-region.amazonaws.com/file.txt
             * http://s3.amazonaws.com/mybucket/file.txt
-            * http://s3-aws-region.amazonaws.com/mybucket/file.txt
+            * http://s3.aws-region.amazonaws.com/mybucket/file.txt
             * s3://mybucket/file.txt
 
             return values will be
@@ -52,16 +52,16 @@ class S3Transput(Transput):
 
         2. File type = DIRECTORY
             * http://mybucket.s3.amazonaws.com/dir1/dir2/
-            * http://mybucket.s3-aws-region.amazonaws.com/dir1/dir2/
+            * http://mybucket.s3.aws-region.amazonaws.com/dir1/dir2/
             * http://s3.amazonaws.com/mybucket/dir1/dir2/
-            * http://s3-aws-region.amazonaws.com/mybucket/dir1/dir2/
+            * http://s3.aws-region.amazonaws.com/mybucket/dir1/dir2/
             * s3://mybucket/dir1/dir2/
 
             return values will be
             bucket name = mybucket , file path = dir1/dir2/
         """
 
-        match = re.search('^([^.]+).s3', self.netloc)
+        match = re.search('^(([^\.]+)\.)?s3\.', self.netloc)
         if match:
             bucket = match.group(1)
         else:

This is still not perfect, but I think it is a good compromise.

Any better suggestion?

(I have not tested it yet.)
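For reference, a quick match/no-match check of the proposed netloc pattern against the examples from this thread would look something like this (just a sketch, not a substitute for real tests; the last host name is a made-up example of a plain HTTP host that happens to contain "s3"):

import re

pattern = r'^(([^\.]+)\.)?s3\.'
for netloc in ['mybucket.s3.amazonaws.com',    # virtual-hosted style -> matches
               's3.amazonaws.com',             # path style           -> matches
               'shoulders3486output',          # s3:// bucket name    -> no match
               'files3server.example.com']:    # plain HTTP host      -> no match
    print(netloc, bool(re.match(pattern, netloc)))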

@lvarin
Contributor

lvarin commented Dec 1, 2023

(edited the regexp)

@uniqueg
Member

uniqueg commented Dec 2, 2023

I feel like I really don't know enough about s3 and the URL types used to access s3 buckets to meaningfully contribute to this discussion.

However, I do feel that relying on regexes is problematic (unless we can be 100% sure they will never fail). If I see this correctly, the following would match the original (^([^.]+).s3) regex:

https://s3.bla.com
https://bla.s3.com
https://s3bla.bla.com
https://bla.s3bla.com

But surely not all URLs of these formats necessarily point to S3 resources.

So intuitively, I tend to think that being explicit and requiring the s3:// scheme in the outputs specification is a reasonable idea. Then for inputs (but not outputs), we can fetch the files like you suggest in your first proposal, @lvarin, depending on the scheme (https or s3). Why do you think it's a spherical cow?

@lvarin
Contributor

lvarin commented Dec 4, 2023

After thinking about this over the weekend, I think it is not really a spherical cow and we can just require the s3:// protocol.

@lvarin
Contributor

lvarin commented Dec 7, 2023

I have a few more thoughts about this after reading the code further:

  • The host name of the URL is currently never taken into account; only the configured endpoint_url is used. The parsing of the host name (netloc) is only used to extract the bucket name and to detect whether it is an S3 URL.
  • Any S3 URL written in "HTTP format" can be translated into s3:// format (see the sketch below).
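To illustrate the second point, here is a rough sketch (not the actual PR; http_to_s3 is just an illustrative name) of translating an "HTTP format" S3 URL into s3:// format, assuming the two URL layouts listed in the filer_s3.py docstring:

import re
from urllib.parse import urlsplit

def http_to_s3(url):
    """Sketch: rewrite a virtual-hosted or path-style S3 HTTP URL as s3://bucket/key."""
    parts = urlsplit(url)
    match = re.match(r'^([^.]+)\.s3\.', parts.netloc)
    if match:
        # Virtual-hosted style: http://<bucket>.s3[.region].amazonaws.com/<key>
        bucket, key = match.group(1), parts.path.lstrip('/')
    else:
        # Path style: http://s3[.region].amazonaws.com/<bucket>/<key>
        bucket, _, key = parts.path.lstrip('/').partition('/')
    return f's3://{bucket}/{key}'

# http_to_s3('http://mybucket.s3.amazonaws.com/file.txt')            -> 's3://mybucket/file.txt'
# http_to_s3('http://s3.aws-region.amazonaws.com/mybucket/file.txt') -> 's3://mybucket/file.txt'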

I will make the PR.

@lvarin
Contributor

lvarin commented Dec 12, 2023

This solves it:
#46

@JossWhittle
Author

@lvarin Many thanks! :) We will give this a test.

@lvarin
Contributor

lvarin commented Dec 12, 2023

Let me know if you run into any problems.
