Output filer mangles bucket name when using the s3:// scheme and the bucket name contains the characters "s3". #45

Open
JossWhittle opened this issue Nov 23, 2023 · 10 comments

Comments

@JossWhittle

Assume that TESK is deployed with a config file that sets the output endpoint to some s3 instance in http or https format...

[default]
endpoint_url=http://some.endpoint.com

Then, in the job JSON, the "url" for outputs is set to the following:

# This works! 
"url": "s3://output",

The s3:// scheme means "output" gets treated as the bucket name.

# These all fail!  
"url": "s3://outputs3",
"url": "s3://s3output",

# Here is a less contrived example showing how this can happen when you don't even intentionally use "s3" to mean "s3"
"url": "s3://shoulders3486output",

The s3 scheme is detected, but because the bucket name also contains "s3" it falsely triggers this regex:

match = re.search('^([^.]+).s3', self.netloc)

This mangles the bucket name, leading to a "bucket not found" error.
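For reference, the mangling is easy to reproduce in a Python shell (a minimal sketch; the netloc values below are just the bucket names from the examples above, which is what you get when splitting an s3:// URL):

import re

# Reproduces the reported behaviour: for s3:// URLs the netloc is the bucket
# name, and the regex chops it off at the first "<any char>s3" it finds.
for netloc in ['output', 'outputs3', 'shoulders3486output']:
    match = re.search('^([^.]+).s3', netloc)
    print(netloc, '->', match.group(1) if match else 'no match')

# output              -> no match
# outputs3            -> outpu
# shoulders3486output -> shoulde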

But we can trick it...

# This works! 
"url": "http://s3.foo.bar.baz/shoulders3486output",

HTTP is detected as the scheme, but the netloc part of the URL contains "s3", so it is treated as S3 due to this logic:

if 's3' in netloc:
    return S3Transput

The bucket name is now part of the URL "path" not the URL "netloc", so it doesn't get mangled.

The netloc part (s3.foo.bar.baz) is never actually used for anything other than detecting whether it's an S3 transfer or an HTTP transfer.
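To make that concrete, here is roughly how the two URL styles split apart (a small sketch assuming the filer splits URLs with urllib.parse, which is where the scheme/netloc/path terminology above comes from):

from urllib.parse import urlsplit

parts = urlsplit('http://s3.foo.bar.baz/shoulders3486output')
# parts.scheme == 'http', parts.netloc == 's3.foo.bar.baz', parts.path == '/shoulders3486output'
# "s3" appears in the netloc, so S3Transput is chosen, but the bucket name sits in the path and survives.

parts = urlsplit('s3://shoulders3486output')
# parts.netloc == 'shoulders3486output'  -> this is what the regex above mangles.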

@uniqueg
Member

uniqueg commented Nov 30, 2023

Thanks for reporting. @lvarin: do you have time to take a look?

@lvarin
Contributor

lvarin commented Dec 1, 2023

Sure, I will check this.

@lvarin
Contributor

lvarin commented Dec 1, 2023

After a first look, the lines reported are indeed the key here.

The easiest fix would be to only support S3 when s3:// is used as the scheme. Then, if https:// is used, TESK would use plain HTTP(S) to get the files, but not upload them. I do not like this option because it feels too much like a "spherical cow" solution.

Another simple fix could be to do these changes:

diff --git a/src/tesk_core/filer.py b/src/tesk_core/filer.py
index 4c20679..b646aab 100755
--- a/src/tesk_core/filer.py
+++ b/src/tesk_core/filer.py
@@ -413,7 +413,7 @@ def newTransput(scheme, netloc):
     elif scheme == 'file':
         return fileTransputIfEnabled()
     elif scheme in ['http', 'https']:
-        if 's3' in netloc:
+        if re.match('^(([^\.]+)\.)?s3\.', netloc):
             return S3Transput
         return HTTPTransput
     elif scheme == 's3':
diff --git a/src/tesk_core/filer_s3.py b/src/tesk_core/filer_s3.py
index 80354c3..1c813b8 100644
--- a/src/tesk_core/filer_s3.py
+++ b/src/tesk_core/filer_s3.py
@@ -42,9 +42,9 @@ class S3Transput(Transput):
         If the s3 url are of following formats
         1. File type = FILE
             * http://mybucket.s3.amazonaws.com/file.txt
-            * http://mybucket.s3-aws-region.amazonaws.com/file.txt
+            * http://mybucket.s3.aws-region.amazonaws.com/file.txt
             * http://s3.amazonaws.com/mybucket/file.txt
-            * http://s3-aws-region.amazonaws.com/mybucket/file.txt
+            * http://s3.aws-region.amazonaws.com/mybucket/file.txt
             * s3://mybucket/file.txt
 
             return values will be
@@ -52,16 +52,16 @@ class S3Transput(Transput):
 
         2. File type = DIRECTORY
             * http://mybucket.s3.amazonaws.com/dir1/dir2/
-            * http://mybucket.s3-aws-region.amazonaws.com/dir1/dir2/
+            * http://mybucket.s3.aws-region.amazonaws.com/dir1/dir2/
             * http://s3.amazonaws.com/mybucket/dir1/dir2/
-            * http://s3-aws-region.amazonaws.com/mybucket/dir1/dir2/
+            * http://s3.aws-region.amazonaws.com/mybucket/dir1/dir2/
             * s3://mybucket/dir1/dir2/
 
             return values will be
             bucket name = mybucket , file path = dir1/dir2/
         """
 
-        match = re.search('^([^.]+).s3', self.netloc)
+        match = re.search('^(([^\.]+)\.)?s3\.', self.netloc)
         if match:
             bucket = match.group(1)
         else:

This is still not perfect, but I think it is a good compromise.

Any better suggestion?

(I have not tested it yet.)
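For reference, a quick match/no-match check of the proposed netloc pattern against the examples from this thread would look something like this (just a sketch, not a substitute for real tests; the last host name is a made-up example of a plain HTTP host that happens to contain "s3"):

import re

pattern = r'^(([^\.]+)\.)?s3\.'
for netloc in ['mybucket.s3.amazonaws.com',    # virtual-hosted style -> matches
               's3.amazonaws.com',             # path style           -> matches
               'shoulders3486output',          # s3:// bucket name    -> no match
               'files3server.example.com']:    # plain HTTP host      -> no match
    print(netloc, bool(re.match(pattern, netloc)))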

@lvarin
Contributor

lvarin commented Dec 1, 2023

(edited the regexp)

@uniqueg
Member

uniqueg commented Dec 2, 2023

I feel like I really don't know enough about s3 and the URL types used to access s3 buckets to meaningfully contribute to this discussion.

However, I do feel that relying on regexes is problematic (unless we can be 100% sure they will never fail). If I see this correctly, the following would match the original (^([^.]+).s3) regex:

https://s3.bla.com
https://bla.s3.com
https://s3bla.bla.com
https://bla.s3bla.com

But surely not all URLs of these formats necessarily point to S3 resources.

So intuitively, I tend to think that being explicit and requiring the s3:// scheme in the outputs specification is a reasonable idea. Then for inputs (but not outputs), we can fetch the files like you suggest in your first proposal, @lvarin, depending on the scheme (https or s3). Why do you think it's a spherical cow?

@lvarin
Contributor

lvarin commented Dec 4, 2023

After thinking about this over the weekend, I think it is not really a spherical cow and we can just require the s3:// protocol.

@lvarin
Contributor

lvarin commented Dec 7, 2023

I have a few more thoughts about this after reading the code further:

  • The host name of the URL is currently never taken into account; only the configured endpoint_url is used. The parsing of the host name (netloc) is only used to extract the bucket name and to detect whether it is an S3 URL.
  • Any S3 URL written in "HTTP format" can be translated into s3:// format (see the sketch below).
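To illustrate the second point, here is a rough sketch (not the actual PR; http_to_s3 is just an illustrative name) of translating an "HTTP format" S3 URL into s3:// format, assuming the two URL layouts listed in the filer_s3.py docstring:

import re
from urllib.parse import urlsplit

def http_to_s3(url):
    """Sketch: rewrite a virtual-hosted or path-style S3 HTTP URL as s3://bucket/key."""
    parts = urlsplit(url)
    match = re.match(r'^([^.]+)\.s3\.', parts.netloc)
    if match:
        # Virtual-hosted style: http://<bucket>.s3[.region].amazonaws.com/<key>
        bucket, key = match.group(1), parts.path.lstrip('/')
    else:
        # Path style: http://s3[.region].amazonaws.com/<bucket>/<key>
        bucket, _, key = parts.path.lstrip('/').partition('/')
    return f's3://{bucket}/{key}'

# http_to_s3('http://mybucket.s3.amazonaws.com/file.txt')            -> 's3://mybucket/file.txt'
# http_to_s3('http://s3.aws-region.amazonaws.com/mybucket/file.txt') -> 's3://mybucket/file.txt'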

I will make the PR.

@lvarin
Contributor

lvarin commented Dec 12, 2023

This solves it:
#46

@JossWhittle
Author

@lvarin Many thanks! :) We will give this a test.

@lvarin
Contributor

lvarin commented Dec 12, 2023

Let me know if you run into any problems.
