Output filer mangles bucket name when using s3:// schema and bucket name contains the characters "s3". #45
Comments
Thanks for reporting. @lvarin: do you have time to take a look?
Sure, I will check this.
So after a first look, indeed the lines reported are key here. The easiest would be to only support S3 when the scheme is `s3`. Another simple fix could be to make these changes:

```diff
diff --git a/src/tesk_core/filer.py b/src/tesk_core/filer.py
index 4c20679..b646aab 100755
--- a/src/tesk_core/filer.py
+++ b/src/tesk_core/filer.py
@@ -413,7 +413,7 @@ def newTransput(scheme, netloc):
     elif scheme == 'file':
         return fileTransputIfEnabled()
     elif scheme in ['http', 'https']:
-        if 's3' in netloc:
+        if re.match('^(([^\.]+)\.)?s3\.', netloc):
             return S3Transput
         return HTTPTransput
     elif scheme == 's3':
diff --git a/src/tesk_core/filer_s3.py b/src/tesk_core/filer_s3.py
index 80354c3..1c813b8 100644
--- a/src/tesk_core/filer_s3.py
+++ b/src/tesk_core/filer_s3.py
@@ -42,9 +42,9 @@ class S3Transput(Transput):
         If the s3 url are of following formats
         1. File type = FILE
             * http://mybucket.s3.amazonaws.com/file.txt
-            * http://mybucket.s3-aws-region.amazonaws.com/file.txt
+            * http://mybucket.s3.aws-region.amazonaws.com/file.txt
             * http://s3.amazonaws.com/mybucket/file.txt
-            * http://s3-aws-region.amazonaws.com/mybucket/file.txt
+            * http://s3.aws-region.amazonaws.com/mybucket/file.txt
             * s3://mybucket/file.txt
             return values will be
@@ -52,16 +52,16 @@ class S3Transput(Transput):
         2. File type = DIRECTORY
             * http://mybucket.s3.amazonaws.com/dir1/dir2/
-            * http://mybucket.s3-aws-region.amazonaws.com/dir1/dir2/
+            * http://mybucket.s3.aws-region.amazonaws.com/dir1/dir2/
             * http://s3.amazonaws.com/mybucket/dir1/dir2/
-            * http://s3-aws-region.amazonaws.com/mybucket/dir1/dir2/
+            * http://s3.aws-region.amazonaws.com/mybucket/dir1/dir2/
             * s3://mybucket/dir1/dir2/
             return values will be
             bucket name = mybucket , file path = dir1/dir2/
         """
-        match = re.search('^([^.]+).s3', self.netloc)
+        match = re.search('^(([^\.]+)\.)?s3\.', self.netloc)
         if match:
             bucket = match.group(1)
         else:
```

This is still not perfect, but I think it is a good compromise. Any better suggestion? (I still did not test it yet.)
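As an illustration of what the proposed check accepts and rejects, here is a minimal, self-contained sketch (the hostnames are made-up examples, not from the patch):

```python
import re

# The check proposed in the diff above: an optional "<bucket>." label
# followed by "s3." at the start of the netloc.
S3_HOST = re.compile(r'^(([^\.]+)\.)?s3\.')

netlocs = [
    'mybucket.s3.amazonaws.com',    # virtual-hosted style -> matches
    's3.aws-region.amazonaws.com',  # path style -> matches
    'mys3bucket',                   # bare bucket from an s3:// URL -> no match
    'example.com',                  # ordinary HTTP host -> no match
]
for netloc in netlocs:
    print(netloc, '->', bool(S3_HOST.match(netloc)))
```

Unlike the old `'s3' in netloc` test, a bare bucket name such as `mys3bucket` no longer matches, which is exactly the false positive reported here.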
(edited the regexp)
I feel like I really don't know enough about S3 and the URL types used to access S3 buckets to meaningfully contribute to this discussion. However, I do feel that relying on regexes is problematic (unless we can be 100% sure they will never fail). If I see this correctly, both the original check and the proposed regex would match HTTP(S) URLs of the formats listed in the docstring above. But surely not all URLs of these formats must necessarily point to S3 resources. So intuitively, I tend to think that being explicit and requiring the `s3://` scheme is the safer option.
After thinking about this during the weekend, I think it is not really a spherical cow, and we can just require the `s3://` schema.
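For what requiring the schema could look like, here is a minimal sketch of scheme-only dispatch (an illustration with made-up transput names standing in for the classes in filer.py, not the actual PR):

```python
from urllib.parse import urlparse

def pick_transput(url):
    """Choose a transfer backend from the URL scheme alone,
    without inspecting the host name."""
    scheme = urlparse(url).scheme
    if scheme == 's3':
        return 'S3Transput'
    if scheme in ('http', 'https'):
        return 'HTTPTransput'
    raise ValueError('unknown scheme: %s' % scheme)

print(pick_transput('s3://mys3bucket/file.txt'))             # S3Transput
print(pick_transput('https://s3.example.com/bucket/f.txt'))  # HTTPTransput
```

The second call shows the point: a host that merely mentions "s3" stays plain HTTP unless the caller explicitly uses the `s3://` schema.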
I have more thoughts about this after reading the code more. I will make the PR.
This solves it:
@lvarin Many thanks! :) We will give this a test.
Let me know about any problems.
Assume that TESK is deployed with a config file that sets the output endpoint to some S3 instance in http or https format. Then, in the job JSON, the "url" for outputs is set to an `s3://` URL.
The s3 schema means "output" gets treated as the bucket name.
The s3 schema is detected, but because the bucket name also contains "s3", it falsely triggers this regex:

tesk-core/src/tesk_core/filer_s3.py, line 64 in 1a7b810: `match = re.search('^([^.]+).s3', self.netloc)`

which mangles the bucket name, leading to a bucket-not-found error.
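A small reproduction of the mangling, assuming a made-up bucket name `mys3bucket`:

```python
import re
from urllib.parse import urlparse

# For an s3:// URL, the bucket name ends up in the netloc component.
netloc = urlparse('s3://mys3bucket/file.txt').netloc   # 'mys3bucket'

# The regex from filer_s3.py line 64: the unescaped '.' matches any
# character, so the match stops at the first "s3" inside the name.
match = re.search('^([^.]+).s3', netloc)
print(match.group(1))   # prints 'm', not 'mys3bucket'
```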
But we can trick it...
HTTP is detected as the schema, but the netloc part of the URL contains "s3", so it is treated as S3 due to this logic:

tesk-core/src/tesk_core/filer.py, lines 416 to 417 in 1a7b810: `if 's3' in netloc: return S3Transput`
The bucket name is now part of the URL "path", not the URL "netloc", so it doesn't get mangled.
With a netloc like `s3.foo.bar.baz`, the netloc part is never actually used other than to detect whether it is an S3 transfer or an HTTP transfer.
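To illustrate the workaround, here is how urlparse splits the two URL styles (the bucket name is made up, with `s3.foo.bar.baz` as the endpoint from the example above):

```python
from urllib.parse import urlparse

# s3:// style: the bucket name is the netloc, so it is exposed to the regex.
s3_url = urlparse('s3://mys3bucket/file.txt')
print(s3_url.netloc, s3_url.path)    # mys3bucket /file.txt

# http:// style: the netloc is only the endpoint; the bucket moves into the
# path, which the S3 detection logic never rewrites.
http_url = urlparse('http://s3.foo.bar.baz/mys3bucket/file.txt')
print(http_url.netloc, http_url.path)  # s3.foo.bar.baz /mys3bucket/file.txt
```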