Merge branch 'release/2.0.0'

ctsit · Oct 16, 2018 · 458bd42
2 parents c7b4993 + 05235a9
commit 458bd42
Showing 57 changed files with 1,269 additions and 4,350 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,19 +1,25 @@
-## [1.3.0] 10-24-2017
+# Changelog
+
+## [2.0.0] 2018-10-16
+### Changed
+ * Moved from YAML config files to a domain-specific language. (Patrick White)
+
+## [1.3.0] 2017-10-24
 ### Added
  * add docopt support (Matthew McConnell)
  * clean up version and dev-reqs files (Matthew McConnell)
  * update auditor main to use new version file (Matthew McConnell)
  * create a version file (Matthew McConnell)
  * specify the specific version of requirements (Matthew McConnell)
 
-## [1.2.0] 8-28-2017
+## [1.2.0] 2017-08-28
 ### Added
-* add strip_whitespace mapping (Patrick White)
+ * add strip_whitespace mapping (Patrick White)
 
 ### Fixed
-* fix regex default matching (Patrick White)
+ * fix regex default matching (Patrick White)
 
-## [1.1.0] 8-13-2017
+## [1.1.0] 2017-08-13
 ### Added
  * add regex example (Patrick White)
  * add deidentify example (Patrick White)
@@ -22,7 +28,7 @@
 ### Fixed
  * bugfix for blank lines, empty config keys (Patrick White)
 
-## [1.0.0]
+## [1.0.0] 2017-04-27
  * updates setup.py to be 1.0.0 (Patrick White)
  * let users use different encodings, and add quotechars on the way out (Patrick White)
  * allows config to not define new headers (Patrick White)

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# auditor v1.2.0
+# auditor v2.0.0
 Makes sure your CSV's remain compliant
 
 There are times that CSV files need either cleaning, or replacement of certain items, or filtering 
@@ -9,20 +9,68 @@ everything is ready to go before your application needs to do the rest of the da
 
 ## Configs
 
-Look in docs/config_information.org for a detailed treatment of the configuration options
+Auditor no longer uses configs as of version `2.0.0`. Please consult the examples to figure out how to
+migrate to the new system.
+
+## Usage
+
+Auditor defines a Domain Specific Language (D.S.L.), which is yet to be named. It allows one to write an
+auditor program that operates on a csv row by row to do things like rename columns, apply
+data transforms, remove whitespace, compare one value in a row to another, whitelist and blacklist items,
+and do lookups.
+
+`$ auditor <path_to_my_program>`
+
+## Development
+
+Adding further features to auditor falls into two categories: new column transforms and other things.
+If all you need is to add a transform to the `col` block to implement something, for example fixed length
+fields, follow these steps:
+
+  * fork this repo
+  * clone this repository
+  * install the dev-requirements with `pip3 install -r dev-requirements.txt`
+  * `cd <repo-clone-directory>/auditor/transforms`
+  * run `python3 ../dev_scripts/make_new_col_transform.py <transforms-you-need>...`
+  * edit the code till it works, testing against a program and data file in the examples directory
+  * submit a pull request
 
 ## Examples
 
-Look in the examples folder for some examples and feel free to add some!
+Look in the examples folder for a sample program of all features that are currently implemented.
 
-## Usage
+## Language specification
+
+There are two parts to the auditor D.S.L.: the preamble, which describes the two csv files, their paths,
+how they are structured and whether to remove bad data; and the `col` blocks, which describe how to transform
+a given row column pair in a csv.
+
+In the D.S.L. you can specify a string with spaces by surrounding it with `"` characters.
+
+### Preamble
+
+| keyword           | meaning / function                                             |      # args | arg types                                       |
+|-------------------|----------------------------------------------------------------|-------------|-------------------------------------------------|
+| `read`            | specifies the input file                                       |           1 | relative file path from where the script is run |
+| `write`           | specifies the output file                                      |           1 | relative file path from where the script is run |
+| `separator`       | the character that separates columns in the input file         |           1 | single unescaped character                      |
+| `quotechar`       | the character that quotes a cell in the input file             |           1 | single unescaped character                      |
+| `encoding`        | the encoding of the input file                                 |           1 | an encoding string python3 understands          |
+| `column_add`      | adds any columns to the output                                 | more than 0 | space separated list of column names to add     |
+| `column_order`    | the output order of the columns                                | num of cols | space separated list of column names in order   |
+| `column_rename`   | rename column from first arg to second                         |           2 | space separated list of old column name and new |
+| `remove_bad_data` | flag to remove rows from the output with a `<BAD_DATA>` string |           0 | this arg is a flag and takes no args            |
 
-  First run auditor on the file you want to alter. This will give a csv with the same number of
-rows with some cells replaced by control strings.
+Note that columns not listed in the column order will not be put into the output file.
 
-  Then run auditor with the -c flag on the control string output. This will give a much smaller
-csv that only has the rows that you want. No blacklisted items, only whitelisted, no empty data
-no bad data.
+### col blocks
 
-`$ auditor raw_data.txt auditor.conf.yaml -o data/audited.unclean.csv -v > logs/auditor.unclean.log`
-`$ auditor -c data/audited.unclean.csv auditor.conf.yaml -o data/auditor.clean.csv -v > logs/auditor.clean.log`
+These describe the sequence of transforms to take place in a run of auditor. Each block should have:
+  * A newline before and after
+  * A start line of `col <column_name> <optional_priority>`
+  * An ending line of `| done`
+
+The column name in the first line should be the name after renaming. The priority denotes which blocks get executed first.
+Since a column has access to the rest of the row, there are times you want to do something before something else.
+Higher priority col blocks get executed first. If two col blocks have the same priority, there is no defined behavior for
+which goes first.
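
The priority rule described at the end of the README hunk above can be illustrated with a minimal Python sketch. `ColBlock` and `execution_order` are hypothetical names and are not part of auditor. The default priority of 0 for a block that omits the optional priority is also an assumption, since the README does not state one.

```python
from dataclasses import dataclass


@dataclass
class ColBlock:
    """Hypothetical stand-in for a parsed `col` block; not auditor's real type."""
    column_name: str
    priority: int = 0  # assumed default when the optional priority is omitted


def execution_order(blocks):
    """Return the blocks with higher-priority blocks first."""
    return sorted(blocks, key=lambda block: block.priority, reverse=True)


# Three blocks as they might appear in a program file, in source order.
blocks = [ColBlock("age"), ColBlock("dob", 10), ColBlock("site", 5)]
ordered_names = [block.column_name for block in execution_order(blocks)]
print(ordered_names)  # ['dob', 'site', 'age']
```

Python's `sorted` is stable, so in this sketch blocks with equal priority keep their source order. The README leaves that case undefined, so an auditor program should not rely on any particular tie-breaking.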
diff --git a/auditor/__init__.py b/auditor/__init__.py
@@ -1 +1,4 @@
-from auditor.mappings import Mappings
+from auditor.base_exceptions import *
+from auditor.compiler import *
+from auditor.interpreter import *
+from auditor.transforms import *
diff --git a/auditor/__main__.py b/auditor/__main__.py
@@ -1,175 +1,33 @@
 docstr = """
 Auditor
 
-Usage: auditor.py [-hcv] (<file> <config>) [-o <output.csv>] [-c --clean] [--verbose]
+Usage: auditor (<program_path>)
+auditor migrate ( <old_config> <new_program> )
 
 Options:
   -h --help                                     show this message and exit
-  -v --version                                  Show version
-  -o <output.csv> --output=<output.csv>         optional output file for results
-  -c --clean                                    remove rows of a csv that have control strings
-  --verbose                                     print errors with the mappings handler
 
-Instructions:
-  First run auditor on the file you want to alter. This will give a csv with the same number of
-rows with some cells replaced by control strings.
-  Then run auditor with the -c flag on the control string output. This will give a much smaller
-csv that only has the rows that you want. No blacklisted items, only whitelisted, no empty data
-no bad data.
 
-$ auditor raw_data.txt auditor.conf.yaml -o data/audited.unclean.csv -v > logs/auditor.unclean.log
-$ auditor -c data/audited.unclean.csv auditor.conf.yaml -o data/auditor.clean.csv -v > logs/auditor.clean.log
+Auditor is used to run auditor program files to alter csv files
 """
-import csv
-
 from docopt import docopt
-import yaml
-
-from .mappings import Mappings
 from auditor.version import __version__
 
-_file = '<file>'
-_config = '<config>'
-_output = '--output'
-_do_clean = '--clean'
-_verbose = '--verbose'
-
-def main(args):
-    with open(args[_config], 'r') as config_file:
-        global config
-        config = yaml.load(config_file.read())
-
-    csv_file = open(args[_file], 'r', encoding=config['csv_encoding'])
-    data = csv.reader(csv_file, **config['csv_conf'])
-
-    if not args[_do_clean]:
-        data = do_add_headers(data, config.get('new_headers'))
-        new_rows = do_audit(data, verbose=args[_verbose])
-    else:
-        new_rows = do_clean(data)
-
-    if args[_do_clean] and config.get('sort'):
-        index = config['headers'].index(config['sort']['header'])
-        header = new_rows[0]
-        new_rows = new_rows[1:]
-        new_rows.sort(key=lambda row : row[index])
-        new_rows.insert(0, header)
-
-    if args.get(_output):
-        with open(args[_output], 'w') as outfile:
-            outfile.write(rows_format(new_rows))
+from auditor.compiler import AuditorCompiler
+from auditor.interpreter import Interpreter
+from auditor.old_config_parser import ConfigParser
+
+def main(args=docopt(docstr)):
+    if not args.get('migrate'):
+        program_path = args['<program_path>']
+        compiler = AuditorCompiler()
+        instructions = compiler(program_path)
+        interpreter = Interpreter(instructions)
+        interpreter()
     else:
-        print(rows_format(new_rows))
+        parser = ConfigParser(args.get('<old_config>'))
+        parser.write(args.get('<new_program>'))
 
-def do_add_headers(data, new_headers):
-    new_rows = []
-    if not new_headers:
-        return data
-    for key in new_headers.keys():
-        header_data = new_headers[key]
-        with open(header_data['lookup_file'], 'r') as lookup_file:
-            lookup_data = yaml.load(lookup_file.read())
-        for index, row in enumerate(data):
-            if index == 0:
-                old_headers = row
-                row.append(new_headers[key]['name'])
-                try:
-                    new_rows[index] = row
-                except IndexError:
-                    new_rows.append(row)
-            else:
-                lookup_key = row[old_headers.index(header_data['key'])]
-                lookup_value = lookup_data.get(lookup_key) or header_data['default'] or ''
-                row.append(lookup_value)
-                try:
-                    new_rows[index] = row
-                except IndexError:
-                    new_rows.append(row)
-    return new_rows
-
-def do_audit(data, verbose):
-    new_rows = []
-    indices = None
-    new_header = None
-    mappings = Mappings(config, verbose=verbose)
-    for index, row in enumerate(data):
-        if index == 0:
-            new_header = get_header(row)
-            indices = [row.index(header) for header in new_header]
-            new_rows.append(new_header)
-        else:
-            apply_map = get_map(new_header, mappings)
-            new_row = get_new_data_row(row, indices, new_header, apply_map)
-            if new_row:
-                new_rows.append(new_row)
-    return new_rows
-
-def get_header(row):
-    new_row = []
-    for col in row:
-        if col in config['headers']:
-            new_row.append(col)
-    return new_row
-
-def get_map(headers, mappings):
-    def apply_map(index, row):
-        nonlocal headers
-        nonlocal mappings
-        cell = row[index]
-        for mapping in config['mappings']:
-            if headers[index] == mapping['header']:
-                for map_index, my_map in enumerate(mapping['maps']):
-                    kwargs = {
-                        'item': cell,
-                        'headers': headers,
-                        'header': headers[index],
-                        'index': index,
-                        'row': row,
-                        'map': mapping['maps'][map_index]
-                    }
-                    cell = mappings.handler(**kwargs)
-        return cell
-    return apply_map
-
-def get_new_data_row(row, indices, header, apply_map):
-    if len(row):
-        raw = [row[index] for index in indices]
-        mapped = [apply_map(index, raw) for index, cell in enumerate(raw)]
-        valid = True
-        for cell in mapped:
-            if cell == '' or cell == None:
-                valid = False
-        return mapped if valid else None
-    else:
-        return None
-
-def do_clean(data):
-    new_rows = []
-    error_strings = config['error_strings'].values()
-    for row in data:
-        valid = True
-        for cell in row:
-            if cell in error_strings:
-                valid = False
-        for index, cell in enumerate(row):
-            if cell == config['control_strings']['empty_okay']:
-                row[index] = ''
-        if valid:
-            new_rows.append(row)
-    return new_rows
-
-def rows_format(rows):
-    quotechar = config['quotechar_write']
-    delim = config['csv_conf']['delimiter']
-    text = None
-    for row in rows:
-        srow = [quotechar + str(cell) + quotechar for cell in row]
-        line = delim.join(srow)
-        if not text:
-            text = line
-        else:
-            text = '\n'.join([text, line])
-    return text + '\n'
 
 def cli_run():
     args = docopt(docstr, version=__version__)
@@ -178,4 +36,3 @@ def cli_run():
 if __name__ == '__main__':
     cli_run()
     exit()
-
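
The control flow of the new `main` (compile the program, then interpret it, or else migrate an old config) can be sketched in isolation as follows. This is only an illustration: `dispatch` and the three `fake_*` callables are hypothetical stand-ins for `AuditorCompiler`, `Interpreter` and `ConfigParser`, whose real behavior is not shown in this diff, and the `args` dicts only mimic what docopt would return for the usage string above.

```python
def dispatch(args, compile_program, make_interpreter, migrate_config):
    """Route parsed CLI arguments to the run path or the migrate path.

    All three callables are hypothetical stand-ins, injected so the
    branching can be exercised without the real auditor package.
    """
    if not args.get('migrate'):
        instructions = compile_program(args['<program_path>'])
        run = make_interpreter(instructions)
        run()
        return 'run'
    migrate_config(args['<old_config>'], args['<new_program>'])
    return 'migrate'


calls = []  # records what each stand-in was asked to do


def fake_compile(path):
    calls.append(('compile', path))
    return ['instruction']


def fake_interpreter(instructions):
    def run():
        calls.append(('interpret', tuple(instructions)))
    return run


def fake_migrate(old_config, new_program):
    calls.append(('migrate', old_config, new_program))


run_result = dispatch(
    {'migrate': False, '<program_path>': 'prog.aud'},
    fake_compile, fake_interpreter, fake_migrate)
migrate_result = dispatch(
    {'migrate': True, '<old_config>': 'old.yaml', '<new_program>': 'new.aud'},
    fake_compile, fake_interpreter, fake_migrate)
print(run_result, migrate_result)
```

The sketch takes `args` as a required parameter. In the committed code, `def main(args=docopt(docstr))` evaluates `docopt(docstr)` once, when the function is defined, because Python evaluates default argument values at definition time; passing the parsed arguments in explicitly avoids parsing the command line on import.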