Merge branch 'release/2.0.0'
ufl-taeber committed Oct 16, 2018
2 parents c7b4993 + 05235a9 commit 458bd42
Showing 57 changed files with 1,269 additions and 4,350 deletions.
18 changes: 12 additions & 6 deletions CHANGELOG
@@ -1,19 +1,25 @@
## [1.3.0] 10-24-2017
# Changelog

## [2.0.0] 2018-10-16
### Changed
* Moved from YAML config files to a domain-specific language. (Patrick White)

## [1.3.0] 2017-10-24
### Added
* add docopt support (Matthew McConnell)
* clean up version and dev-reqs files (Matthew McConnell)
* update auditor main to use new version file (Matthew McConnell)
* create a version file (Matthew McConnell)
* specify the specific version of requirements (Matthew McConnell)

## [1.2.0] 8-28-2017
## [1.2.0] 2017-08-28
### Added
* add strip_whitespace mapping (Patrick White)
* add strip_whitespace mapping (Patrick White)

### Fixed
* fix regex default matching (Patrick White)
* fix regex default matching (Patrick White)

## [1.1.0] 8-13-2017
## [1.1.0] 2017-08-13
### Added
* add regex example (Patrick White)
* add deidentify example (Patrick White)
@@ -22,7 +28,7 @@
### Fixed
* bugfix for blank lines, empty config keys (Patrick White)

## [1.0.0]
## [1.0.0] 2017-04-27
* updates setup.py to be 1.0.0 (Patrick White)
* let users use different encodings, and add quotechars on the way out (Patrick White)
* allows config to not define new headers (Patrick White)
70 changes: 59 additions & 11 deletions README.md
@@ -1,4 +1,4 @@
# auditor v1.2.0
# auditor v2.0.0
Makes sure your CSVs remain compliant

There are times that CSV files need either cleaning, or replacement of certain items, or filtering
@@ -9,20 +9,68 @@ everything is ready to go before your application needs to do the rest of the da

## Configs

Look in docs/config_information.org for a detailed treatment of the configuration options
Auditor no longer uses configs as of version `2.0.0`. Please consult the examples to figure out how to
migrate to the new system.

## Usage

Auditor defines a domain-specific language (D.S.L.), which is yet to be named. An auditor program
operates on a csv row by row and can rename columns, apply data transforms, remove whitespace,
compare one value in a row to another, whitelist and blacklist items, and do lookups.

`$ auditor <path_to_my_program>`

## Development

Changes to auditor fall into two categories: new column transforms and everything else.
If all you need is a new transform for the `col` block, for example to support fixed-length
fields, follow these steps:

* fork this repo
* clone this repository
* install the dev-requirements with `pip3 install -r dev-requirements.txt`
* `cd <repo-clone-directory>/auditor/transforms`
* run `python3 ../dev_scripts/make_new_col_transform.py <transforms-you-need>...`
* edit the code until it works, testing against a program and data file in the examples directory
* submit a pull request

## Examples

Look in the examples folder for some examples and feel free to add some!
Look in the examples folder for a sample program of all features that are currently implemented.

## Usage
## Language specification

The auditor D.S.L. has two parts: the preamble, which describes the two csv files (their paths,
how they are structured, and whether to remove bad data), and the `col` blocks, which describe how to
transform a given row-column pair in a csv.

In the D.S.L. you can specify a string with spaces by surrounding it with `"` characters.
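
The quoting rule can be illustrated with Python's standard `shlex` module. This is only a sketch of the idea, not auditor's own tokenizer: `tokenize` is a hypothetical helper, and `shlex` also gives special meaning to single quotes and backslashes, which auditor may not.

```python
import shlex


def tokenize(line):
    """Split one D.S.L. line on whitespace, keeping "quoted strings" whole.

    Hypothetical helper for illustration; auditor's real parser may differ.
    """
    return shlex.split(line)


# A column name containing spaces survives as a single token.
tokens = tokenize('column_rename "Date of Birth" dob')
print(tokens)
```

With this approach, the rename keyword receives exactly two arguments even though the old column name contains spaces.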

### Preamble

| keyword | meaning / function | # args | arg types |
|-------------------|----------------------------------------------------------------|-------------|-------------------------------------------------|
| `read` | specifies the input file | 1 | relative file path from where the script is run |
| `write` | specifies the output file | 1 | relative file path from where the script is run |
| `separator` | the character that separates columns in the input file | 1 | single unescaped character |
| `quotechar` | the character that quotes a cell in the input file | 1 | single unescaped character |
| `encoding` | the encoding of the input file | 1 | an encoding string python3 understands |
| `column_add` | adds any columns to the output | more than 0 | space separated list of column names to add |
| `column_order` | the output order of the columns | num of cols | space separated list of column names in order |
| `column_rename` | rename column from first arg to second | 2 | space separated list of old column name and new |
| `remove_bad_data` | flag to remove rows from the output with a `<BAD_DATA>` string | 0 | this arg is a flag and takes no args |

First run auditor on the file you want to alter. This will give a csv with the same number of
rows with some cells replaced by control strings.
Note that columns not listed in the column order will not be put into the output file.
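
As a rough illustration of how `column_rename`, `column_add`, `column_order` and `remove_bad_data` interact, here is a minimal Python sketch. It is not auditor's implementation: the function names are hypothetical, added columns are assumed to start out empty, and a row is assumed to count as bad when any cell equals `<BAD_DATA>`.

```python
BAD_DATA = "<BAD_DATA>"  # control string named in the preamble table


def shape_row(row, renames=None, added=(), order=None):
    """Apply rename, add, and order steps to one row (dict of column -> value)."""
    renames = renames or {}
    # column_rename: map old column names to new ones
    shaped = {renames.get(name, name): value for name, value in row.items()}
    # column_add: new columns start out empty (assumption)
    for name in added:
        shaped.setdefault(name, "")
    # column_order: only the listed columns are kept, in the listed order
    if order is not None:
        shaped = {name: shaped[name] for name in order}
    return shaped


def keep_row(row, remove_bad_data=False):
    """Return False for rows that should be dropped from the output."""
    return not (remove_bad_data and BAD_DATA in row.values())


row = {"id": "7", "first name": "Ada", "notes": "n/a"}
shaped = shape_row(
    row,
    renames={"first name": "first"},
    added=["site"],
    order=["first", "id", "site"],
)
# "notes" is not in the order list, so it is left out of the output row.
print(shaped)
```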

Then run auditor with the -c flag on the control string output. This will give a much smaller
csv that only has the rows that you want. No blacklisted items, only whitelisted, no empty data
no bad data.
### col blocks

`$ auditor raw_data.txt auditor.conf.yaml -o data/audited.unclean.csv -v > logs/auditor.unclean.log`
`$ auditor -c data/audited.unclean.csv auditor.conf.yaml -o data/auditor.clean.csv -v > logs/auditor.clean.log`
These describe the sequence of transforms to take place in a run of auditor. Each block should have:
* A newline before and after
* A start line of `col <column_name> <optional_priority>`
* An ending line of `| done`
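
The block layout above can be recognised with a few lines of Python. This sketch is an assumption-laden illustration rather than auditor's compiler: `parse_col_blocks` is a hypothetical name, the priority is assumed to be an integer that defaults to 0 when omitted, and the lines inside a block are kept as opaque text because their syntax is not described here. The transform names in the sample program are placeholders.

```python
import shlex


def parse_col_blocks(program_text):
    """Return a list of (column_name, priority, body_lines) tuples."""
    blocks = []
    current = None
    for raw in program_text.splitlines():
        line = raw.strip()
        if current is None:
            if line.startswith("col "):
                parts = shlex.split(line)  # honours "quoted names"
                name = parts[1]
                # Assumption: integer priority, 0 when omitted.
                priority = int(parts[2]) if len(parts) > 2 else 0
                current = (name, priority, [])
        elif line == "| done":
            blocks.append(current)
            current = None
        elif line:
            current[2].append(line)  # body lines stay uninterpreted
    if current is not None:
        raise ValueError(f"col block for {current[0]!r} is missing '| done'")
    return blocks


# Placeholder transforms; real transform names live in the examples folder.
program = """
col first_name 10
| some_transform
| done

col "date of birth"
| another_transform arg
| done
"""
blocks = parse_col_blocks(program)
print(blocks)
```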

The column name in the first line should be the name after renaming. The priority determines the order in which blocks
are executed: higher-priority col blocks run first. This matters because a col block has access to the rest of the row,
so one column's transforms sometimes need to run before another's. If two col blocks have the same priority, the order
in which they run is undefined.
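
The priority rule amounts to a descending sort. The sketch below assumes each block is represented as a `(column_name, priority)` pair, which is a hypothetical representation chosen for illustration. Python's sort happens to keep equal-priority blocks in their original order, but since auditor leaves that case undefined, programs should not rely on it.

```python
def execution_order(blocks):
    """Order (column_name, priority) pairs so higher priorities run first."""
    return sorted(blocks, key=lambda block: block[1], reverse=True)


blocks = [("city", 1), ("zip_code", 5), ("state", 3)]
ordered = execution_order(blocks)
# zip_code runs first, then state, then city.
print([name for name, _ in ordered])
```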
5 changes: 4 additions & 1 deletion auditor/__init__.py
@@ -1 +1,4 @@
from auditor.mappings import Mappings
from auditor.base_exceptions import *
from auditor.compiler import *
from auditor.interpreter import *
from auditor.transforms import *
175 changes: 16 additions & 159 deletions auditor/__main__.py
@@ -1,175 +1,33 @@
docstr = """
Auditor
Usage: auditor.py [-hcv] (<file> <config>) [-o <output.csv>] [-c --clean] [--verbose]
Usage: auditor (<program_path>)
       auditor migrate ( <old_config> <new_program> )
Options:
-h --help show this message and exit
-v --version Show version
-o <output.csv> --output=<output.csv> optional output file for results
-c --clean remove rows of a csv that have control strings
--verbose print errors with the mappings handler
Instructions:
First run auditor on the file you want to alter. This will give a csv with the same number of
rows with some cells replaced by control strings.
Then run auditor with the -c flag on the control string output. This will give a much smaller
csv that only has the rows that you want. No blacklisted items, only whitelisted, no empty data
no bad data.
$ auditor raw_data.txt auditor.conf.yaml -o data/audited.unclean.csv -v > logs/auditor.unclean.log
$ auditor -c data/audited.unclean.csv auditor.conf.yaml -o data/auditor.clean.csv -v > logs/auditor.clean.log
Auditor is used to run auditor program files to alter csv files
"""
import csv

from docopt import docopt
import yaml

from .mappings import Mappings
from auditor.version import __version__

_file = '<file>'
_config = '<config>'
_output = '--output'
_do_clean = '--clean'
_verbose = '--verbose'

def main(args):
    with open(args[_config], 'r') as config_file:
        global config
        config = yaml.load(config_file.read())

    csv_file = open(args[_file], 'r', encoding=config['csv_encoding'])
    data = csv.reader(csv_file, **config['csv_conf'])

    if not args[_do_clean]:
        data = do_add_headers(data, config.get('new_headers'))
        new_rows = do_audit(data, verbose=args[_verbose])
    else:
        new_rows = do_clean(data)

    if args[_do_clean] and config.get('sort'):
        index = config['headers'].index(config['sort']['header'])
        header = new_rows[0]
        new_rows = new_rows[1:]
        new_rows.sort(key=lambda row : row[index])
        new_rows.insert(0, header)

    if args.get(_output):
        with open(args[_output], 'w') as outfile:
            outfile.write(rows_format(new_rows))
from auditor.compiler import AuditorCompiler
from auditor.interpreter import Interpreter
from auditor.old_config_parser import ConfigParser

def main(args=docopt(docstr)):
    if not args.get('migrate'):
        program_path = args['<program_path>']
        compiler = AuditorCompiler()
        instructions = compiler(program_path)
        interpreter = Interpreter(instructions)
        interpreter()
    else:
        print(rows_format(new_rows))
        parser = ConfigParser(args.get('<old_config>'))
        parser.write(args.get('<new_program>'))

def do_add_headers(data, new_headers):
    new_rows = []
    if not new_headers:
        return data
    for key in new_headers.keys():
        header_data = new_headers[key]
        with open(header_data['lookup_file'], 'r') as lookup_file:
            lookup_data = yaml.load(lookup_file.read())
        for index, row in enumerate(data):
            if index == 0:
                old_headers = row
                row.append(new_headers[key]['name'])
                try:
                    new_rows[index] = row
                except IndexError:
                    new_rows.append(row)
            else:
                lookup_key = row[old_headers.index(header_data['key'])]
                lookup_value = lookup_data.get(lookup_key) or header_data['default'] or ''
                row.append(lookup_value)
                try:
                    new_rows[index] = row
                except IndexError:
                    new_rows.append(row)
    return new_rows

def do_audit(data, verbose):
    new_rows = []
    indices = None
    new_header = None
    mappings = Mappings(config, verbose=verbose)
    for index, row in enumerate(data):
        if index == 0:
            new_header = get_header(row)
            indices = [row.index(header) for header in new_header]
            new_rows.append(new_header)
        else:
            apply_map = get_map(new_header, mappings)
            new_row = get_new_data_row(row, indices, new_header, apply_map)
            if new_row:
                new_rows.append(new_row)
    return new_rows

def get_header(row):
    new_row = []
    for col in row:
        if col in config['headers']:
            new_row.append(col)
    return new_row

def get_map(headers, mappings):
    def apply_map(index, row):
        nonlocal headers
        nonlocal mappings
        cell = row[index]
        for mapping in config['mappings']:
            if headers[index] == mapping['header']:
                for map_index, my_map in enumerate(mapping['maps']):
                    kwargs = {
                        'item': cell,
                        'headers': headers,
                        'header': headers[index],
                        'index': index,
                        'row': row,
                        'map': mapping['maps'][map_index]
                    }
                    cell = mappings.handler(**kwargs)
        return cell
    return apply_map

def get_new_data_row(row, indices, header, apply_map):
    if len(row):
        raw = [row[index] for index in indices]
        mapped = [apply_map(index, raw) for index, cell in enumerate(raw)]
        valid = True
        for cell in mapped:
            if cell == '' or cell == None:
                valid = False
        return mapped if valid else None
    else:
        return None

def do_clean(data):
    new_rows = []
    error_strings = config['error_strings'].values()
    for row in data:
        valid = True
        for cell in row:
            if cell in error_strings:
                valid = False
        for index, cell in enumerate(row):
            if cell == config['control_strings']['empty_okay']:
                row[index] = ''
        if valid:
            new_rows.append(row)
    return new_rows

def rows_format(rows):
    quotechar = config['quotechar_write']
    delim = config['csv_conf']['delimiter']
    text = None
    for row in rows:
        srow = [quotechar + str(cell) + quotechar for cell in row]
        line = delim.join(srow)
        if not text:
            text = line
        else:
            text = '\n'.join([text, line])
    return text + '\n'

def cli_run():
    args = docopt(docstr, version=__version__)
@@ -178,4 +36,3 @@ def cli_run():
if __name__ == '__main__':
    cli_run()
    exit()
