Ion Python C Extension

  1. Overall
  2. Motivation
  3. Performance Improvement
  4. Setup
  5. Development
  6. Technical Details
    6.1  Common Binary Encoding Differences between C Extension and Original Ion Python
    6.2  Known Issues
  7. TODO
  8. Deploy
    8.1  Distribution

Overall

The Ion Python C extension uses Ion C to read and write Ion data, closing the performance gap between the Ion Python simpleion module and other Ion implementations.

The C extension currently supports only a limited set of simpleion options; more will be added incrementally. Refer to TODO for details.

Motivation

Python is not a fast language, which makes Ion Python slower than other Ion implementations. Ion Python is also slower than comparable Python data serialization libraries such as simplejson, a JSON encoder and decoder. The main reason for the performance difference between simplejson and the Ion Python simpleion module is that simplejson binds to a C extension, while Ion Python is implemented purely in Python.

There are several technologies we can choose from for binding the C extension to the C binaries (Ion C): CFFI, ctypes, Cython, and the CPython C API.

CFFI and ctypes are slower than the CPython C API and Cython for most of our use cases. Cython is slightly faster than the CPython C API, but it is a compiler for a separate programming language and would require more development time. Whichever tool we use, one of the most challenging issues is how to distribute the Ion C binaries, since they are .dylib files on Mac, .so files on Linux, and .lib files on Windows. In addition, the CPython C extension code for simpleion was almost complete 2 years ago, so we decided to choose this option.

If performance becomes our biggest concern in the future, we should reevaluate this choice to make sure we are keeping up with innovations in the Python C extension ecosystem.

Performance Improvement

The performance improvement depends on many variables (e.g., how the files are structured and which APIs are called the most). Experiment results show an improvement of around 6000% for the text reader/writer and around 1400% for the binary reader/writer.

We use the timeit module to measure the execution time.

import timeit

# Import simpleion once, outside the timed statement.
setup = "from amazon.ion import simpleion"
# Read every top-level value from the file, then serialize the result again.
code = '''
with open("file_name", "rb") as fp:
    simpleion.dumps(simpleion.load(fp, single_value=False))
'''
print(timeit.timeit(setup=setup, stmt=code, number=1))

Experiment Result

test-driver-report.ion (text) and test-driver-report.10n (binary) are reports generated by ion-test-driver; they consist of Ion structs and strings.
log.ion (text) and log.10n (binary) are logs that contain a variety of scalar types, annotations, and nested containers.

Files                            C extension  Ion Python  Improvement
test-driver-report.ion (42MB)    3.8s         217s        5611%
test-driver-report.10n (13.7MB)  3.6s         55s         1428%
log.ion (84MB)                   14.8s        987s        6569%
log.10n (14MB)                   15s          221s        1373%
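
The Improvement column is consistent with the formula (Ion Python time - C extension time) / C extension time, expressed as a percentage. That formula is inferred from the numbers in the table rather than stated by the benchmark, and the helper name `improvement_percent` is hypothetical. A minimal sketch that reproduces the column from the measured times:

```python
def improvement_percent(baseline_seconds, optimized_seconds):
    """Relative speedup of the optimized run over the baseline, in percent.

    Assumed formula (inferred from the table, not stated by the benchmark).
    """
    return (baseline_seconds - optimized_seconds) / optimized_seconds * 100


# (file, C extension seconds, pure-Python seconds), copied from the table.
results = [
    ("test-driver-report.ion", 3.8, 217),
    ("test-driver-report.10n", 3.6, 55),
    ("log.ion", 14.8, 987),
    ("log.10n", 15, 221),
]

for name, c_ext_seconds, pure_python_seconds in results:
    percent = round(improvement_percent(pure_python_seconds, c_ext_seconds))
    print(f"{name}: {percent}%")
```

Read this way, an improvement of 5611% means the C extension run took roughly 1/57 of the pure-Python time.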

Setup

Ensure that cmake is installed. Otherwise, the setup for the Ion Python C extension is the same as the original Ion Python setup. If the C extension runs into any issue during initialization, Ion Python falls back to the regular pure-Python implementation; no extra action is needed.

The C extension is built under ion-python/amazon/ion and named according to the following format, which may differ slightly depending on your platform: ionc.cpython-$py_version-$platform.$suffix (e.g., ionc.cpython-39-darwin.so).
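
The platform-dependent part of that file name comes from the interpreter. As a small sketch, the standard library's sysconfig module reports the suffix your Python uses for compiled extension modules; the `ionc` base name is taken from the format above, and nothing else here is specific to Ion:

```python
import sysconfig

# EXT_SUFFIX is the file-name suffix this interpreter uses for compiled
# extension modules (for example ".cpython-39-darwin.so" on macOS with 3.9).
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# "ionc" is the base name quoted in the text; the suffix varies by platform.
expected_name = "ionc" + ext_suffix
print(expected_name)
```

Comparing the printed name with the files under ion-python/amazon/ion is a quick way to tell whether a build produced an extension for the interpreter you are running.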

Getting Started with C Extension:

>>> import amazon.ion.simpleion as ion
>>> obj = ion.loads('{abc: 123}')
>>> obj['abc']
123
>>> ion.dumps(obj, binary=True)
b'\xe0\x01\x00\xea\xe9\x81\x83\xd6\x87\xb4\x83abc\xd3\x8a!{'

Development

The Ion Python C extension is built as part of the PEP 517 build process using py-build-cmake, leaning on Ion C's existing CMake build. A revision of Ion C is included as a submodule in this repo under src/ion-c. If you would like to update the version of Ion C, simply update the submodule to point to the desired revision.

The file src/CMakeLists.txt acts as the build script for the C extension itself, which then includes the Ion C codebase into the build tree.

With the extension built, it will be exposed to python as amazon._ioncmodule. For example:

>>> import amazon._ioncmodule as ionc
>>> ionc.ionc_version()
'v1.1.3 (rev: d61c09a)'
>>>

The amazon.ion.simpleion module then uses this extension, when it is available, to provide more efficient Ion reading and writing. You can check availability by importing amazon._ioncmodule directly; simpleion also exposes the field __IS_C_EXTENSION_SUPPORTED for the same purpose.

>>> import amazon.ion.simpleion as ion
>>> ion.__IS_C_EXTENSION_SUPPORTED
True
>>>
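
The "use the extension when it is available" behaviour follows a common optional-import pattern. The sketch below illustrates that general pattern and is not necessarily how simpleion implements its fallback; `load_optional_module` is a hypothetical helper, while `amazon._ioncmodule` and `ionc_version()` are the names shown above:

```python
import importlib


def load_optional_module(name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Covers both a missing package and an extension that fails to load.
        return None


ionc = load_optional_module("amazon._ioncmodule")
if ionc is not None:
    print("C extension available:", ionc.ionc_version())
else:
    print("C extension not available; pure-Python code paths would be used")
```

This runs whether or not the package is installed, which makes it convenient in diagnostics or CI checks.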

In order to build the extension, along with the package itself, we can use python's build module:

ion-python# python -m build .
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - py-build-cmake~=0.1.8
* Getting build dependencies for sdist...
* Building sdist...
* Building wheel from sdist
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - py-build-cmake~=0.1.8
* Getting build dependencies for wheel...
* Building wheel...
...
Successfully built amazon_ion-0.13.0.tar.gz and amazon_ion-0.13.0-cp310-cp310-linux_x86_64.whl

This builds both the source distribution (sdist) and the binary wheel for the current system. The package can then be installed with pip. Depending on what you are doing with the package, you may want to install different dependencies; several sets of optional dependencies are provided, such as test and benchmarking. More details can be found in pyproject.toml.

To install the package and dependencies for unit tests you can run:

ion-python# python -m pip install '.[test]'
Processing /ion-python
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: amazon_ion
  Building wheel for amazon_ion (pyproject.toml) ... done
  Created wheel for amazon_ion: filename=amazon_ion-0.13.0-cp310-cp310-linux_x86_64.whl size=573770 sha256=96ef01efea7519a1a38d9fc9e42139ab82125aafa328cfb4a4823458de802fa1
  Stored in directory: /root/.cache/pip/wheels/78/55/f9/c6d69051d6a93c725251429f62d0d5b7d16d8c982a3772d666
Successfully built amazon_ion
Installing collected packages: amazon_ion
  Attempting uninstall: amazon_ion
    Found existing installation: amazon_ion 0.13.0
    Uninstalling amazon_ion-0.13.0:
      Successfully uninstalled amazon_ion-0.13.0
Successfully installed amazon_ion-0.13.0

Installing with -e (an editable install) also lets you update the Python side of the package without having to reinstall.

Technical Details

1. Common Binary Encoding Differences between C Extension and Original Ion Python

Note that both binary encodings are equivalent; one encoding is not more "correct" than the other.

1.1 Different ways to represent a struct's length. Refer to Amazon Ion Binary Encoding for details.

For Ion struct {a:2}:

Text        IVM               $ion_symbol_table::  {         symbols:["a"]}  {         "a": 2 }
Ion C       \xe0\x01\x00\xea  \xe7\x81\x83         \xd4      \x87\xb2\x81a   \xd3      \x8a\x21\x02
Ion Python  \xe0\x01\x00\xea  \xe8\x81\x83         \xde\x84  \x87\xb2\x81a   \xde\x83  \x8a\x21\x02
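
To see why \xd3 and \xde\x83 describe the same struct, it helps to decode the type descriptor by hand. In Ion 1.0 binary, the high nibble of the descriptor byte is the type code (13 for struct) and the low nibble is the length; a low nibble of 14 means the length follows as a VarUInt. The sketch below is a deliberately simplified decoder written for this illustration (the function names are hypothetical, and it ignores special cases such as null values and the struct length code 1):

```python
def read_var_uint(data, pos):
    """Decode a VarUInt: 7 bits per byte, the last byte has its high bit set."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return value, pos


def read_type_descriptor(data, pos=0):
    """Return (type_code, length, position_of_first_content_byte)."""
    type_code = data[pos] >> 4
    length = data[pos] & 0x0F
    pos += 1
    if length == 14:
        # Length did not fit (or was not put) in the nibble; read a VarUInt.
        length, pos = read_var_uint(data, pos)
    return type_code, length, pos


ion_c_struct = b"\xd3\x8a\x21\x02"           # length stored in the nibble
ion_python_struct = b"\xde\x83\x8a\x21\x02"  # length stored as a VarUInt

print(read_type_descriptor(ion_c_struct))
print(read_type_descriptor(ion_python_struct))
```

Both calls report type code 13 and length 3; only the position of the first content byte differs, because the Ion Python encoding spends one extra byte on the length.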

1.2 Different order of symbols within a symbol table.

For symbol abc with two annotations annot1 and annot2, annot1::annot2::abc:

Ion C text         $ion_symbol_table::  {         symbols:[     "abc", "annot1", "annot2"]}            annot1($11)::  annot2($12)::  abc($10)
Ion C binary       \xee\x99\x81\x83     \xde\x95  \x87\xbe\x92  \x83abc\x86annot1\x86annot2  \xe5\x82  \x8b           \x8c           \x71\x0a
Ion Python text    $ion_symbol_table::  {         symbols:[     "annot1", "annot2", "abc"]}            annot1($10)::  annot2($11)::  abc($12)
Ion Python binary  \xee\x99\x81\x83     \xde\x95  \x87\xbe\x92  \x86annot1\x86annot2\x83abc  \xe5\x82  \x8a           \x8b           \x71\x0c
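
The symbol IDs differ only because they follow the order of the symbols list. The Ion 1.0 system symbol table occupies $1 through $9, so with no imports the first local symbol is $10 and the rest follow in declaration order. A minimal sketch of that assignment (`assign_symbol_ids` is a hypothetical helper, not part of either implementation):

```python
FIRST_LOCAL_SID = 10  # Ion 1.0 system symbols occupy $1..$9


def assign_symbol_ids(symbols):
    """Map each symbol text to its ID, following declaration order."""
    return {text: FIRST_LOCAL_SID + index for index, text in enumerate(symbols)}


ion_c_order = ["abc", "annot1", "annot2"]
ion_python_order = ["annot1", "annot2", "abc"]

print(assign_symbol_ids(ion_c_order))
print(assign_symbol_ids(ion_python_order))
```

Either mapping decodes back to annot1::annot2::abc, which is why the two encodings are equivalent.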

2. Known Issues

  1. The C extension supports a timestamp precision of at most 9. Refer to amazon-ion/ion-python#160 for details.
  2. The C extension supports at most 34 decimal digits. Refer to amazon-ion/ion-python#159 for details.
  3. The C extension has a limitation when reading large Clob data. Refer to amazon-ion/ion-python#207 for details.
  4. To report a memory leak, please comment on amazon-ion/ion-python#155.
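
If your data may contain very precise decimals, you can check values before handing them to the C extension. The sketch below assumes the 34-digit limit refers to the number of digits in the decimal's coefficient; that interpretation, the constant, and the helper names are assumptions made for illustration, so consult the linked issue for the exact behaviour:

```python
from decimal import Decimal

# Limit quoted in item 2; assumed here to count coefficient digits.
MAX_C_EXTENSION_DECIMAL_DIGITS = 34


def decimal_digit_count(value):
    """Number of digits in the coefficient of a Decimal."""
    return len(value.as_tuple().digits)


def fits_c_extension_decimal_limit(value):
    """True if the value stays within the assumed 34-digit limit."""
    return decimal_digit_count(value) <= MAX_C_EXTENSION_DECIMAL_DIGITS


print(fits_c_extension_decimal_limit(Decimal("123.456")))
print(fits_c_extension_decimal_limit(Decimal("1" * 35)))
```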

TODO

  1. More bug fixing.
  2. More performance improvement.
  3. Support more simpleion options such as imports, catalog, and omit_version_marker. (Ion Python currently falls back to the pure-Python implementation to handle unsupported options.)
  4. Support pretty print.

Deploy

1. Distribution

PyPI supports two kinds of distribution: source distributions and wheel distributions. We support both.

Note that the benefits of wheel distribution are:

  1. Pre-compiling Ion C library avoids potential build/compile issues and does not require a C compiler to be present on the user's machine.
  2. Installation of wheels is faster and more efficient.
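
A wheel's file name records which interpreter and platform it was compiled for, in the form name-version(-build)-python-abi-platform.whl. As a small sketch, the hypothetical helper below splits the wheel name from the build output in the Development section into those tags using only string operations:

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


tags = parse_wheel_filename("amazon_ion-0.13.0-cp310-cp310-linux_x86_64.whl")
print(tags)
```

Here the python and abi tags are both cp310 and the platform tag is linux_x86_64, so this wheel carries a binary built for CPython 3.10 on 64-bit Linux; other interpreters or platforms need their own wheel or the source distribution.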