Release Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more · BramVanroy/spacy_conll

Fully reworked version!

Tested support for both spacy-stanza and spacy-udpipe! (Not included as a dependency, install manually)
Added a useful utility function init_parser that can easily initialise a parser together with the custom
pipeline component. (See the README or examples)
Added the disable_pandas flag to the formatter class in case you want to disable setting the pandas
attribute even when pandas is installed.
Added custom properties for Tokens as well, so a Doc, its sentence Spans, and its Tokens now all have custom attributes.
Reworked datatypes of output. In version 2.0.0 the data types are as follows:
- ._.conll: raw CoNLL format
  - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
    values.
  - in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
  - in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
- ._.conll_str: string representation of the CoNLL format
  - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
  - in sentence Span: the expected CoNLL format where each row represents a token. When
    ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the
    CoNLL format.
  - in Doc: all its sentences' ._.conll_str combined and separated by new lines.
- ._.conll_pd: pandas representation of the CoNLL format
  - in Token: a Series representation of this token's CoNLL properties.
  - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column
    headers.
  - in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose
    index is reset.
field_names has been removed, on the assumption that you do not need to change the column names of the CoNLL properties.
Removed the Spacy2ConllParser class
Many doc changes, added tests, and a few examples
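
To make the nesting of the reworked data types concrete, here is a minimal sketch in plain Python that mimics the shapes described above for ._.conll and ._.conll_str. It does not call spacy_conll at all: the helper names (make_token, token_conll_str, sentence_conll_str, doc_conll_str), the upper-case CoNLL-U field names used as dictionary keys, and the exact wording of the two header lines are assumptions made for illustration, and the library's real keys and headers may differ.

```python
# Illustration only: plain-Python stand-ins for the shapes described in the notes.
# Field names follow the CoNLL-U column names; the library's actual keys may differ.
CONLL_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def make_token(*values):
    """Token level: a dictionary mapping each CoNLL field to its value."""
    if len(values) != len(CONLL_FIELDS):
        raise ValueError(f"expected {len(CONLL_FIELDS)} values, got {len(values)}")
    return dict(zip(CONLL_FIELDS, values))


def token_conll_str(token):
    """Token level: tab-separated field values ending with a newline."""
    return "\t".join(str(token[field]) for field in CONLL_FIELDS) + "\n"


def sentence_conll_str(sentence, sent_id=None, text=None):
    """Sentence level: one row per token, optionally preceded by two header lines.

    The header wording below is an assumption based on common CoNLL-U comments.
    """
    header = ""
    if sent_id is not None and text is not None:
        header = f"# sent_id = {sent_id}\n# text = {text}\n"
    return header + "".join(token_conll_str(token) for token in sentence)


def doc_conll_str(doc):
    """Doc level: the sentence strings combined and separated by new lines."""
    return "\n".join(sentence_conll_str(sentence) for sentence in doc)


# Sentence level: a list of token dictionaries.
sent1 = [
    make_token(1, "Hello", "hello", "INTJ", "UH", "_", 0, "root", "_", "_"),
    make_token(2, "!", "!", "PUNCT", ".", "_", 1, "punct", "_", "_"),
]
sent2 = [make_token(1, "Bye", "bye", "INTJ", "UH", "_", 0, "root", "_", "_")]

# Doc level: a list of sentences, i.e. a list of lists of dictionaries.
doc = [sent1, sent2]

print(sentence_conll_str(sent1, sent_id=1, text="Hello!"))
print(doc_conll_str(doc))
```

Because every token row already ends with a newline, joining the sentence strings with one more newline leaves a blank line between sentences, which is the usual sentence separator in CoNLL-U files.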
