Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more

@BramVanroy BramVanroy released this 11 May 17:36
· 112 commits to master since this release

Fully reworked version!

  • Tested support for both spacy-stanza and spacy-udpipe! (Neither is included as a dependency; install them manually.)
  • Added a useful utility function init_parser that can easily initialise a parser together with the custom
    pipeline component. (See the README or examples)
  • Added the disable_pandas flag to the formatter class in case you want to disable setting the pandas
    attribute even when pandas is installed.
  • Added custom properties for Tokens as well, so a Doc, its sentence Spans, and its Tokens now all have custom attributes.
  • Reworked datatypes of output. In version 2.0.0 the data types are as follows:
    • ._.conll: raw CoNLL format
      • in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
        values.
      • in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
      • in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
    • ._.conll_str: string representation of the CoNLL format
      • in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
      • in sentence Span: the expected CoNLL format where each row represents a token. When
        ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the
        CoNLL format.
      • in Doc: all its sentences' ._.conll_str combined and separated by new lines.
    • ._.conll_pd: pandas representation of the CoNLL format
      • in Token: a Series representation of this token's CoNLL properties.
      • in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column
        headers.
      • in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose
        index is reset.
  • field_names has been removed, on the assumption that you do not need to change the column names of the CoNLL properties.
  • Removed the Spacy2ConllParser class
  • Many doc changes, added tests, and a few examples
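
To make the nesting of the reworked data types concrete, here is a minimal plain-Python sketch that mimics the shapes described above without importing spaCy or this package. It is an illustration only: the field names are an assumption modelled on the ten CoNLL-U columns, the helper functions are hypothetical, the sample tokens are hand-written rather than parser output, and the blank line between sentences is an assumed reading of "separated by new lines".

```python
# Assumed field names, modelled on the ten CoNLL-U columns.
# The keys actually used by the package may differ.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def make_token(idx, form):
    """Hypothetical helper: a token-level dict with one key per field.

    Fields we do not fill in by hand get the CoNLL placeholder "_".
    """
    token = {name: "_" for name in FIELDS}
    token["id"] = idx
    token["form"] = form
    return token


def token_conll_str(token):
    """Token level: tab-separated field values ending with a newline."""
    return "\t".join(str(token[name]) for name in FIELDS) + "\n"


def sentence_conll_str(sentence):
    """Sentence level: one row per token."""
    return "".join(token_conll_str(token) for token in sentence)


def doc_conll_str(doc):
    """Doc level: sentence blocks joined by a newline (assumed separator)."""
    return "\n".join(sentence_conll_str(sentence) for sentence in doc)


# Token level: a dictionary.
hello = make_token(1, "Hello")

# Sentence level: a list of token dictionaries.
first_sentence = [hello, make_token(2, "world")]
second_sentence = [make_token(1, "Goodbye")]

# Doc level: a list of sentence lists (list of list of dictionaries).
doc = [first_sentence, second_sentence]

print(doc_conll_str(doc))
```

The same three levels (token, sentence, document) apply to each of the custom attributes listed above; only the container type changes (dictionary, string, or pandas object).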