Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more
Fully reworked version!
- Tested support for both `spacy-stanza` and `spacy-udpipe`! (Not included as a dependency, install manually.)
- Added a useful utility function, `init_parser`, that can easily initialise a parser together with the custom pipeline component. (See the README or examples.)
- Added the `disable_pandas` flag to the formatter class in case you want to disable setting the pandas attribute even when pandas is installed.
- Added custom properties for Tokens as well. So now a Doc, its sentence Spans, and its Tokens all have custom attributes.
- Reworked datatypes of output. In version 2.0.0 the data types are as follows:
  - `._.conll`: raw CoNLL format
    - in `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
    - in sentence `Span`: a list of its tokens' `._.conll` dictionaries (list of dictionaries).
    - in a `Doc`: a list of its sentences' `._.conll` lists (list of lists of dictionaries).
  - `._.conll_str`: string representation of the CoNLL format
    - in `Token`: tab-separated representation of the contents of the CoNLL fields ending with a newline.
    - in sentence `Span`: the expected CoNLL format where each row represents a token. When `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the CoNLL format.
    - in `Doc`: all its sentences' `._.conll_str` combined and separated by new lines.
  - `._.conll_pd`: `pandas` representation of the CoNLL format
    - in `Token`: a `Series` representation of this token's CoNLL properties.
    - in sentence `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column headers.
    - in `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame` whose index is reset.
- `field_names` has been removed, assuming that you do not need to change the column names of the CoNLL properties
- Removed the `Spacy2ConllParser` class
- Many doc changes, added tests, and a few examples
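
To make the reworked data types more concrete, here is a minimal sketch in plain Python that mimics the shapes described above for `._.conll` and `._.conll_str`. It does not use the library itself. The key names are taken from the CoNLL-U specification and may be spelled differently in the actual dictionaries, the helper functions are hypothetical, and the two header lines (`# sent_id` and `# text`) are an assumption based on common CoNLL-U practice.

```python
# CoNLL-U column names as listed in the CoNLL-U specification (an assumption;
# the library's own dictionary keys may be spelled differently).
FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def token_conll(idx, form, lemma, upos, head, deprel):
    """Token level: one dictionary with a key per CoNLL field."""
    values = (idx, form, lemma, upos, "_", "_", head, deprel, "_", "_")
    return dict(zip(FIELDS, values))


def token_conll_str(token):
    """Token level: tab-separated field values, ending with a newline."""
    return "\t".join(str(token[name]) for name in FIELDS) + "\n"


def sentence_conll_str(sentence, sent_id=None, text=None):
    """Sentence level: one row per token, optionally preceded by two header lines."""
    header = ""
    if sent_id is not None and text is not None:
        # Assumed header layout, following common CoNLL-U comment lines.
        header = f"# sent_id = {sent_id}\n# text = {text}\n"
    return header + "".join(token_conll_str(tok) for tok in sentence)


def doc_conll_str(doc):
    """Document level: the sentence strings joined by a newline."""
    return "\n".join(sentence_conll_str(sent) for sent in doc)


# Sentence level: a list of token dictionaries.
sent_1 = [
    token_conll(1, "I", "I", "PRON", 2, "nsubj"),
    token_conll(2, "agree", "agree", "VERB", 0, "root"),
]
sent_2 = [
    token_conll(1, "Fine", "fine", "ADJ", 0, "root"),
]

# Document level: a list of sentences, i.e. a list of lists of dictionaries.
doc = [sent_1, sent_2]

print(doc_conll_str(doc))
```

Because every token row already ends with a newline, joining the sentence strings with one more newline leaves a blank line between sentences, which is how CoNLL-U files separate them.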