cl-conllu

API Reference

Common Lisp corpus conllu utilities

CL-CONLLU

  • Class TOKEN
    ID   Accessor: TOKEN-ID
    FORM   Accessor: TOKEN-FORM
    LEMMA   Accessor: TOKEN-LEMMA
    UPOSTAG   Accessor: TOKEN-UPOSTAG
    XPOSTAG   Accessor: TOKEN-XPOSTAG
    FEATS   Accessor: TOKEN-FEATS
    HEAD   Accessor: TOKEN-HEAD
    DEPREL   Accessor: TOKEN-DEPREL
    DEPS   Accessor: TOKEN-DEPS
    MISC   Accessor: TOKEN-MISC
    SENTENCE   Accessor: TOKEN-SENTENCE
  • Class MTOKEN
    START   Accessor: MTOKEN-START
    END   Accessor: MTOKEN-END
    FORM   Accessor: MTOKEN-FORM
    MISC   Accessor: MTOKEN-MISC
  • Class SENTENCE
    START   Accessor: SENTENCE-START
    META   Accessor: SENTENCE-META
    TOKENS   Accessor: SENTENCE-TOKENS
    MTOKENS   Accessor: SENTENCE-MTOKENS
  • Function SENTENCE-BINARY-TREE (sentence)
    Produces a tree view of the sentence, based on the idea from [1]. The priorities of children still need to be improved. Code at https://github.com/sivareddyg/UDepLambda in file src/deplambda/parser/TreeTransformer.java, method 'binarizeTree'.
    [1] Siva Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, and M. Lapata, Transforming Dependency Structures to Logical Forms for Semantic Parsing, Transactions of the Association for Computational Linguistics, pp. 127-140, Apr. 2016.
  • Function SENTENCE-HASH-TABLE (sentence)
  • Function SENTENCE-META-VALUE (sentence meta-field)
  • Function SENTENCE-ID (sentence)
  • Function SENTENCE-TEXT (sentence)
  • Function SENTENCE->TEXT (sentence &key (ignore-mtokens nil) (special-format-test #'null special-format-test-supplied-p) (special-format-function #'identity special-format-function-supplied-p))
    Receives SENTENCE, a sentence object, and returns a string reconstructed from its tokens and mtokens. If IGNORE-MTOKENS is non-nil, the tokens' forms are used. Otherwise, tokens whose id is contained in an mtoken are not used, and the mtoken's form is used instead.
    Some tokens can be given special formatting. To do so, pass both SPECIAL-FORMAT-TEST and SPECIAL-FORMAT-FUNCTION. Then, for each object (token or mtoken) for which SPECIAL-FORMAT-TEST returns a non-nil result, its form is modified by SPECIAL-FORMAT-FUNCTION in the final string.
    Example:
      (sentence-tokens *sentence*)
      => (#<TOKEN The/DET #1-det-3> #<TOKEN US/PROPN #2-compound-3>
          #<TOKEN troops/NOUN #3-nsubj-4> #<TOKEN fired/VERB #4-root-0>
          #<TOKEN into/ADP #5-case-8> #<TOKEN the/DET #6-det-8>
          #<TOKEN hostile/ADJ #7-amod-8> #<TOKEN crowd/NOUN #8-obl-4>
          #<TOKEN ,/PUNCT #9-punct-4> #<TOKEN killing/VERB #10-advcl-4>
          #<TOKEN 4/NUM #11-obj-10> #<TOKEN ./PUNCT #12-punct-4>)
      (sentence->text *sentence*)
      => "The US troops fired into the hostile crowd, killing 4."
      ;; EQUAL (not EQ) is needed to compare the tag strings by content.
      (sentence->text *sentence*
                      :special-format-test
                      #'(lambda (token) (equal (token-upostag token) "VERB"))
                      :special-format-function
                      #'(lambda (string) (format nil "*~a*" (string-upcase string))))
      => "The US troops *FIRED* into the hostile crowd, *KILLING* 4."
  • Function SENTENCE-VALID? (sentence)
  • Function SENTENCE-SIZE (sentence)
  • Function ADJUST-SENTENCE (sentence)
    Receives a sentence and renumbers the ID and HEAD values of each token so that they follow the tokens' order (as in sentence-tokens).
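    The renumbering described above can be pictured with a minimal sketch on plain lists rather than TOKEN objects. It assumes each token is a hypothetical (id head form) list and that head 0 marks the root; RENUMBER-TOKENS is an illustrative name, not part of cl-conllu, and the library's own implementation may differ.

```lisp
(defun renumber-tokens (tokens)
  "TOKENS is a list of (id head form) lists in the desired final order.
Returns fresh lists with ids 1..n and heads remapped; head 0 (root) is kept.
A head that points at an id missing from TOKENS becomes NIL in this sketch."
  (let ((new-id (make-hash-table)))
    ;; First pass: map each old id to its new position.
    (loop for token in tokens
          for position from 1
          do (setf (gethash (first token) new-id) position))
    ;; Second pass: rewrite ids and heads through the mapping.
    (loop for (old-id head form) in tokens
          collect (list (gethash old-id new-id)
                        (if (eql head 0) 0 (gethash head new-id))
                        form))))
```

    For instance, after token 2 has been dropped from a sentence, the remaining ids 1, 3, 4 would be rewritten to 1, 2, 3 and every head adjusted accordingly.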
  • Function SIMPLE-DEPREL (deprel)
  • Function SENTENCE-EQUAL (sent-1 sent-2)
    Tests if, for each slot, sent-1 has the same values as sent-2. For tokens and multiword tokens, it uses token-equal and mtoken-equal, respectively.
  • Function MAKE-SENTENCE (lineno lines fn-meta)
  • Function READ-CONLLU (input &key (fn-meta #'collect-meta))
  • Function READ-DIRECTORY (path &key (fn-meta #'collect-meta))
  • Function READ-FILE (path &key (fn-meta #'collect-meta))
  • Function READ-STREAM (stream &key (fn-meta #'collect-meta))
  • Function WRITE-CONLLU-TO-STREAM (sentences &optional (out *standard-output*))
  • Function WRITE-CONLLU (sentences filename &key (if-exists :supersede))
  • Function QUERY (query sentences)
  • Function QUERY-AS-JSON (a-query sentences)
  • Function LEVENSHTEIN (s1 s2 &key test)
  • Function DIFF (sentences-a sentences-b &key test key)
  • Function NON-PROJECTIVE? (sentence)
    Checks whether a sentence tree is projective. Intuitively, this means that, keeping word order, no two dependency arcs cross. More formally, let i -> j mean that j's head is node i, and let '->*' be the transitive closure of '->'. A tree is projective when, for all nodes i and j: if i -> j, then for each node k between i and j (i < k < j or j < k < i), i ->* k.
    References:
      - Nivre, Joakim; Inductive Dependency Parsing, 2006
      - https://en.wikipedia.org/wiki/Discontinuity_(linguistics)
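    The formal definition above can be turned into a short check. This is only a sketch under stated assumptions: the sentence is represented as a vector HEADS in which element k - 1 holds the head id of token k and 0 stands for the artificial root, and the tree is assumed to be well formed (no cycles). DOMINATES-P and PROJECTIVE-P are hypothetical names, not cl-conllu functions.

```lisp
(defun dominates-p (heads ancestor node)
  "True if ANCESTOR is reached from NODE by following head links (ANCESTOR ->* NODE).
HEADS is a vector where element (k - 1) is the head id of token k; 0 is the root."
  (loop for current = (aref heads (1- node)) then (aref heads (1- current))
        until (zerop current)
        thereis (= current ancestor)))

(defun projective-p (heads)
  "True if, for every arc head -> dependent, each token strictly between
the two is dominated by the head. Arcs from the artificial root are skipped."
  (loop for dependent from 1 to (length heads)
        for head = (aref heads (1- dependent))
        always (or (zerop head)
                   (loop for k from (1+ (min head dependent))
                           below (max head dependent)
                         always (dominates-p heads head k)))))
```

    With the heads of "The US troops fired into the hostile crowd, killing 4." written as #(3 3 4 0 8 8 8 4 4 4 10 4), the check succeeds; a vector such as #(3 4 0 3), where the arcs 3 -> 1 and 4 -> 2 cross, fails it.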
  • Function CONVERT-RDF (corpusname stream conlls text-fn id-fn)
    Converts the collection of sentences (as generated by READ-CONLLU) in CONLLS, using the function TEXT-FN to extract the text of each sentence and ID-FN to extract the id of each sentence (this is needed because there is no standardized way of knowing it). The generated Turtle file contains a lot of duplication, so when you import it into your triple store, make sure you remove all duplicate triples afterwards.
  • Function CONVERT-RDF-FILE (file-in file-out)
  • Function CONVERT-TO-RDF (sentences &key (text-fn #'sentence-text) (id-fn #'sentence-id) (corpusname "my-corpus") (namespace-string "http://www.example.org/") (stream *standard-output*) (rdf-format :ntriples) (conll-namespace "http://br.ibm.com/conll/"))
    Converts a list of sentences (e.g. as generated by READ-CONLLU) in SENTENCES, using the function TEXT-FN to extract the text of each sentence and ID-FN to extract the id of each sentence (we need this as there is no standardized way of knowing this.) Currently only ntriples is supported as RDF-FORMAT.
  • Function APPLY-RULES-FROM-FILES (conllu-file rules-file new-conllu-file log-file &key recursive)
  • Function APPLY-RULES (sentences rules recursive)

CONLLU.PROLOG

  • Function CONVERT-FILENAME (context filename-in filename-out)

CONLLU.RDF

No exported symbols.

Also exports

  • CL-CONLLU:CONVERT-TO-RDF

CONLLU.CONVERTERS.NICELINE

No exported symbols.

CONLLU.CONVERTERS.TAGS

  • Function WRITE-SENTENCE-TAG-SUFFIX-TO-STREAM (sentence &key (stream *standard-output*) (tag 'upostag) (separator "_"))
    Writes SENTENCE to STREAM, with each token written as FORM.SEPARATOR.TAGVALUE (without the dots), followed by a whitespace character. If TAG is NIL, writes only the FORMs, each followed by a whitespace character.
    Example:
      ;; supposing sentence is already defined
      (write-sentence-tag-suffix-to-stream sentence :tag 'xpostag :separator "_")
      Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
      => NIL
  • Function WRITE-SENTENCES-TAG-SUFFIX-TO-STREAM (sentences &key (stream *standard-output*) (tag 'upostag) (separator "_"))
    See documentation for write-sentence-tag-suffix-to-stream
  • Function WRITE-SENTENCES-TAG-SUFFIX (sentences filename &key (tag 'upostag) (separator "_") (if-exists :supersede))
    See documentation for write-sentence-tag-suffix-to-stream
  • Function READ-SENTENCE-TAG-SUFFIX (stream field separator)
    Reads input from STREAM, formatted as FORM.SEPARATOR.TAGVALUE (without the dots) followed by a whitespace character, into sentence objects.
    Example:
      ;; Consider the file example.txt, with contents:
      ;; Pudim_NOUN é_VERB bom_ADJ ._PUNCT
      ;; E_CONJ torta_NOUN também_ADV ._PUNCT
      (with-open-file (s "./example.txt")
        (write-conllu-to-stream (read-sentence-tag-suffix s 'upostag "_")))
      1 Pudim _ NOUN _ _ _ _ _ _
      2 é _ VERB _ _ _ _ _ _
      3 bom _ ADJ _ _ _ _ _ _
      4 . _ PUNCT _ _ _ _ _ _

      1 E _ CONJ _ _ _ _ _ _
      2 torta _ NOUN _ _ _ _ _ _
      3 também _ ADV _ _ _ _ _ _
      4 . _ PUNCT _ _ _ _ _ _
  • Function READ-FILE-TAG-SUFFIX (filename &key (tag 'upostag) (separator "_"))

CONLLU.DRAW

  • Function TREE-SENTENCE (sentence &key (stream *standard-output*) show-meta)

CONLLU.EVALUATE

Functions for evaluating datasets and parser outputs in the CoNLL-U format.
  • Function ATTACHMENT-SCORE-BY-SENTENCE (list-sent1 list-sent2 &key (labeled t) (ignore-punct nil) (simple-dep nil))
    Attachment score by sentence (macro-average). The attachment score is the percentage of words that have correct arcs to their heads. The unlabeled attachment score (UAS) considers only which token is the head, while the labeled attachment score (LAS) considers both the head and the arc label (dependency label / syntactic class). To choose between labeled and unlabeled, set the keyword argument LABELED.
    References:
      - Kübler, McDonald, and Nivre, Dependency Parsing (pp. 79-80)
  • Function ATTACHMENT-SCORE-BY-WORD (list-sent1 list-sent2 &key (labeled t) (ignore-punct nil) (simple-dep nil))
    Attachment score by word (micro-average). The attachment score is the percentage of words that have correct arcs to their heads. The unlabeled attachment score (UAS) considers only which token is the head, while the labeled attachment score (LAS) considers both the head and the arc label (dependency label / syntactic class). To choose between labeled and unlabeled, set the keyword argument LABELED.
    References:
      - Kübler, McDonald, and Nivre, Dependency Parsing (pp. 79-80)
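    The difference between the two averages can be illustrated with a small sketch. It does not use cl-conllu: each sentence is assumed to be a list of (head . deprel) conses, one per token, and the function names are hypothetical. The sketch returns exact fractions between 0 and 1; how the library scales its result is not stated in this reference.

```lisp
(defun count-correct (predicted gold labeled)
  "Counts tokens of PREDICTED whose head (and deprel, if LABELED) match GOLD.
Both arguments are equal-length lists of (head . deprel) conses."
  (loop for (p-head . p-deprel) in predicted
        for (g-head . g-deprel) in gold
        count (and (eql p-head g-head)
                   (or (not labeled) (equal p-deprel g-deprel)))))

(defun score-by-word (predicted-sentences gold-sentences &key (labeled t))
  "Micro-average: correct tokens divided by all tokens in the corpus."
  (/ (loop for p in predicted-sentences
           for g in gold-sentences
           sum (count-correct p g labeled))
     (loop for g in gold-sentences sum (length g))))

(defun score-by-sentence (predicted-sentences gold-sentences &key (labeled t))
  "Macro-average: mean of the per-sentence scores."
  (/ (loop for p in predicted-sentences
           for g in gold-sentences
           sum (/ (count-correct p g labeled) (length g)))
     (length gold-sentences)))
```

    With a fully correct two-token sentence and a four-token sentence that has two labeled errors, the micro-average is 4/6 while the macro-average is (1 + 1/2) / 2 = 3/4: short sentences weigh more in the per-sentence score.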
  • Function RECALL (list-sent1 list-sent2 deprel &key (error-type '(deprel)) (simple-dep nil))
    Restricted to words which are originally of syntactic class (dependency type to head) DEPREL, returns the recall: the number of true positives divided by the number of words originally positive (that is, originally of class DEPREL). We assume that LIST-SENT1 is the classified result and LIST-SENT2 is the list of golden (correct) sentences. ERROR-TYPE defines what is considered an error (a false negative). Some usual values are:
      - '(deprel) :: for the deprel tagging task only
      - '(head) :: for considering errors for each syntactic class
      - '(deprel head) :: for considering correct only when both deprel and head are correct.
  • Function PRECISION (list-sent1 list-sent2 deprel &key (error-type '(deprel)) (simple-dep nil))
    Restricted to words which are classified as of syntactic class (dependency type to head) DEPREL, returns the precision: the number of true positives divided by the number of words predicted positive (that is, predicted as of class DEPREL). We assume that LIST-SENT1 is the classified (predicted) result and LIST-SENT2 is the list of golden (correct) sentences. ERROR-TYPE defines what is considered an error (a false positive). Some usual values are:
      - '(deprel) :: for the deprel tagging task only
      - '(head) :: for considering errors for each syntactic class
      - '(deprel head) :: for considering correct only when both deprel and head are correct.
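    The two ratios above can be sketched for the simplest case, the deprel tagging task, where only the label is compared. The sketch assumes the predicted and gold annotations are flattened into equal-length lists of deprel strings; DEPREL-PRECISION-RECALL is a hypothetical helper and does not model ERROR-TYPE or SIMPLE-DEP.

```lisp
(defun deprel-precision-recall (predicted gold deprel)
  "PREDICTED and GOLD are equal-length lists of deprel strings, one per token.
Returns precision and recall for DEPREL as two values (NIL when undefined)."
  (let ((true-positives 0)
        (predicted-positives 0)
        (gold-positives 0))
    (loop for p in predicted
          for g in gold
          do (when (string= p deprel) (incf predicted-positives))
             (when (string= g deprel) (incf gold-positives))
             (when (and (string= p deprel) (string= g deprel))
               (incf true-positives)))
    (values (unless (zerop predicted-positives)
              (/ true-positives predicted-positives))
            (unless (zerop gold-positives)
              (/ true-positives gold-positives)))))
```

    Precision divides by the number of tokens predicted as DEPREL, recall by the number of tokens that carry DEPREL in the gold data, so the two can differ even though they share the same numerator.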
  • Function NON-PROJECTIVITY-ACCURACY (list-sent1 list-sent2)
  • Function NON-PROJECTIVITY-PRECISION (list-sent1 list-sent2)
  • Function NON-PROJECTIVITY-RECALL (list-sent1 list-sent2)
  • Function EXACT-MATCH (list-sent1 list-sent2 &key (compared-fields '(upostag feats head deprel)) (identity-fields '(id form)) (test #'equal) (simple-dep nil) (ignore-punct nil))
    Returns the list of sentences of LIST-SENT1 that are an exact match to the corresponding sentence of LIST-SENT2 (same position in list). LIST-SENT1 and LIST-SENT2 must have the same size with corresponding sentences in order.
  • Function EXACT-MATCH-SCORE (list-sent1 list-sent2 &key (compared-fields '(upostag feats head deprel)) (identity-fields '(id form)) (test #'equal) (simple-dep nil) (ignore-punct nil))
    Returns the percentage of sentences of LIST-SENT1 that are an exact match to LIST-SENT2. LIST-SENT1 and LIST-SENT2 must have the same size with corresponding sentences in order. The typical use case is comparing the result of a tagger (or parser) against a test set, where an exact match is a completely correct tagging (or parse) for the sentence.
    References:
      - Kübler, McDonald, and Nivre, Dependency Parsing (p. 79)
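    As a rough illustration of field-restricted exact matching, the sketch below compares sentences only on a chosen set of fields. It assumes each token is a property list such as (:form "dog" :upostag "NOUN" :head 2) and each sentence a list of such tokens; all names are hypothetical and the result is a fraction rather than a scaled percentage.

```lisp
(defun tokens-match-p (token-a token-b fields &key (test #'equal))
  "True if TOKEN-A and TOKEN-B (property lists) agree on every field in FIELDS."
  (every (lambda (field)
           (funcall test (getf token-a field) (getf token-b field)))
         fields))

(defun sentences-match-p (sentence-a sentence-b fields)
  "True if both sentences have the same length and all tokens match pairwise."
  (and (= (length sentence-a) (length sentence-b))
       (every (lambda (a b) (tokens-match-p a b fields))
              sentence-a sentence-b)))

(defun exact-match-fraction (list-a list-b
                             &key (fields '(:upostag :head :deprel)))
  "Fraction of sentences in LIST-A that match the sentence at the same
position in LIST-B on FIELDS."
  (/ (loop for a in list-a
           for b in list-b
           count (sentences-match-p a b fields))
     (length list-a)))
```

    Fields outside FIELDS are ignored, so two sentences that differ only in, say, :misc still count as a match.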
  • Class CONFUSION-MATRIX
    CORPUS-ID   Accessor: CONFUSION-MATRIX-CORPUS-ID
    Identifier of the corpus or experiment.
    KEY-FN   Accessor: CONFUSION-MATRIX-KEY-FN
    Function used to label each token.
    TEST-FN   Accessor: CONFUSION-MATRIX-TEST-FN
    Function which compares two labels. Typically a form of equality.
    SORT-FN   Accessor: CONFUSION-MATRIX-SORT-FN
    Function which sorts labels. By default, converts labels to strings and uses lexicographical order.
    ROWS   Accessor: CONFUSION-MATRIX-ROWS
    Parameter which contains the contents of the confusion matrix.
  • Function CONFUSION-MATRIX-ROWS-LABELS (cm)
    Returns the list of labels occurring in the rows of the confusion matrix CM.
  • Function CONFUSION-MATRIX-COLUMNS-LABELS (cm)
  • Function CONFUSION-MATRIX-LABELS (cm)
    Returns the list of all labels in the confusion matrix CM.
  • Function CONFUSION-MATRIX-CELLS-LABELS (cm)
    Returns a list of '(LABEL1 LABEL2) for each cell in the confusion matrix CM.
  • Function CONFUSION-MATRIX-CELL-COUNT (label1 label2 cm &key default-if-undefined)
    Returns the number of tokens that are contained in the cell defined by LABEL1 LABEL2 in the confusion matrix CM. If DEFAULT-IF-UNDEFINED, returns 0. Otherwise, raises an error in case there is no such cell.
  • Function CONFUSION-MATRIX-CELL-TOKENS (label1 label2 cm &key default-if-undefined)
    Returns the list of (SENT-ID . TOKEN-ID) of tokens in the cell LABEL1 LABEL2. If DEFAULT-IF-UNDEFINED, returns the empty list. Otherwise, raises an error in case there is no such cell.
  • Function MAKE-CONFUSION-MATRIX (list-sent1 list-sent2 &key corpus-id (key-fn #'token-upostag) (test-fn #'equal) (sort-fn #'(lambda (x y) (string<= (format nil "~a" x) (format nil "~a" y)))))
    Creates a new confusion matrix from the lists of sentences LIST-SENT1 and LIST-SENT2.
  • Function CONFUSION-MATRIX-UPDATE (list-sent1 list-sent2 cm)
    Updates an existing confusion matrix CM with the lists of sentences LIST-SENT1 and LIST-SENT2.
  • Function CONFUSION-MATRIX-NORMALIZE (cm)
    Returns a new CONFUSION-MATRIX with new empty cells for each pair (LABEL1 LABEL2) of labels in (confusion-matrix-labels CM) that are undefined in CM.
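    The cell and normalization behaviour described above can be mimicked with a plain hash table. This is a minimal sketch, not the CONFUSION-MATRIX class itself: it assumes the two annotations are given as equal-length lists of labels (one per token), stores token positions instead of (SENT-ID . TOKEN-ID) pairs, and uses hypothetical function names.

```lisp
(defun build-cells (labels-1 labels-2)
  "LABELS-1 and LABELS-2 are equal-length lists of labels, one per token.
Returns an EQUAL hash table mapping (label1 label2) to a list of token positions."
  (let ((cells (make-hash-table :test #'equal)))
    (loop for l1 in labels-1
          for l2 in labels-2
          for position from 1
          do (push position (gethash (list l1 l2) cells)))
    cells))

(defun cell-count (cells label1 label2)
  "Number of tokens in the cell (LABEL1 LABEL2); 0 if the cell is undefined."
  (length (gethash (list label1 label2) cells)))

(defun normalize-cells (cells)
  "Returns a new table that also holds an empty cell for every pair of
labels seen in CELLS that had no entry."
  (let ((all-labels '())
        (result (make-hash-table :test #'equal)))
    ;; Copy existing cells and collect every label that occurs in a key.
    (maphash (lambda (key tokens)
               (setf (gethash key result) (copy-list tokens))
               (dolist (label key)
                 (pushnew label all-labels :test #'equal)))
             cells)
    ;; Add an empty cell for each missing pair.
    (dolist (l1 all-labels result)
      (dolist (l2 all-labels)
        (unless (nth-value 1 (gethash (list l1 l2) result))
          (setf (gethash (list l1 l2) result) '()))))))
```

    After normalization every pair of labels has a cell, so counts can be read without special-casing undefined cells.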