csv-validator

2023-06-18

Validates tabular CSV data using predefined validations, similar to its Python counterpart 'Great Expectations'.

Upstream URL

github.com/KoenvdBerg/csv-validator

Author

Koen van den Berg <k.vandenberg@insertdata.nl>

License

BSD-3, BSD
README
csv-validator ============= .. image:: images/example.gif :width: 925 :alt: Alternative text This project outlines an attempt to define a way of thinking about validating data in tabular format (CSV) inspired from its Python homologue Great Expectations. It's been optimized for performance and flexibility. It works by defining a set of validations that can be repeatedly applied to a tabular dataset in CSV format. This is especially useful in a datapipeline scenario where similar tabular (CSV) data is collected from different sources, and you like to validate the data before it enters downstream processing. The validations are stored in a validation suite that is written entirely in Common Lisp syntax. **Features:** - **Performant**: single core validation speed is ~10MB/s and multicore validation speed is ~30MB/s. Please see `Benchmark`_. - **Extensible**: validations are written in Common Lisp and stored in validation suites. New validations can be easily added. - **No-nonsense output**: output is a CSV file that contains location of values that did not pass the validation, including the erronuous value. - **Well tested**: each validation is unit-tested and compiles are tested to work on SBCL, ECL, CLISP, CCL and ABCL on Linux OS. .. contents:: **Table of Contents** Installation --------- This package has been added to quicklisp. You can install it as follows: 1. Open the LISP REPL and run:: (ql:quickload 'csv-validator) If the package doesn't seem available, please update your quicklisp distribution to the latest version with: ``(ql:update-dist "quicklisp")``. 2. Test if installation was succesfull by running:: (csv-validator:check-not-null "test-value") If this returns ``T`` the installation was successful. Validating a simple CSV --------- In the ``data/`` folder the file ``energy_sample.csv`` is located. This small csv file is used to illustrate how the csv-validator works. Let's start with defining a small validation suite and using it to perform a validation on the ``energy_sample.csv`` data. Open a new common lisp script and make sure that the csv-validator is correctly installed. Then define and run:: (defparameter *test_suite* (list (list :column "ID" ;The column to be validated :depends (list "ID") ;The value to use in the logic (see below) :label "max-5-chars" ;The text to include in output in case of failed validation :logic (lambda (ID) ;The logic for this validation. (<= (length ID) 5))))) ;The value in the ID column should have 5 or less characters. (csv-validator:validate-csv "/path/to/data/energy_sample.csv" "/path/to/output/folder/" *test_suite* :delim #\; :threads 1) Make sure that you include the correct path to the ``energy_sample.csv`` input data and that the output folder that you're writing to exists. Now open the file named ``csv-validator_validations.csv`` in the output folder and explore the result:: index;column;erronuous_value;label 11;ID;999999;max-5-chars As you can see the result itself is a CSV file (';' as delimiter). It correctly points out that the value in the ``ID`` column at index 11 is longer than 5 characters. Header validations --------- Header validations are checks if the header row of the CSV file is as expected. This is especially usefull for data pipelines where the presence of certain headers in the CSV file is expected. Header validation will be performed for every validation in your validation suite. Example:: (defparameter *test_suite* (list (list :column "ID" :depends (list "ID") :label "max-5-chars" :logic (lambda (ID) (<= (length ID) 5))) (list :column "not_existing_column" :depends (list "not_existing_column") :label "max-10-chars" :logic (lambda (x) (<= (length x) 10))))) Here a second validation is added to the validation suite that describes a column that doesn't exist in the data. This will give the following result after running the validation (through ``csv-validator:validate-csv ...``:: index;column;erronuous_value;label 0;not_existing_column;not_existing_column;missing-header 11;ID;999999;max-5-chars At index 0 (header row) the error with label "missing-header" shows up, because this header is not present in the CSV file. Since the ``ID`` column is present, we get the regular output for that column. Record validations --------- Record validations are validations that are performed on each record in the CSV file. The ``csv-validator`` has several build-in validations, that are unit-tested. However, it's also perfectly possible to define your own functions or use lambda functions for validations. Lambda validations ~~~~~~~~~~~~~ Literally every function can be used as a validation, including lambda functions. As long as the used function returns a non-nil value for correct values, and ``nil`` for incorrect values. For example:: (defparameter *test_suite* (list (list :column "country" :depends (list "country") :label "not-allowed-country" :logic (lambda (x) (position x '("AT" "BG" "CH") :test #'string=))))) This lambda function validates if the country code is in the allowed list of country codes (in the file ``energy_sample.csv``). The value from the column "country" will be used as ``x`` in the lambda function (as defined under ``:depends``). For example the country code "ES" will result in an error in the result file like this:: index;column;erronuous_value;label 12;country;ES;not-allowed-country Consider another example:: (defparameter *test_suite* (list (list :column "energy_source_level_0" :depends (list "energy_source_level_0") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))))) This lambda function validates the "energy_source_0" column and only allows for string-values "TRUE" or "FALSE", independent of case. The erronuous value "INVALID" in this column will result in:: index;column;erronuous_value;label 12;energy_source_level_0;INVALID;true-or-false Conditional validations ~~~~~~~~~~~~~ Conditional validations are validations that are only performed on a subset of the column that is true for the condition. Using this concept, it's possible to define validations that are dependent on values in other columns in the same record. It's best illustrated using an example. In this example, the column "weblink" cannot have the value "link unavailable" when the source of the data is "REE":: (defparameter *test_suite* (list (list :column "weblink" :depends (list "weblink" "source") :label "unavailable-not-allowed" :logic (lambda (weblink source) (or (not (string= source "REE")) (not (string= (string-downcase weblink) "link unavailable"))))))) This validation will only check if the value in the "weblink" column is "link unavailable" when the value in the "source" column is "REE". It'll give the following output upon running this suite against the ``energy_example.csv`` file:: index;column;erronuous_value;label 9;weblink;link unavailable;unavailable-not-allowed Consider this other example:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "number" :logic (lambda (x) (or (csv-validator:check-null x) (csv-validator:check-number-string x)))))) This validation validates if the values in the "capacity" column are numeric values, but only if these values are non-null values. Thus values like "null" or "3.14" will pass the validation, but a value like "ui44" won't (it's not a number). It uses several build-in validations that will be discussed in the sections below. Numerical validations ~~~~~~~~~~~~~ **Validate integers** Integer validation can be done using the build-in function: ``csv-validator:check-integer-string``. It works as follows:: (csv-validator:check-integer-string "54") --> t (csv-validator:check-integer-string "1.45") --> nil (csv-validator:check-integer-string "-99") --> t (csv-validator:check-integer-string "lk93") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "expect-integer" :logic (lambda (x) (csv-validator:check-integer-string x))))) **Validate floats** Float validation can be done using the build-in function: ``csv-validator:check-float-string``. It works as follows:: (csv-validator:check-float-string "54") --> nil (csv-validator:check-float-string "1.45") --> t (csv-validator:check-float-string "-99") --> nil (csv-validator:check-float-string "lk93") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "expect-float" :logic (lambda (x) (csv-validator:check-float-string x))))) **Validate scientific numbers** Scientific number validation can be done using the build-in function: ``csv-validator:check-scientific-number-string``. It works as follows:: (csv-validator:check-scientific-number-string "54") --> nil (csv-validator:check-scientific-number-string "1.45e10") --> t (csv-validator:check-scientific-number-string "-1.2E-1") --> t (csv-validator:check-scientific-number-string "lk93") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "expect-scientific-number" :logic (lambda (x) (csv-validator:check-scientific-number-string x))))) **Validate numerical values** Number validation can be done using the build-in function: ``csv-validator:check-number-string``. It works as follows:: (csv-validator:check-number-string "54") --> t (csv-validator:check-number-string "1.45") --> t (csv-validator:check-number-string "-1.2E-1") --> t (csv-validator:check-number-string "lk93") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "expect-numerical-value" :logic (lambda (x) (csv-validator:check-number-string x))))) **Validate if number is within range** Number in range validation can be done using the build-in function: ``csv-validator:check-number-in-range``. It works as follows:: (csv-validator:check-number-in-range "23" 0 50) --> t (csv-validator:check-number-in-range "3.14" -4 10) --> t (csv-validator:check-number-in-range "100" 0 nil) --> t ; no upper bound (csv-validator:check-number-in-range "10" nil 5) --> nil ; no lower bound (csv-validator:check-number-in-range "kldsj" 0 50) --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "capacity" :depends (list "capacity") :label "greater-than-zero" :logic (lambda (x) (or (csv-validator:check-null x) (csv-validator:check-number-in-range x -1 nil)))) Date validations ~~~~~~~~~~~~~ **Validate if value is date** Date validation can be done using the build-in function: ``csv-validator:check-date-parsable``. It expects dates in the formats ``yyyy-mm-dd`` and ``yyyy-mm-dd hh:mm:ss`` and works as follows:: (csv-validator:check-date-parsable "2020-01-01") --> t (csv-validator:check-date-parsable "2020-01-01 00:00:00") --> t (csv-validator:check-date-parsable "2020-02-30") --> nil (csv-validator:check-date-parsable "2022/1/1") --> nil (csv-validator:check-date-parsable "lksdjf") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "reporting_date" :depends (list "reporting_date") :label "date-format-yyyy-mm-dd" :logic (symbol-function 'csv-validator:check-date-parsable)))) **Validate if value is timezone date** Date validation can be done using the build-in function: ``csv-validator:check-tz-parsable``. It expects dates in the format ``yyyy-mm-ddThh:mm:ssZ`` and works as follows:: (csv-validator:check-date-parsable "2020-01-01") --> nil (csv-validator:check-date-parsable "2020-01-01T00:00:00Z") --> t (csv-validator:check-date-parsable "2020-02-30") --> nil (csv-validator:check-date-parsable "2022/1/1") --> nil (csv-validator:check-date-parsable "lksdjf") --> nil Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "reporting_date" :depends (list "reporting_date") :label "date-format-yyyy-mm-ddThh:mm:ssZ" :logic (symbol-function 'csv-validator:check-tz-parsable)))) **Validate if date is not in future** To validate if a date is not in the future, use the build-in function: ``csv-validator:check-date-before-today``. It expects dates in the format ``yyyy-mm-dd`` and works as follows:: (csv-validator:check-date-before-today "2020-01-01") --> t (csv-validator:check-date-before-today "2023-01-20") --> t (csv-validator:check-date-before-today "2028-11-20") --> nil (csv-validator:check-date-before-today "klsd") --> ignored (csv-validator:check-date-before-today "2023/01/20") --> ignored Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "reporting_date" :depends (list "reporting_date") :label "date-format-yyyy-mm-ddThh:mm:ssZ" :logic (symbol-function 'csv-validator:check-date-before-today)))) **Validate if date A is greater than date B** To validate if a date A is greater than date B, use the build-in function: ``csv-validator:check-compare-two-dates``. It expects dates in the format ``yyyy-mm-dd`` and works as follows:: (csv-validator:check-compare-two-dates "2022-02-02" "2022-03-03") --> nil (csv-validator:check-compare-two-dates "2022-12-02" "1999-03-03") --> t (csv-validator:check-compare-two-dates "ksd" "nkdsl") --> ignored Use it in a validation suite like this:: (defparameter *test_suite* (list (list :column "date_col_A" :depends (list "date_col_A" "date_col_B") :label "date_A > date_B" :logic (lambda (date-A date-B) (csv-validator:check-compare-two-dates date-A date-B))))) Missing data validations ~~~~~~~~~~~~~ Missing value validation can be done using the build-in functions: ``csv-validator:check-null`` and ``csv-validator:check-not-null``. They work as follows:: (csv-validator:check-null "1.23") -> nil (csv-validator:check-null "NA") -> t (csv-validator:check-null "klsdjfkl") -> nil (csv-validator:check-null "") -> t (csv-validator:check-not-null "1.23") -> t (csv-validator:check-not-null "NA") -> nil (csv-validator:check-not-null "klsdjfkl") -> t (csv-validator:check-not-null "") -> nil Use it in a validation suite like this:: ;; validates column for missing values (defparameter *test_suite* (list (list :column "ID" :depends (list "ID") :label "ID-not-missing" :logic (lambda (x) (csv-validator:check-not-null x))))) Example validation suite ~~~~~~~~~~~~~ Below the validation suite used to validate the energy data:: (defparameter *energy_suite* (list (list :column "ID" :depends (list "ID") :label "integer" :logic (symbol-function 'csv-validator:check-integer-string)) (list :column "technology" :depends (list "technology") :label "string-length" :logic (lambda (x) (< (length x) 50))) (list :column "source" :depends (list "source") :label "not-null" :logic (symbol-function 'csv-validator:check-not-null)) (list :column "source_type" :depends (list "source_type") :label "not-null" :logic (symbol-function 'csv-validator:check-not-null)) (list :column "source" :depends (list "source") :label "not-null" :logic (symbol-function 'csv-validator:check-not-null)) (list :column "weblink" :depends (list "weblink" "source") :label "unavailable-not-allowed" :logic (lambda (weblink source) (or (not (string= source "REE")) (not (string= (string-downcase weblink) "link unavailable"))))) (list :column "year" :depends (list "year") :label "integer-YYYY" :logic (lambda (x) (and (= (length x) 4) (csv-validator:check-integer-string x)))) (list :column "country" :depends (list "country") :label "not-allowed-country" :logic (lambda (x) (position x '("AL" "AT" "BA" "BE" "BG" "CH" "CY" "CZ" "DE" "DK" "EE" "ES" "FI" "FR" "GB" "GE" "GR" "HR" "HU" "IE" "IS" "IT" "LT" "LU" "LV" "MD" "ME" "MK" "MT" "NI" "NL" "NO" "PL" "PT" "RO" "RS" "SE" "SI" "SK" "TR" "UA" "XK") :test #'string=))) (list :column "capacity" :depends (list "capacity") :label "number" :logic (lambda (x) (or (csv-validator:check-null x) (csv-validator:check-number-string x)))) (list :column "capacity" :depends (list "capacity") :label "greater-than-zero" :logic (lambda (x) (or (csv-validator:check-null x) (csv-validator:check-number-in-range x -1 nil)))) (list :column "capacity_definition" :depends (list "capacity_definition") :label "missing-category" :logic (lambda (x) (let ((lowercase-x (string-downcase x))) (or (string= lowercase-x "gross capacity") (string= lowercase-x "net capacity") (string= lowercase-x "unknown"))))) (list :column "energy_source_level_0" :depends (list "energy_source_level_0") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))) (list :column "energy_source_level_1" :depends (list "energy_source_level_0") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))) (list :column "energy_source_level_2" :depends (list "energy_source_level_0") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))) (list :column "energy_source_level_3" :depends (list "energy_source_level_0") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))) (list :column "technology_level" :depends (list "technology_level") :label "true-or-false" :logic (lambda (x) (or (string= (string-downcase x) "true") (string= (string-downcase x) "false")))) (list :column "reporting_date" :depends (list "reporting_date") :label "date-not-in-future" :logic (symbol-function 'csv-validator:check-date-before-today)) (list :column "reporting_date" :depends (list "reporting_date") :label "date-format-yyyy-mm-dd" :logic (symbol-function 'csv-validator:check-date-parsable)))) Benchmark --------- This benchmark was performed using: +---------------------+-----------------------------------------------------------------------+ | CPU | AMD Ryzen™ 7 PRO 6850U with Radeon™ Graphics × 16 | +---------------------+-----------------------------------------------------------------------+ | LISP implementation | SBCL 2.2.11 | +---------------------+-----------------------------------------------------------------------+ | OS | Fedora Linux 37 (Workstation Edition) | +---------------------+-----------------------------------------------------------------------+ | | name: National generation capacity | | +-----------------------------------------------------------------------+ | | description: Aggregated generation capacity by technology and country | | Dataset +-----------------------------------------------------------------------+ | | consulted on: 2023-01-24 | | +-----------------------------------------------------------------------+ | | link: source_ | +---------------------+-----------------------------------------------------------------------+ Csv-files with filesizes up to 200MB were validated using a validation suite with 18 or 36 validations and using 1 core or 4 cores. Each combination was sampled three times and averaged thereafter. The results are shown in the table below: +---------------+---------------------------------------------+ | filesize (MB) | time (s) | | +----------------------+----------------------+ | | 18 validations | 36 validations | | +----------+-----------+----------+-----------+ | | 1 thread | 4 threads | 1 thread | 4 threads | +===============+==========+===========+==========+===========+ | 0.0032 | 0.009 | 0.015 | 0.011 | 0.018 | +---------------+----------+-----------+----------+-----------+ | 0.026 | 0.009 | 0.012 | 0.009 | 0.014 | +---------------+----------+-----------+----------+-----------+ | 0.2754 | 0.019 | 0.019 | 0.029 | 0.021 | +---------------+----------+-----------+----------+-----------+ | 1.8 | 0.122 | 0.062 | 0.205 | 0.090 | +---------------+----------+-----------+----------+-----------+ | 6.8 | 0.428 | 0.185 | 0.748 | 0.289 | +---------------+----------+-----------+----------+-----------+ | 13.6 | 0.838 | 0.349 | 1.466 | 0.540 | +---------------+----------+-----------+----------+-----------+ | 109.2 | 6.623 | 2.369 | 11.765 | 3.846 | +---------------+----------+-----------+----------+-----------+ | 218.4 | 13.259 | 4.712 | 23.908 | 7.647 | +---------------+----------+-----------+----------+-----------+ Next the average speed (in MB/s) was calculated skipping filesizes lower than 1MB. The table below shows the results: +---------------+----------------------+ | | average speed (MB/s) | | n validations +----------+-----------+ | | 1 thread | 4 threads | +===============+==========+===========+ | 18 | 15.971 | 39.432 | +---------------+----------+-----------+ | 36 | 9.109 | 25.139 | +---------------+----------+-----------+ *Conclusions* - For file sizes up to 1MB it doesn't make sense to run the csv-validator using multiple threads. - The more validations are applied to the csv data, the slower the csv-validator is. Contributing --------- Feel free to create a pull-request on this code-base. Please make sure that all the tests pass (run: ``make test``) and add new tests for new validations Contact: reach out to me at: k.vandenberg@insertdata.nl Code repository has been created using cookiecutter with template: https://github.com/vindarel/cl-cookieproject. .. _source: https://data.open-power-system-data.org/national_generation_capacity/2020-10-01

Dependencies (4)

  • fiveam
  • local-time
  • lparallel
  • parse-float

Dependents (0)

    • GitHub
    • Quicklisp