datakit
CSV file manipulation and more.
Please use my other tool, csvtk: another cross-platform, efficient and practical CSV/TSV
toolkit.
intersection
Intersection of multiple (>=2) files.
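As a rough illustration of the idea only (not datakit's actual implementation), the intersection of several files can be sketched in Python by treating each file as a set of lines. The function name `intersect_lines` is hypothetical, and comparing whole lines is an assumption.

```python
from functools import reduce


def intersect_lines(*line_lists):
    """Return the lines present in every input, in the order of the first input."""
    if len(line_lists) < 2:
        raise ValueError("need at least two inputs")
    # Lines that occur in all inputs.
    common = reduce(set.intersection, (set(lines) for lines in line_lists))
    seen = set()
    result = []
    for line in line_lists[0]:
        # Keep first-input order and report each common line once.
        if line in common and line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

Each argument is expected to be a list of already-stripped lines, for example `open(path).read().splitlines()`.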
unique
Like uniq, but with no need for pre-sorting.
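A minimal Python sketch of deduplication without pre-sorting, assuming whole lines are compared and the first occurrence is kept. The name `unique_lines` is hypothetical and this is not necessarily how the tool itself works.

```python
def unique_lines(lines):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

Because already-seen lines are remembered in a set, the input order is preserved and no sort is required, at the cost of holding every distinct line in memory.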
csv2tab
csv2tab
positional arguments:
csvfile Input file(s)
optional arguments:
-h, --help show this help message and exit
-f F Field separator [,]
-q Q Quote char ["]
csv_grep.py
**Please use the Golang version of csv_grep.**
Grep CSV files (tab-delimited by default) by exact match or by regular expression. Multiple
key columns (indices) are supported. Query patterns can be given on the command line or in
a file.
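The matching logic described above can be sketched in Python as follows. This is an illustrative assumption, not the code of csv_grep itself: the function name `grep_rows` and its parameters are hypothetical, and in regular-expression mode the sketch assumes one expression per key column.

```python
import re


def grep_rows(rows, patterns, key_cols=(1,), use_regexp=False, invert=False):
    """Yield rows whose key fields match any pattern.

    rows      -- iterable of lists of strings (already parsed records)
    patterns  -- iterable of tuples, one value per key column
    key_cols  -- 1-based column numbers, in the spirit of the -k option
    """
    patterns = [tuple(p) for p in patterns]
    if use_regexp:
        compiled = [tuple(re.compile(x) for x in p) for p in patterns]
    else:
        exact = set(patterns)
    for row in rows:
        key = tuple(row[c - 1] for c in key_cols)
        if use_regexp:
            # A pattern hits when every key field matches its expression.
            hit = any(all(rx.search(f) for rx, f in zip(p, key)) for p in compiled)
        else:
            hit = key in exact
        # With invert=True, keep only the rows that did NOT match.
        if hit != invert:
            yield row
```

Rows can be produced with the standard `csv` module, for example `csv.reader(f, delimiter="\t")` for a tab-delimited file.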
Usage:
usage: csv_grep [-h] [-v] [-o OUTFILE] [-k KEY] [-H] [-F FS] [-Fo FS_OUT]
[-Q QC] [-t] [-p [PATTERN]] [-pf [PATTERNFILE]] [-pk [PK]]
[-r] [-d] [-i]
[csvfile [csvfile ...]]
https://github.com/shenwei356/datakit 1/4
05/11/2018 GitHub - shenwei356/datakit: CSV/TSV file manipulation and more. Please use my another tool: csvtk, https://gith…
positional arguments:
csvfile Input file(s)
optional arguments:
-h, --help show this help message and exit
-v, --verbose Verbosely print information
-o OUTFILE, --outfile OUTFILE
Output file [STDOUT]
-k KEY, --key KEY Column number of key in csvfile. Multiple values should
be separated by comma
-H, --ignoretitle Ignore title
-F FS, --fs FS Field separator [,]
-Fo FS_OUT, --fs-out FS_OUT
Field separator of output [same as --fs]
-Q QC, --qc QC Quote char ["]
-t Field separator is "\t". Quote char is "\t"
-p [PATTERN], --pattern [PATTERN]
Query pattern
-pf [PATTERNFILE], --patternfile [PATTERNFILE]
Pattern file
-pk [PK] Column number of key in pattern file. Multiple values
should be separated by comma
-r, --regexp Pattern is regular expression
-d, --speedup Delete matched pattern when matching one record
-i, --invert Invert match (do not match)
Examples
1. For a table file. Note that the 3rd column of the 4th line contains "\t".
$ cat testdata/data.tab
column1   column 2   3rd c
str       123        abde
123       134        我
245       135        "string with tab"
Find lines whose 2nd column consists of digits only, ignoring the title line.
Find lines whose ID (the first column, by default) is in (or NOT in) a given ID file.
2. Find common records with the same headers in two FASTA files. fasta2tab transforms the
FASTA format into a two-column table: the first column is the header and the second is the
sequence. tab2fasta just transforms the table back to FASTA format.
3. Find common records of two GTF files. Columns 1, 4, 5 and 7 together make up the key of a
record.
cat a.gff | csv_grep -t -k 1,4,5,7 -pk 1,4,5,7 -pf b.gff > commom.gff
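To make the idea of a multi-column key concrete, here is a small Python sketch that keeps the records of one table whose key also occurs in a second table. It is an illustration under assumptions, not the tool's implementation: `common_records` and its parameter names are hypothetical, and the defaults simply mirror the column numbers used in the command above.

```python
def common_records(rows, pattern_rows, key_cols=(1, 4, 5, 7), pattern_key_cols=(1, 4, 5, 7)):
    """Return the rows whose composite key also occurs in pattern_rows.

    Column numbers are 1-based; each key is the tuple of the selected fields.
    """
    wanted = {tuple(r[c - 1] for c in pattern_key_cols) for r in pattern_rows}
    return [r for r in rows if tuple(r[c - 1] for c in key_cols) in wanted]
```

Two records are treated as the same only when all four key fields agree, so differences in the other columns are ignored.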
csv_grep
Golang version. Faster than the Python version, thanks to concurrency.
Usage:
NAME:
csv_grep - grep for csv format
USAGE:
csv_grep [global options] command [command options] [arguments...]
VERSION:
1.0
AUTHOR(S):
Wei Shen <https://github.com/shenwei356/datakit>
COMMANDS:
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
-k, --key "1" column number of key in csvfile. Multiple values sho
-H, --ignoretitle ignore title
--fs "," field separator [,]
--fs-out field separator of output [same as --fs]
-t, --tab field separator is "\t". Quote char is "\t"
-p, --pattern query pattern
--pf, --patternfile pattern file
--pk "1" column number of key in pattern file. Multiple value
--pfs "," field separator of pattern file [,]
-r, --use-regexp use regular expression
-d, --speedup delete matched pattern when matching one record
-i, --invert invert match (do not match)
-j, --ncpus "4" CPU number [4]
csv_join v2.0
Merge CSV files. Multiple keys supported. v2.0
Usage
usage: csv_join [-h] [-k [KEY [KEY ...]]] [-f F] [-q Q] [-of OF] [-t] [-s]
[-keep]
csvfile [csvfile ...]
positional arguments:
csvfile CSV files
optional arguments:
-h, --help show this help message and exit
-k [KEY [KEY ...]], --key [KEY [KEY ...]]
column number of key in csvfile. [1 for all files]
-f F field separator [,]
-q Q quote char ["]
-of OF field separator [,]
-t quote char in all files are "\t"
-s, --simplify simplify the result, by removing keys
-keep, --keep-unmatched
keep unmatched record in PREVIOUS files
Examples
1. for a lot of tab-delimited files in two-column key-value format
csv_join -t testdata/*.tsv
1     123      1     234      1     what
2     abc      2     opq      2     jjj
key   value1   key   value2   key   value3

csv_join -t testdata/*.tsv -keep
1     123      1     234      1     what
2     abc      2     opq      2     jjj
3     ccc
key   value1   key   value2   key   value3

csv_join -t testdata/*.tsv -s
1     123      234      what
2     abc      opq      jjj
key   value1   value2   value3
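The behaviour shown in these examples can be approximated with the Python sketch below. It is a simplified illustration, not the csv_join source: `join_tables` and its parameters are hypothetical names, a single key column is assumed, row order follows the first table rather than the order printed above, and when a key is repeated within one table the last row wins.

```python
def join_tables(tables, key_col=1, simplify=False, keep_unmatched=False):
    """Join tables (lists of rows) on a 1-based key column.

    simplify       -- drop the key field from every table after the first
    keep_unmatched -- keep rows from previous tables that have no partner
    """
    k = key_col - 1
    first, rest = tables[0], tables[1:]
    joined = {row[k]: list(row) for row in first}  # dicts keep insertion order
    for table in rest:
        index = {row[k]: row for row in table}
        merged = {}
        for key, acc in joined.items():
            if key in index:
                extra = list(index[key])
                if simplify:
                    del extra[k]  # the key is already present in acc
                merged[key] = acc + extra
            elif keep_unmatched:
                merged[key] = acc  # carried over without padding
        joined = merged
    return list(joined.values())
```

Feeding it three two-column tables like those implied by the example output (a guess at the contents of testdata/*.tsv) gives the same records as the three commands above.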