SeqTailor Data Collection and Pre-Processing





SeqTailor Sequence Extraction Framework





Input / Output Description

Module | Input Type | Input Format & Sample | Output Type | Output Format
DNA Sequence Extraction | Genomic Variants (VCF) | the first 5 columns in VCF format: 'CHROM POS ID REF ALT' | ref./alt. DNA sequences in forward/reverse strands | FASTA
DNA Sequence Extraction | Genomic Ranges (BED) | the first 3 columns in BED format: 'CHROM START END' | ref. DNA sequences in forward/reverse strands | FASTA
Protein Sequence Extraction | Genomic Variants (VCF) | the first 5 columns in VCF format: 'CHROM POS ID REF ALT' | ref./alt. amino acid sequences | FASTA
(The input fields should be tab- or space-delimited. The ID field must not be empty; if no ID information is available, fill this field with a dot '.' symbol.)
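The input rules above can be illustrated with a minimal parsing sketch. The function names `parse_variant_line` and `parse_range_line` are hypothetical and are not part of SeqTailor; the sketch only assumes that fields are separated by tabs or spaces and that a missing ID has already been written as '.'.

```python
def parse_variant_line(line):
    """Return (chrom, pos, id, ref, alt) from the first 5 VCF columns."""
    fields = line.split()  # splits on any run of tabs or spaces
    if len(fields) < 5:
        raise ValueError(
            "expected 'CHROM POS ID REF ALT' (use '.' for a missing ID): %r" % line
        )
    chrom, pos, var_id, ref, alt = fields[:5]
    return chrom, int(pos), var_id, ref, alt


def parse_range_line(line):
    """Return (chrom, start, end) from the first 3 BED columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected 'CHROM START END': %r" % line)
    chrom, start, end = fields[:3]
    return chrom, int(start), int(end)


# Tab- and space-delimited lines parse to the same record.
print(parse_variant_line("chr2\t47635062\t.\tT\tG"))
print(parse_variant_line("chr2 47635062 . T G"))
print(parse_range_line("chr13\t32954282\t32954300"))
```

Because an empty ID cannot be told apart from a missing column once the line is split on whitespace, the sketch rejects lines with fewer than five fields rather than guessing.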



Supported Organisms

Organism | Sci. Name | Family | Assembly | Genome Regions | # Genes | # Transcripts | # Splice Sites | # Proteins
Human | Homo sapiens | Mammal | GRCh37 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT | 35,091 | 138,613 | 2,018,124 | 95,304
Human | Homo sapiens | Mammal | GRCh38 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT, and 329 alt loci | 36,100 | 155,341 | 2,259,292 | 107,498
Chimpanzee | Pan troglodytes | Mammal | Pan_tro_3.0 | 1,2A,2B,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,MT | 17,381 | 41,504 | 947,134 | 41,468
Mouse | Mus musculus | Mammal | GRCm38 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT | 36,148 | 94,462 | 1,350,332 | 65,679
Rat | Rattus norvegicus | Mammal | Rnor_6.0 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,X,Y,MT | 23,671 | 31,110 | 588,376 | 28,897
Cow | Bos taurus | Mammal | ARS-UCD1.2 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,X,MT | 16,515 | 31,214 | 785,250 | 31,188
Chicken | Gallus gallus | Bird | GRCg6a | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,30,31,32,33,W,Z,MT | 12,303 | 22,158 | 573,360 | 22,158
Lizard | Anolis carolinensis | Reptile | AnoCar2.0 | 1,2,3,4,5,6,MT | 6,105 | 6,321 | 155,556 | 6,321
Zebrafish | Danio rerio | Fish | GRCz11 | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,MT | 25,864 | 49,308 | 908,190 | 45,633
Fruitfly | Drosophila melanogaster | Fly | BDGP6 | 2L,2R,3L,3R,4,X,Y,MT | 14,226 | 30,804 | 363,118 | 30,478
Arabidopsis | Arabidopsis thaliana | Plant | TAIR10 | 1,2,3,4,5,MT | 12,702 | 23,376 | 336,118 | 23,376
Rice | Oryza sativa | Plant | IRGSP-1.0 | 1,2,3,4,5,6,7,8,9,10,11,12,MT | 2,779 | 12,455 | 134,960 | 12,452




Standalone Programs

The programs are written in Python 2.7 and require the Biopython library. Three standalone programs are currently provided:
(1) ‘SeqTailor_DNA_VCF_independent.py’ extracts DNA sequences for genomic variants in VCF files, treating each variant independently;
(2) ‘SeqTailor_DNA_VCF_neighborhood.py’ extracts DNA sequences for genomic variants in VCF files, taking into account the neighboring variants that fall inside the given window size;
(3) ‘SeqTailor_DNA_BED.py’ extracts DNA sequences from genomic ranges in BED files.
The annotation of the nearest splice sites in DNA sequence extraction and the protein sequence extraction module are not yet available in the standalone version.
All arguments are required except the OUTPUT and REPORT filenames. If the user does not give these two arguments, the program appends the suffixes (.DNA.fasta) and (.report.txt), respectively, to the INPUT filename.
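The default naming rule can be sketched as follows. This is only an illustration: the helper `default_output_names` is hypothetical, and it assumes the suffix is appended to the full INPUT filename (extension included) rather than replacing the extension, which the text above does not specify.

```python
def default_output_names(input_name, output=None, report=None):
    """Fill in OUTPUT and REPORT filenames when they are not given."""
    if output is None:
        # Assumption: the suffix is appended to the whole input filename.
        output = input_name + ".DNA.fasta"
    if report is None:
        report = input_name + ".report.txt"
    return output, report


print(default_output_names("mutations.vcf"))
print(default_output_names("mutations.vcf", output="result.fa"))
```

Under this assumption, an input named mutations.vcf would yield mutations.vcf.DNA.fasta and mutations.vcf.report.txt unless the user supplies explicit names.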


DNA sequence extraction for genomic variants in VCF files, independently.
SeqTailor_DNA_VCF_independent.py

usage: SeqTailor_DNA_VCF_independent.py [-h] [-g GENOME] [-c {0,1}]
                                        [-s {BOTH,FORWARD,REVERSE}]
                                        [-wd WINDOW_DOWN] [-wu WINDOW_UP]
                                        [-q {BOTH,WT,MT}] [-i INPUT]
                                        [-o OUTPUT] [-r REPORT]

SeqTailor DNA seqeuence extraction for genomic variants (independent) in VCF format

arguments:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        genome sequence filename (FASTA / FASTA.GZ)
  -c {0,1}, --coordinate {0,1}
                        coordinate indexing
  -s {BOTH,FORWARD,REVERSE}, --strand {BOTH,FORWARD,REVERSE}
                        strand
  -wd WINDOW_DOWN, --window_down WINDOW_DOWN
                        window size downstream
  -wu WINDOW_UP, --window_up WINDOW_UP
                        window size upstream
  -q {BOTH,WT,MT}, --seq_type {BOTH,WT,MT}
                        output sequence type
  -i INPUT, --input INPUT
                        input filename (VCF)
  -o OUTPUT, --output OUTPUT
                        output filename. default: sufix (.DNA.fasta)
  -r REPORT, --report REPORT
                        report filename. default: sufix (.report.txt)


example:
python SeqTailor_DNA_VCF_independent.py -g genome.fa -c 1 -s BOTH -wd 25 -wu 25 -q BOTH -i mutations.vcf
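To make the options concrete, here is a minimal sketch of what an independent extraction might compute for a single variant. It is not the SeqTailor implementation: the function names are hypothetical, the genome is a toy in-memory string rather than a FASTA file read with Biopython, "upstream" is assumed to mean lower coordinates on the forward strand, and the windows are assumed to be counted from the edges of the REF allele.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-strand sequence for an uppercase/lowercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def extract_independent(chrom_seq, pos, ref, alt, window_up, window_down,
                        coordinate=1):
    """Return (WT, MT) forward-strand sequences around one variant."""
    start = pos - coordinate          # 0-based index of the first REF base
    end = start + len(ref)            # index just past the REF allele
    if chrom_seq[start:end].upper() != ref.upper():
        raise ValueError("REF does not match the genome at this position")
    left = chrom_seq[max(0, start - window_up):start]
    right = chrom_seq[end:end + window_down]
    return left + ref + right, left + alt + right


toy_chrom = "ACGTACGTAC"
wt, mt = extract_independent(toy_chrom, 5, "A", "G", 2, 2)
print(wt, mt)                   # GTACG GTGCG
print(reverse_complement(wt))   # CGTAC
```

In this reading, -q selects which of the WT and MT sequences are written, and -s selects the forward sequence, its reverse complement, or both.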

DNA sequence extraction for genomic variants in VCF files, considering the neighboring variants within the given window.
SeqTailor_DNA_VCF_neighborhood.py

usage: SeqTailor_DNA_VCF_neighborhood.py [-h] [-g GENOME] [-c {0,1}]
                                         [-s {BOTH,FORWARD,REVERSE}]
                                         [-wd WINDOW_DOWN] [-wu WINDOW_UP]
                                         [-q {BOTH,WT,MT}] [-i INPUT]
                                         [-o OUTPUT] [-r REPORT]

SeqTailor DNA seqeuence extraction for genomic variants (neghbourhood) in VCF format

arguments:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        genome sequence filename (FASTA / FASTA.GZ)
  -c {0,1}, --coordinate {0,1}
                        coordinate indexing
  -s {BOTH,FORWARD,REVERSE}, --strand {BOTH,FORWARD,REVERSE}
                        strand
  -wd WINDOW_DOWN, --window_down WINDOW_DOWN
                        window size downstream
  -wu WINDOW_UP, --window_up WINDOW_UP
                        window size upstream
  -q {BOTH,WT,MT}, --seq_type {BOTH,WT,MT}
                        output sequence type
  -i INPUT, --input INPUT
                        input filename (VCF)
  -o OUTPUT, --output OUTPUT
                        output filename. default: sufix (.DNA.fasta)
  -r REPORT, --report REPORT
                        report filename. default: sufix (.report.txt)


example:
python SeqTailor_DNA_VCF_neighborhood.py -g genome.fa -c 1 -s BOTH -wd 25 -wu 25 -q BOTH -i mutations.vcf
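The neighborhood mode can be pictured with the following sketch, which builds the mutant sequence for one focal variant while also applying every other variant whose REF allele lies entirely inside the window. This is an illustration under stated assumptions, not the SeqTailor code: the function name is hypothetical, the genome is a toy string, all variants are on the same chromosome, and variants are assumed not to overlap each other.

```python
def extract_with_neighbors(chrom_seq, variants, focal_index,
                           window_up, window_down, coordinate=1):
    """Return (WT, MT) for the focal variant, applying in-window neighbors.

    variants is a list of (pos, ref, alt) tuples on one chromosome.
    """
    pos, ref, _ = variants[focal_index]
    focal_start = pos - coordinate
    w_start = max(0, focal_start - window_up)
    w_end = min(len(chrom_seq), focal_start + len(ref) + window_down)
    wt = chrom_seq[w_start:w_end]

    # Collect every variant (focal one included) fully inside the window.
    inside = []
    for v_pos, v_ref, v_alt in variants:
        v_start = v_pos - coordinate
        if w_start <= v_start and v_start + len(v_ref) <= w_end:
            inside.append((v_start - w_start, v_ref, v_alt))

    # Apply from right to left so earlier offsets stay valid after indels.
    mt = wt
    for offset, v_ref, v_alt in sorted(inside, reverse=True):
        mt = mt[:offset] + v_alt + mt[offset + len(v_ref):]
    return wt, mt


toy_chrom = "ACGTACGTAC"
toy_variants = [(3, "G", "T"), (5, "A", "G"), (10, "C", "A")]
print(extract_with_neighbors(toy_chrom, toy_variants, 1, 2, 2))
# ('GTACG', 'TTGCG'): the variant at position 3 is inside the window,
# the variant at position 10 is not.
```

Applying the substitutions from the rightmost offset backwards is a design choice that avoids recomputing coordinates when an insertion or deletion changes the sequence length.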

DNA sequence extraction from genomic ranges in BED files.
SeqTailor_DNA_BED.py

usage: SeqTailor_DNA_BED.py [-h] [-g GENOME] [-c {0,1}]
                            [-s {BOTH,FORWARD,REVERSE}] [-i INPUT] [-o OUTPUT]
                            [-r REPORT]

SeqTailor DNA seqeuence extraction for genomic ranges in BED format

arguments:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        genome sequence filename (FASTA / FASTA.GZ)
  -c {0,1}, --coordinate {0,1}
                        coordinate indexing
  -s {BOTH,FORWARD,REVERSE}, --strand {BOTH,FORWARD,REVERSE}
                        strand
  -i INPUT, --input INPUT
                        input filename (BED)
  -o OUTPUT, --output OUTPUT
                        output filename. default: sufix (.DNA.fasta)
  -r REPORT, --report REPORT
                        report filename. default: sufix (.report.txt)


example:
python SeqTailor_DNA_BED.py -g genome.fa -c 1 -s BOTH -i ranges.bed
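A range extraction of this kind can be sketched as below. The function names are hypothetical and the coordinate semantics are an assumption, since the help text only says "coordinate indexing": here -c 0 is read as the usual BED convention (0-based start, end exclusive) and -c 1 as 1-based with both ends inclusive. The genome is a toy string rather than a FASTA file.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-strand sequence for a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def extract_range(chrom_seq, start, end, coordinate=0, strand="BOTH"):
    """Return a dict of strand name -> sequence for one genomic range."""
    if coordinate == 1:
        start -= 1  # assumed: 1-based inclusive -> 0-based half-open
    forward = chrom_seq[start:end]
    result = {}
    if strand in ("BOTH", "FORWARD"):
        result["FORWARD"] = forward
    if strand in ("BOTH", "REVERSE"):
        result["REVERSE"] = reverse_complement(forward)
    return result


toy_chrom = "ACGTACGTAC"
print(extract_range(toy_chrom, 2, 6, coordinate=1))
# {'FORWARD': 'CGTAC', 'REVERSE': 'GTACG'}
print(extract_range(toy_chrom, 1, 6, coordinate=0, strand="FORWARD"))
# {'FORWARD': 'CGTAC'}
```

Both calls describe the same five bases, which shows why the indexing option must match how the BED file was produced.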



Case Study

SeqTailor Extraction | Gene | Variant in VCF | Disease | Consequence | Bridged Application | ClinVar Link
DNA Seq | MSH2 | chr2 47635062 . T G | Lynch syndrome | intronic, new donor site | NetGene2 |
DNA Seq | BRCA2 | chr13 32954282 . GG TA | hereditary breast-ovarian cancer | essential splicing | NNSPLICE |
DNA Seq | IL2RG | chrX 70330553 . T C | X-linked severe combined immunodeficiency | intronic, new acceptor site | HSF |
Protein Seq | BRAF | chr7 140481402 . C T | cardio-facio-cutaneous syndrome | missense | PFam, PolyPhen2 |
Protein Seq | GJB2 | chr13 20763554 . AG G | hearing loss and deafness | frameshift | n/a |




Browser Compatibility

OS | Version | Chrome | Firefox | Edge | Safari
MacOS | Mojave | 71.0 | 64.0 | n/a | 12.0
Linux | RHEL 6.10 | not tested | 60.2 | n/a | n/a
Linux | Ubuntu 18.04 | not tested | 64.0 | n/a | n/a
Windows | Win10 | 71.0 | 64.0 | 18 | n/a


