Preparation Steps

1. For submission to TSA, certain formatting conventions must be followed. A submission that does not follow them will fail, so it is important to format the data properly before submitting.

        a.  Contig criteria: Contigs must be longer than 199 bp; must not contain more than 10% N’s; must not start or end with N; and must not contain stretches of more than 14 N’s in a row. Contigs that do not meet all of these requirements should be removed from the assembly before attempting TSA submission.
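
The four contig criteria above can be applied automatically. What follows is a minimal sketch rather than an official NCBI tool: the function name filter_contigs is hypothetical, N is counted case-insensitively, and each surviving sequence is written on a single line. It assumes a standard FASTA file in which every record starts with a ">" line.

```shell
# Hypothetical helper: filter_contigs IN.fasta > OUT.fasta
# Keeps only contigs that meet all four criteria listed in step 1a.
filter_contigs() {
    awk '
    # Evaluate one complete record (header plus concatenated sequence).
    function flush_record(   len, n, u, tmp) {
        if (header == "") return
        u = toupper(seq)
        len = length(u)
        tmp = u
        n = gsub(/N/, "", tmp)                 # number of N characters
        if (len > 199 &&                       # longer than 199 bp
            n * 10 <= len &&                   # no more than 10% N
            substr(u, 1, 1) != "N" &&          # does not start with N
            substr(u, len, 1) != "N" &&        # does not end with N
            index(u, "NNNNNNNNNNNNNNN") == 0)  # no run of 15 or more N
            print header "\n" seq
    }
    /^>/ { flush_record(); header = $0; seq = ""; next }
    { seq = seq $0 }                           # join wrapped sequence lines
    END { flush_record() }
    ' "$1"
}
```

For example, filter_contigs Gcitizenii_Spec7_Trinity.fasta > filtered.fasta would write the passing contigs to filtered.fasta (an illustrative output name). It is worth comparing the contig counts before and after filtering to see how many were removed.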

        b.   Sequence definition line: This is the header line of each contig, which starts with “>” and ends with a newline character. The definition line must not be longer than 50 characters, including spaces, and must begin with a unique identifier (e.g. “>contig_001”). Additional modifiers can be used (see https://www.ncbi.nlm.nih.gov/Sequin/modifiers.html for a complete list). These follow the format “[modifier=text]” and can include organism, sex, and other details. NCBI advises that all TSA submissions include “[moltype=mRNA]” and “[tech=TSA]” in the definition line. Many assembly programs will produce definition lines that contain information about contig length, assembly path, etc. All these values must be removed. See the accompanying webpage for more details.

To see the current definition line:


Commands used:

head Gcitizenii_Spec7_Trinity.fasta


To modify the definition line to meet NCBI recommendations:


Commands used:

sed 's/len.*$/[organism=Good citizenii] [bioproject=PRJNA472791] [moltype=transcribed_RNA] [tech=TSA]/g' Gcitizenii_Spec7_Trinity.fasta \
> Gcitizenii_Spec7_assembly.fasta


To see the new definition line:


Commands used:

head Gcitizenii_Spec7_assembly.fasta
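
Because every definition line must begin with a unique identifier, it can be useful to check for duplicates after editing the headers. The following is a minimal sketch, assuming the identifier is the text between ">" and the first whitespace character; the function name duplicate_ids is hypothetical.

```shell
# Hypothetical helper: duplicate_ids FILE
# Prints every identifier (the first word after ">") that occurs more
# than once; prints nothing when all identifiers are unique.
duplicate_ids() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sort | uniq -d
}

# Example use (file name taken from the steps above):
# duplicate_ids Gcitizenii_Spec7_assembly.fasta
```

If the function prints nothing, no identifier is repeated. Any identifier it does print should be made unique before submission.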


        c.   File name: The assembly file must have the extension .fsa, not .fasta. To rename it:


Commands used:

mv Gcitizenii_Spec7_assembly.fasta Gcitizenii_Spec7_assembly.fsa


        d.  File format: The user must decide between submitting the assembly as a FASTA file or as an ASN.1 file. The ASN.1 format is mandatory if the submitter also plans to provide annotation; otherwise, either method can be chosen. We recommend the ASN.1 format because it embeds data in the submission that otherwise (if FASTA is chosen) must be entered manually on the TSA submission page.

 

A TSA submission can contain only one assembly. Thus, if a BioProject contains, for example, three BioSamples (with corresponding SRA files for each), three distinct TSA submissions will be required.