Learning Python: Looping through a FASTA file and choosing which lines to work on or print.


The file dna1.fasta contains many records of the following form, one after another.

>gi|142022655|gb|EQ086233.1|91 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
CTCGCGTTGCAGGCCGGCGTGTCGCGCAACGACGTGTGGGGCCTGACGGGCAGGGAGGATCTCGGCGGCG
CCAACTATGCGGTCTTTCGGCTCGAAAGCCAGTTCCAGACCTCCGACGGCGCGCTGACCGTGCCCGGCTC
CGCATTCAGTTCGCAAGCCTACGTCGGGCTCGGCGGCGACTGGGGGACCGTGACGCTCGGGCGCCAGTTC

Code: open the file, loop over it, and print every line.
# Open the file for reading and print each line in turn.
file_handler = open("dna1.fasta", "r")
for file_contents in file_handler:
    print(file_contents)
file_handler.close()

Output

>gi|142022655|gb|EQ086233.1|527 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
GAGAACCGGGAACCGGAACCATGACAGCCCCGCGCCGGTTTTACGCGAGATAGCCGGAAACGCCGTCCCA
GAGCAGTTTCAATGCGGTCACCGCCAGCAATCCGTAGCAGCTCCGGTAGATCAGGCGCTGGTCCAGCCTG
CCGTGAAGCCGCCAGCCGAACACCACGCCGGCCGGAATGGCAAGCAGGCACACCGCCATCAACGCCCAGA
CGTTCGCGGTCGGCTGCACGATCAGCAGCCACGGCACTGCCTTGATCGCATTGCCCACGGTGAAGAACAG
GCTCGTCGTTCCCGCGTACATCTCCTTGCTGAGGCCAAGCGGCAGCAGATACATCGCGAGCGGCGGCCCG
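
One detail about the loop above: every line read from a file still carries its trailing newline, and print adds another one, so the real output tends to come out double-spaced. Below is a minimal sketch of stripping the newline first. It assumes Python 3, and the in-memory "sample" (abbreviated from the records shown earlier) is only a stand-in for open("dna1.fasta") so that the snippet runs on its own.

```python
import io

# Stand-in for open("dna1.fasta", "r"): a tiny in-memory file with
# abbreviated lines, used only so this sketch runs without the real file.
sample = io.StringIO(
    ">gi|142022655|gb|EQ086233.1|91 marine metagenome\n"
    "CTCGCGTTGCAGGCCGGCGTGTCGCGCAACG\n"
    "CCAACTATGCGGTCTTTCGGCTCGAAAGCCA\n"
)

cleaned = []
for file_contents in sample:
    # Drop the trailing "\n" so print() does not emit an extra blank line.
    cleaned.append(file_contents.rstrip("\n"))

for line in cleaned:
    print(line)
```

With the real file, replacing "sample" by the file handle should behave the same way.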

Code: print only the lines that start with ">gi" (the header lines). Nothing else is printed.

# Print only the header lines, i.e. those that start with ">gi".
file_handler = open("dna1.fasta", "r")
for file_contents in file_handler:
    if file_contents.startswith('>gi'):
        print(file_contents)
file_handler.close()


Output


>gi|142022655|gb|EQ086233.1|91 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
>gi|142022655|gb|EQ086233.1|304 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
>gi|142022655|gb|EQ086233.1|255 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
>gi|142022655|gb|EQ086233.1|45 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
>gi|142022655|gb|EQ086233.1|396 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
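
Printing is only one option. The same startswith test can also collect the header lines into a list, for example to count the records. The sketch below assumes Python 3; the function name header_lines and the abbreviated sample_lines are made up for illustration and are not part of the original script.

```python
def header_lines(lines):
    """Return every line that starts with '>gi', without its trailing newline."""
    return [line.rstrip("\n") for line in lines if line.startswith(">gi")]


# Abbreviated stand-in for the lines of dna1.fasta; a file handle from
# open("dna1.fasta") could be passed in the same way.
sample_lines = [
    ">gi|142022655|gb|EQ086233.1|91 marine metagenome\n",
    "CTCGCGTTGCAGGCCGGCGTG\n",
    ">gi|142022655|gb|EQ086233.1|527 marine metagenome\n",
    "GAGAACCGGGAACCGGAACCA\n",
]

headers = header_lines(sample_lines)
print(len(headers))  # number of header lines found in the sample
for header in headers:
    print(header)
```

Because a file handle is iterated line by line just like a list, the function does not need to know whether it was given a file or a list of strings.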

Code: print every line that does not start with ">gi", i.e. the sequence lines. The keyword "not" negates the condition, so the test is true exactly where the previous one was false.

# Print only the sequence lines: "not" inverts the header test.
file_handler = open("dna1.fasta", "r")
for file_contents in file_handler:
    if not file_contents.startswith('>gi'):
        print(file_contents)
file_handler.close()


Output

AATTACCGTCGCCGCCAAGGAGCAGAGCACGGGGATCGAGCAGGTGAATCAGGCTGTGTCGCAACTCGAC
AATGCGACGCAGCAGAACGCGGCGCTCGTCGAGCAGTCGGCGGCGGCCGCGACATTGCTGCGCGAGCAGG
CCGCCAGGCTCGCGCAGACGGTCGGCGAGTTCAAGCTCGAGGACCGCCGCGCGATGACGTTGCAG
CCGCGAAGGCCGCGTTCGCCACGCCCGCTGCCAACAGCGATCTCGCCGGCACCACGTTGCGTGTCGCAAC
CTACAAGGGTGGCTGGCGCGCGCTGCTGCAGGCGGCCGGGCTGGCCGACACACCGTACCGGATCGACTGG
CGCGAGCTGAACAACGGCGTGCTGCATATCGAGGCGCTCAACGCGGATGCGCTCGACATCGGTTCGGGAA
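
Once the header lines are filtered out, the remaining lines can be worked on rather than just printed. As a rough sketch (Python 3 assumed; sequence_only and the shortened sample_lines are hypothetical names and data for illustration), the sequence lines can be stripped of their newlines and joined into one string:

```python
def sequence_only(lines):
    """Join all non-header lines into one string, newlines removed."""
    parts = []
    for line in lines:
        if not line.startswith(">gi"):
            parts.append(line.strip())
    return "".join(parts)


# Shortened stand-in for the lines of dna1.fasta.
sample_lines = [
    ">gi|142022655|gb|EQ086233.1|91 marine metagenome\n",
    "CTCGCGTTGC\n",
    "CCAACTATGC\n",
    ">gi|142022655|gb|EQ086233.1|527 marine metagenome\n",
    "GAGAACCGGG\n",
]

sequence = sequence_only(sample_lines)
print(sequence)
print(len(sequence))  # total number of bases in the sample
```

Note that this joins the sequence across all records, which is fine for a total base count but not for per-record work.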

Making the whole block more compact: the file can be opened directly in the for statement.

for file_contents in open("dna1.fasta"):
    if file_contents.startswith('>gi'): print(file_contents)
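
The compact form never closes the file explicitly. A with block is a common alternative that closes the file as soon as the loop finishes. The sketch below assumes Python 3; the function name print_headers and the temporary sample file are illustrative stand-ins so the snippet runs without the real dna1.fasta.

```python
import os
import tempfile


def print_headers(path):
    """Print each '>gi' header line in the file at `path`; return how many."""
    count = 0
    # The with block closes the file when the loop ends, even on error.
    with open(path) as file_handler:
        for file_contents in file_handler:
            if file_contents.startswith(">gi"):
                print(file_contents.rstrip("\n"))
                count += 1
    return count


# Write a tiny abbreviated sample to a temporary file so the sketch runs
# without the real dna1.fasta.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">gi|142022655|gb|EQ086233.1|91 marine metagenome\n")
    tmp.write("CTCGCGTTGCAGGCCGGCGTG\n")
    sample_path = tmp.name

found = print_headers(sample_path)
os.remove(sample_path)
```

With the real data, the call would simply be print_headers("dna1.fasta").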