Development of the PyroTRF-ID bioinformatics methodology

The PyroTRF-ID bioinformatics methodology for identification of T-RFs from pyrosequencing datasets was coded in Python for compatibility with the BioLinux open software strategy [42]. PyroTRF-ID runs were performed on the Vital-IT high performance computing center (HPCC) of the Swiss Institute of Bioinformatics (Switzerland). All documentation needed for implementing
the methodology is available at http://bbcf.epfl.ch/PyroTRF-ID/. The flowchart of PyroTRF-ID is depicted in Figure 1, and computational parameters are described hereafter.

Figure 1 Data workflow in the PyroTRF-ID bioinformatics methodology. Experimental pyrosequencing and T-RFLP input datasets (black parallelograms), reference input databases (white parallelograms), data processing (white rectangles), output
files (grey sheets).

Input files

Input 454 tag-encoded pyrosequencing datasets were used either in raw standard flowgram format (.sff) or in pre-denoised fasta format (.fasta), as presented below. Input eT-RFLP datasets were provided in comma-separated-values format (.csv).

Denoising

Sequence denoising was integrated in the PyroTRF-ID workflow, but this feature can be disabled by the user. It requires the independent installation of the QIIME software [43] to decompose and denoise the .sff files containing the whole pyrosequencing information into .sff.txt, .fasta and .qual files. Briefly, the script split_libraries.py was used first to remove tags and primers. Sequences were then filtered based on two criteria: (i) a sequence length
ranging between the minimum (default value of 300 bp) and the maximum amplicon length (500 bp), and (ii) a PHRED sequencing quality score above 20 according to Ewing and Green [44]. Denoising for the removal of classical 454 pyrosequencing flowgram errors such as homopolymers [45, 46] was carried out with the script denoise_wrapper.py. Denoised sequences were processed using the script inflate_denoiser_output.py in order to generate clusters of sequences with at least 97% identity, as conventionally used in the microbial ecology community [47]. Based on computation of statistical distance matrices, one representative sequence (centroid) was selected for each cluster. With this procedure, a new file was created containing cluster centroids inflated according to the original cluster sizes as well as non-clustering sequences (singletons). The denoising step on the HPCC typically took approximately 13 h and 5 h for HighRA and LowRA datasets, respectively.

Mapping

Mapping of sequences was performed using the Burrows-Wheeler Aligner's Smith-Waterman (BWA-SW) alignment algorithm [48] against the Greengenes database [49]. The SW score was used as the mapping quality criterion [50, 51]. The minimum score can be set by the user according to research needs. Sequences with SW scores below 150 were removed from the pipeline.
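
The two threshold-based filters described above (read length and quality before denoising, SW score after mapping) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PyroTRF-ID code itself: the `Read` container and both function names are hypothetical, and judging quality on the mean per-base PHRED score is an assumption, since the text only states that the score must be above 20.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the text above.
MIN_LENGTH = 300    # bp, default minimum sequence length
MAX_LENGTH = 500    # bp, maximum amplicon length
MIN_PHRED = 20      # quality score must be above this value
MIN_SW_SCORE = 150  # sequences with a lower SW score are removed


@dataclass
class Read:
    """Hypothetical container for one pyrosequencing read."""
    name: str
    sequence: str
    qualities: List[int]            # per-base PHRED scores
    sw_score: Optional[int] = None  # filled in after mapping, if mapped


def passes_length_and_quality(read: Read) -> bool:
    """Keep reads within the length window whose quality is above 20.

    Assumption: quality is evaluated as the mean per-base PHRED score.
    """
    if not read.qualities:
        return False
    mean_quality = sum(read.qualities) / len(read.qualities)
    in_window = MIN_LENGTH <= len(read.sequence) <= MAX_LENGTH
    return in_window and mean_quality > MIN_PHRED


def passes_mapping(read: Read, min_sw_score: int = MIN_SW_SCORE) -> bool:
    """Keep mapped reads whose SW score is not below the user-set minimum."""
    return read.sw_score is not None and read.sw_score >= min_sw_score
```

The `min_sw_score` argument mirrors the statement that the mapping criterion can be adjusted by the user; a stricter value such as `passes_mapping(read, min_sw_score=200)` would retain fewer but better-mapped sequences.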