Development of the PyroTRF-ID bioinformatics methodology

The PyroTRF-ID bioinformatics methodology for identification of T-RFs from pyrosequencing datasets was coded in Python for compatibility with the BioLinux open software strategy. PyroTRF-ID runs were performed on the Vital-IT high performance computing center (HPCC) of the Swiss Institute of Bioinformatics (Switzerland). All documentation needed for implementing
the methodology is available at http://bbcf.epfl.ch/PyroTRF-ID/. The PyroTRF-ID workflow is depicted in Figure 1, and the computational parameters are described hereafter.

Figure 1 Data workflow in the PyroTRF-ID bioinformatics methodology. Experimental pyrosequencing and T-RFLP input datasets (black parallelograms), reference input databases (white parallelograms), data processing (white rectangles), output
files (grey sheets).

Input files

Input 454 tag-encoded pyrosequencing datasets were used either in raw standard flowgram format (.sff) or in pre-denoised FASTA format (.fasta), as presented below. Input eT-RFLP datasets were provided in comma-separated values format (.csv).

Denoising

Sequence denoising was integrated in the PyroTRF-ID workflow, but this feature can be disabled by the user. It requires the independent installation of the QIIME software to decompose and denoise the .sff files containing the whole pyrosequencing information into .sff.txt, .fasta and .qual files. Briefly, the script split_libraries.py was used first to remove tags and primers. Sequences were then filtered based on two criteria: (i) a sequence length
ranging between the minimum (default value of 300 bp) and the maximum (500 bp) amplicon length, and (ii) a PHRED sequencing quality score above 20 according to Ewing and Green. Denoising for the removal of classical 454 pyrosequencing flowgram errors such as homopolymers [45, 46] was carried out with the script denoise_wrapper.py. Denoised sequences were processed using the script inflate_denoiser_output.py in order to generate clusters of sequences with at least 97% identity, as conventionally used in the microbial ecology community. Based on computation of statistical distance matrices, one representative sequence (centroid) was selected for each cluster. With this procedure, a new file was created containing cluster centroids inflated according to the original cluster sizes as well as non-clustering sequences (singletons). The denoising step on the HPCC typically lasted approximately 13 h and 5 h for HighRA and LowRA datasets, respectively.

Mapping

Mapping of sequences was performed using the Burrows-Wheeler Aligner's Smith-Waterman (BWA-SW) alignment algorithm against the Greengenes database. The SW score was used as the mapping quality criterion [50, 51]. The score threshold can be set by the user according to research needs. Sequences with SW scores below 150 were removed from the pipeline.
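
The two filtering steps described above (the length and quality filter applied before denoising, and the SW score filter applied after mapping) can be illustrated with a minimal Python sketch. This is not the PyroTRF-ID source code. The function names are hypothetical, the quality criterion is assumed to be a mean PHRED score per read, and the SW score is assumed to be read from the `AS:i:` optional tag of SAM-formatted BWA-SW output. Only the thresholds (300 bp, 500 bp, PHRED 20, SW score 150) are taken from the text.

```python
from typing import Iterable, Iterator, Optional, Sequence

# Thresholds quoted in the text; all are meant to be user-adjustable.
MIN_LENGTH_BP = 300
MAX_LENGTH_BP = 500
MIN_PHRED = 20
MIN_SW_SCORE = 150


def passes_length_and_quality(
    seq: str,
    quals: Sequence[int],
    min_len: int = MIN_LENGTH_BP,
    max_len: int = MAX_LENGTH_BP,
    min_phred: float = MIN_PHRED,
) -> bool:
    """Keep a read if its length is within bounds and its quality is above the cutoff.

    Assumption: "quality score above 20" is interpreted as the mean PHRED
    score of the read being strictly greater than the cutoff.
    """
    if not (min_len <= len(seq) <= max_len):
        return False
    if not quals:
        return False
    return sum(quals) / len(quals) > min_phred


def sw_score(sam_line: str) -> Optional[int]:
    """Return the alignment score stored in the AS:i: tag of a SAM record.

    Assumption: the mapper output is SAM and carries the score in AS:i:.
    Returns None if the tag is absent.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            return int(tag[len("AS:i:"):])
    return None


def filter_by_sw_score(
    sam_lines: Iterable[str], min_score: int = MIN_SW_SCORE
) -> Iterator[str]:
    """Yield SAM header lines unchanged and records scoring at least min_score."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines carry no alignment and are passed through
            continue
        score = sw_score(line)
        if score is not None and score >= min_score:
            yield line
```

Raising `min_score` in this sketch keeps fewer but more confidently mapped reads, while lowering it retains more reads at the cost of weaker matches, which is the trade-off the user-adjustable threshold exposes.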