SAM-Profiler is written entirely in C# and runs under Windows, Linux and Apple Mac OS operating systems (using the Mono runtime library). It has been developed using streaming techniques to keep its memory footprint limited, and it typically runs smoothly on desktop or notebook systems with 4 gigabytes of RAM. To improve performance, our tool makes use of parallel programming techniques. Specifically, two different internal pipelines are implemented, depending on whether a SAM or a BAM file is being processed. For SAM files, a parallel, two-step Producer/Consumer blocking First-In First-Out collection is implemented, in which the producer, dedicated to the acquisition and preprocessing of each read, feeds the consumer, which qualitatively analyzes the preprocessed reads.
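The two-step pipeline described above can be illustrated with a minimal sketch. It is not SAM-Profiler's code: the tool is written in C#, whereas this sketch uses Python, with a bounded `queue.Queue` standing in for the blocking FIFO collection. The function names (`parse_sam_line`, `producer`, `consumer`, `run_two_stage`), the queue capacity and the toy statistic (sum of mapping qualities) are all assumptions made for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel telling the consumer that no more reads will arrive


def parse_sam_line(line):
    """Producer-side preprocessing: split one SAM alignment line into fields."""
    f = line.rstrip("\n").split("\t")
    return {"qname": f[0], "flag": int(f[1]), "rname": f[2],
            "pos": int(f[3]), "mapq": int(f[4]), "seq": f[9], "qual": f[10]}


def producer(lines, out_q):
    """Acquire and preprocess each read, then hand it to the consumer."""
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and headers
            continue
        out_q.put(parse_sam_line(line))  # blocks while the queue is full
    out_q.put(_DONE)


def consumer(in_q, stats):
    """Analyze preprocessed reads (here: a toy mapping-quality tally)."""
    while True:
        read = in_q.get()  # blocks while the queue is empty
        if read is _DONE:
            break
        stats["reads"] += 1
        stats["mapq_sum"] += read["mapq"]


def run_two_stage(lines, capacity=1024):
    """Run producer and consumer concurrently over a bounded FIFO queue."""
    q = queue.Queue(maxsize=capacity)  # bounded, so memory use stays limited
    stats = {"reads": 0, "mapq_sum": 0}
    t_prod = threading.Thread(target=producer, args=(lines, q))
    t_cons = threading.Thread(target=consumer, args=(q, stats))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return stats
```

Because the queue is bounded, a fast producer is throttled by a slow consumer instead of buffering the whole file, which is consistent with the streaming, limited-memory design described above. Passing an open file object as `lines` would stream the input line by line.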
Although BAM files are preferable to SAM files for several reasons, among them the reduced file size and the ability to quickly extract reads within a specified position range, their use is computationally intensive, mostly because of the overhead required to decompress the BGZF/gzip blocks and, in paired-read experiments, to run the read-matching algorithms. To address this problem, SAM-Profiler processes BAM files with a three-step Producer/Consumer blocking collection, in which a first thread processes the BAM/BAI files and extracts individual reads, instantiating dedicated BamRead reference objects. The BamRead objects feed a second thread, which is mainly devoted to preprocessing and, in the case of paired-end reads, to the read-matching algorithms. Finally, a third thread qualitatively analyzes the preprocessed, matched reads.
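A three-step pipeline of this kind can be sketched as follows. This is again a Python analogue under stated assumptions, not the tool's C# implementation. BGZF decompression and BAI handling are not shown: the first stage simply wraps already-decoded `(name, pos, is_paired)` tuples. The `BamRead` fields, the name-based mate matching and the toy statistic in the third stage are hypothetical.

```python
import queue
import threading
from dataclasses import dataclass

_DONE = object()  # end-of-stream sentinel passed from stage to stage


@dataclass
class BamRead:
    """Hypothetical stand-in for the BamRead objects mentioned in the text."""
    name: str
    pos: int
    is_paired: bool


def extract(records, out_q):
    """Stage 1: stand-in for BAM decoding; wraps raw tuples in BamRead."""
    for name, pos, is_paired in records:
        out_q.put(BamRead(name, pos, is_paired))
    out_q.put(_DONE)


def match(in_q, out_q):
    """Stage 2: pair mates by read name; pass single-end reads through."""
    pending = {}  # first-seen mates waiting for their partner
    while (read := in_q.get()) is not _DONE:
        if not read.is_paired:
            out_q.put((read, None))
        elif read.name in pending:
            out_q.put((pending.pop(read.name), read))
        else:
            pending[read.name] = read
    out_q.put(_DONE)  # mates still pending here are orphans and are dropped


def analyze(in_q, result):
    """Stage 3: toy analysis; record the distance between mate positions."""
    while (item := in_q.get()) is not _DONE:
        first, mate = item
        if mate is None:
            result["single"] += 1
        else:
            result["pairs"] += 1
            result["spans"].append(abs(mate.pos - first.pos))


def run_three_stage(records, capacity=256):
    """Chain the three stages with two bounded FIFO queues."""
    q1, q2 = queue.Queue(capacity), queue.Queue(capacity)
    result = {"single": 0, "pairs": 0, "spans": []}
    threads = [threading.Thread(target=extract, args=(records, q1)),
               threading.Thread(target=match, args=(q1, q2)),
               threading.Thread(target=analyze, args=(q2, result))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Giving extraction its own stage means the expensive decoding can overlap with matching and analysis, which is the motivation given in the text for the additional step.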
SAM-Profiler includes several quality reporting algorithms, namely: mean and per-chromosome Read Quality, Mapping Quality, Paired-end Read Quality, Duplicate Analysis, Exonic, Intronic and Intergenic Coverage, mean Exonic fold-coverage, Fragment Size Distribution, Mapping Profile, Single- or Paired-Read Mismatch Distribution, CG Distribution and Nucleotide Distribution. The output file is then processed to generate the graphical report.
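To make two of these metrics concrete, the sketch below computes a per-chromosome mean read quality and a per-read G+C fraction. The exact definitions used by SAM-Profiler are not given here, so this is only one plausible reading: quality strings are assumed to be Phred+33 encoded (as in the SAM format), the per-chromosome value is assumed to be the mean of per-read mean qualities, and "CG Distribution" is assumed to refer to the G+C fraction of each read. All function names are hypothetical.

```python
from collections import defaultdict


def mean_phred(qual, offset=33):
    """Mean Phred score of a non-empty Phred+33 quality string.

    Assumes quality is available (in SAM, "*" means it is not stored).
    """
    return sum(ord(c) - offset for c in qual) / len(qual)


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty read sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def per_chromosome_quality(reads):
    """Average the per-read mean quality for each chromosome.

    reads: iterable of (chromosome, quality_string) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # chromosome -> [sum, count]
    for chrom, qual in reads:
        totals[chrom][0] += mean_phred(qual)
        totals[chrom][1] += 1
    return {chrom: s / n for chrom, (s, n) in totals.items()}
```

Both functions need only one read at a time plus small running totals, so they fit a streaming consumer that never holds the whole alignment file in memory.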