Quick FASTQ File Parsing Via Memory Mapping In C/C++

I recently needed to parse 8GiB+ .fastq text files quickly to calculate a simple statistic of genomic data. My initial “pfastqcount” implementation in Ruby worked fine, but with many files to process it took longer than I had hoped and consumed an alarming amount of CPU. I ended up reimplementing the pfastqcount command-line program in C. It takes one or more .fastq files, memory maps them, and computes the statistic. Simply dropping my algorithm down to raw C significantly sped up the process and reduced CPU usage, as you would expect when moving away from an interpreted language. If any of you bioinformaticians need to implement a FASTQ data processing algorithm in C, I encourage you to fork the project and use it as a template. The project is Apache 2.0 licensed for your convenience and publicly available on GitHub.
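For readers who want a feel for the approach before opening the repository, here is a minimal sketch of memory mapping a FASTQ file on a POSIX system and walking it line by line. It is not the pfastqcount source: the names fastq_stats and fastq_count_mmap are hypothetical, and the statistic (record count and total bases) is only a stand-in for whatever you actually need to compute. It also assumes strict four-line records with no wrapped sequence lines, and a 64-bit build so that multi-gigabyte files fit in the address space.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result type: a placeholder statistic, not pfastqcount's. */
struct fastq_stats {
    size_t records; /* number of header lines starting with '@' */
    size_t bases;   /* total characters on sequence lines */
};

/* Maps the file read-only and scans it once.
 * Assumes each record is exactly four lines: header, sequence, '+', quality.
 * Returns 0 on success, -1 if the file cannot be opened or mapped. */
int fastq_count_mmap(const char *path, struct fastq_stats *out)
{
    out->records = 0;
    out->bases = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat sb;
    if (fstat(fd, &sb) < 0) {
        close(fd);
        return -1;
    }
    if (sb.st_size == 0) { /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t len = (size_t)sb.st_size;
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED)
        return -1;

    const char *p = data;
    const char *end = data + len;
    size_t line = 0;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *eol = nl ? nl : end;
        size_t n = (size_t)(eol - p);
        if (n > 0 && eol[-1] == '\r') /* tolerate CRLF line endings */
            n--;

        if (line % 4 == 0) {
            if (n > 0 && *p == '@')
                out->records++;
        } else if (line % 4 == 1) {
            out->bases += n;
        }

        line++;
        p = nl ? nl + 1 : end;
    }

    munmap(data, len);
    return 0;
}
```

To use it, call fastq_count_mmap(path, &stats) once per file named on the command line and print the fields of stats. Because the kernel pages the file in on demand, the scan never copies the data into a user-space buffer, which is a large part of why this style tends to be fast on big inputs.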

2 thoughts on “Quick FASTQ File Parsing Via Memory Mapping In C/C++”

  1. Hi Preston- I was a student of yours in CST200 fall ’10; was wondering if (this has nothing to do with your post here) you might have any references to OMR Java libraries? I’ve since graduated and work for a SW dev co.- I’m researching a potential project that will involve the reading of a “play slip” and I’d like to assemble the job in Java. So far I’ve found zilch where OMR libraries are concerned. any ideas? (It’s looking like C# will be the viable alternative here)
