Skip to content

Posts tagged ‘unix’

FASTerQ to Count Matrices for Single-Cell RNA-Seq

July 9, 2019

Sina Booeshaghi

tl;dr Stream reads from the ENA to generate count matrices without ever saving the FASTQs using kallisto | bustools; faster than downloading the FASTQs.


The UNIX operating system was originally developed at Bell Labs for research purposes in the 1960s and has developed today into operating systems such as Mac OS and GNU/Linux. One of the “maxims” that took hold among the initial builders and users of UNIX systems was the following:

Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.

UNIX Time-Sharing System: Forward. McIlroy, M.D.; Pinson, E.N.; Tague, B.A.

This maxim is extremely powerful and thanks to Páll Melsted’s foresight and execution, was adhered to during the development of kallisto (Bray et al.) and now the kallisto | bustools single-cell pre-processing workflow (Melsted, Booeshaghi et al.).

Pre-processing single-cell RNA-seq starts with downloading large FASTQ files (Figure 1) and then aligning the reads using your favorite alignment program. Anyone who has lived through the age of Limewire knows that downloading, while much faster nowadays, is still incredibly slow for gigabyte sized FASTQ files. 


Figure 1: The size of the kallisto | bustools files (y-axis) vs. the size of the FASTQ files (x-axis) for 21 single-cell RNA-Seq datasets.

Frustrated by having to wait for my reads to download before I could process them and inspired by the design of bustools I sought a way to download and stream reads into the kallisto | bustools workflow from a download link. This would allow me to (a) never have to write the FASTQ files to disk, thereby conserving disk space and (b) obtain gene count matrices in the time it takes to download FASTQ files. 

After a bit of Google-fu I came across a UNIX object called a named-pipe. A named-pipe is essentially a temporary “file” that data can be written to and read from in a FIFO (first in first out) manner. A neat quirk of named-pipes is that data that is read from it is removed from it. So if we read data from the named-pipe at the same speed that we write to it (or faster), the disk footprint of the named pipe is effectively zero. Making named-pipes is easy with the mkfifo command.

Now all that we have to do is figure out a way to download data and stream them to the named pipe. That is where curl comes in to play. The curl command allows users to download files from a link and in the typical UNIX fashion curl allows one to stream the data to a file as it is being downloaded.

A worked example

Using curl and mkfifo we can stream reads from a download link directly to kallisto bus and subsequently to bustools. Here is an example you can try for yourself on your UNIX-based terminal. I am following the kallisto | bustools Getting Started tutorial processing 8,860,361 reads from single mouse retinal cells SRR8599150 (Koren et al., 2019).

Once you have installed the necessary programs (kallisto and bustools) then you can proceed. First install the necessary files (index, whitelist, and transcripts_to_genes).

## Download necessary files
$ curl -L https://github.com/BUStools/getting_started/releases/download/getting_started/Mus_musculus.GRCm38.cdna.all.idx.gz -o Mus_musculus.GRCm38.cdna.all.idx.gz
$ curl -L https://github.com/BUStools/getting_started/releases/download/getting_started/10xv2_whitelist.txt -o 10xv2_whitelist.txt
$ curl -L https://github.com/BUStools/getting_started/releases/download/getting_started/transcripts_to_genes.txt -o transcripts_to_genes.txt
$ gunzip Mus_musculus.GRCm38.cdna.all.idx.gz

Now you can stream reads. The code needed to do so is a single line which you can past onto the command line. Below we provide both the line, and the “expanded” line for clarity.

## One-liner: Stream to Count Matrices
$ mkfifo R1.gz R2.gz; curl -Ls https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz  > R1.gz & curl -Ls https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz  > R2.gz & kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -t 4 -o bus_out/ R1.gz R2.gz; cd bus_out/; mkdir tmp/; bustools correct -w ../10xv2_whitelist.txt -p output.bus | bustools sort -t 4 -T tmp/ -p - | bustools count -o genes -g ../transcripts_to_genes.txt -e matrix.ec -t transcripts.txt --genecounts -

## Expanded: Stream to Count Matrices
$ mkfifo R1.gz R2.gz
$ curl -Ls https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz  > R1.gz &
$ curl -Ls https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz  > R2.gz &
$ kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -t 4 -o bus_out/ R1.gz R2.gz
$ cd bus_out/
$ mkdir tmp/
$ bustools correct -w ../10xv2_whitelist.txt -p output.bus | bustools sort -t 4 -T tmp/ -p - | bustools count -o genes -g ../transcripts_to_genes.txt -e matrix.ec -t transcripts.txt --genecounts -

I went ahead and made two Snakemakes to compare the runtimes of (a) stream processing: streaming the reads as explained above and (b) download processing: downloading FASTQs then pre-processing. The streaming workflow processes data faster than it takes to download the data. Here are the raw numbers:

DATASET: SRR8599150 (8,860,361 reads)

STREAM + RUN
download time: 0 sec
kallisto time: 93 sec
bustools time: 17 sec
Total: 110 sec

DOWNLOAD + RUN
download time: 112 sec
kallisto time: 41 sec
bustools time: 17 sec
Total: 170 sec

Does Usain Bolt ever really touch the ground?

Streaming Speedup: 35% faster

This workflow is constant disk space and constant memory. This means you have a smaller digital footprint and you have your count matrices made before your friend has even downloaded their data.

Comparison with Cell Ranger

To understand the significance of this workflow I examined the disk usage of Cell Ranger using the Snakemake benchmark. Here are the results for 20 datasets ordered by increasing number of reads:

Dataset kallisto | bustools [Gb]Cell Ranger [Gb]
SRR8599150_v20.00317.90
SRR8611943_v20.4129.16
SRR6998058_v20.6876.25
hgmm1k_v31.26113.15
pbmc1k_v31.20108.46
hgmm1k_v21.78126.63
heart1k_v31.80136.19
SRR8206317_v22.03154.11
heart1k_v22.05129.82
SRR8524760_v21.96158.23
SRR7299563_v22.01181.70
SRR8513910_v23.94271.85
SRR6956073_v23.29199.79
SRR8257100_v25.32330.09
SRR8327928_v23.64328.36
EMTAB7320_v28.22698.53
neuron10k_v36.99577.88
SRR8639063_v210.56617.51
pbmc10k_v311.281,022.07
hgmm10k_v315.731,310.56

The number indicates the number of bytes written to disk. In the case of kallisto, it is the size of the BUS file and accessory files only, since no temporary files are written to disk. Cell Ranger may use well over a terabyte of disk space for pre-processing FASTQs.

Stream arbitrarily large datasets from anywhere

This workflow is easily extended to reads that are hosted on the ENA. An exercise for the reader: stream reads from this ENA dataset to make a count matrix. Here is the solution. Even though the dataset is composed of 505,958,821 reads you can produce count matrices on your laptop in a few hours!