vsearch

Versatile open-source tool for microbiome analysis

NAME

sff — a binary file format used to encode pyrosequencing results

DESCRIPTION

The standard flowgram format (sff) was designed by 454 Life Sciences, the Whitehead Institute for Biomedical Research, and the Sanger Institute. It was used to store reads produced by the Roche 454 sequencing platforms and by early versions of the Ion Torrent PGM sequencing platforms.

Original NCBI documentation available at:

https://www.ncbi.nlm.nih.gov/Traces/trace.cgi?view=doc_formats#sffhere

If unavailable, it can be recovered from https://web.archive.org. We provide this modified version for preservation purposes.

The proposed SFF file format (version 1) is a container file for storing one or many 454 reads. 454 reads differ from standard sequencing reads in that the 454 data does not provide individual base measurements from which basecalls can be derived. Instead, it provides measurements that estimate the length of the next homopolymer stretch in the sequence (i.e., in “AAATGG”, “AAA” is a 3-mer stretch of A’s, “T” is a 1-mer stretch of T’s and “GG” is a 2-mer stretch of G’s). A basecalled sequence is then derived by converting each estimate into a homopolymer stretch of that length and concatenating the homopolymers.
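The conversion from homopolymer estimates to a basecalled sequence can be sketched in a few lines of Python. This is only an illustration: the function name basecall, the flow order "TACG", and the use of already-rounded integer estimates are assumptions made for the example, not part of the format description.

```python
def basecall(flow_chars: str, estimates: list[int]) -> str:
    """Concatenate one homopolymer stretch per flow.

    flow_chars gives the nucleotide tested in each flow, and estimates
    gives the (rounded) stretch length measured for that flow.
    An estimate of 0 means the nucleotide was not incorporated.
    """
    return "".join(base * n for base, n in zip(flow_chars, estimates))


# Hypothetical flow order TACG, cycled twice.
# "AAATGG" is three A's, one T, and two G's.
print(basecall("TACGTACG", [0, 3, 0, 0, 1, 0, 0, 2]))  # AAATGG
```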

The file format consists of different sections: a common header section occurring once in the file, then for each read stored in the file, a read header section and a read data section. An optional index section can also be present after the common header section, or anywhere among the reads, or at the end of the file.

The data in each section consists of a combination of numeric and character data, where the specific fields for each section are defined below. The sections adhere to the following rules:

Common Header Section

The common header section consists of the following fields:

 magic_number               uint32_t
 version                    char[4]
 index_offset               uint64_t
 index_length               uint32_t
 number_of_reads            uint32_t
 header_length              uint16_t
 key_length                 uint16_t
 number_of_flows_per_read   uint16_t
 flowgram_format_code       uint8_t
 flow_chars                 char[number_of_flows_per_read]
 key_sequence               char[key_length]
 eight_byte_padding         uint8_t[*]

where the following properties are true for these fields:
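As an illustration, the field table above maps onto a fixed 31-byte block followed by two variable-length strings. The Python sketch below assumes big-endian byte order with no alignment padding between fields. The helper name parse_common_header, and the magic value and version bytes used to build the demo header, are assumptions made for the example rather than values taken from the table.

```python
import struct

# Fixed-size fields, in table order: magic_number, version, index_offset,
# index_length, number_of_reads, header_length, key_length,
# number_of_flows_per_read, flowgram_format_code.
# ">" selects big-endian with no alignment padding (assumed here).
FIXED = struct.Struct(">I4sQIIHHHB")  # 4+4+8+4+4+2+2+2+1 = 31 bytes


def parse_common_header(buf: bytes) -> dict:
    (magic, version, index_offset, index_length, number_of_reads,
     header_length, key_length, n_flows, format_code) = FIXED.unpack_from(buf, 0)
    pos = FIXED.size
    flow_chars = buf[pos:pos + n_flows].decode("ascii")
    pos += n_flows
    key_sequence = buf[pos:pos + key_length].decode("ascii")
    return {
        "magic_number": magic,
        "version": version,
        "index_offset": index_offset,
        "index_length": index_length,
        "number_of_reads": number_of_reads,
        "header_length": header_length,
        "flowgram_format_code": format_code,
        "flow_chars": flow_chars,
        "key_sequence": key_sequence,
    }


# Demo header: 8 flows and a 4-base key give 31 + 8 + 4 = 43 bytes,
# padded with zero bytes to 48. Magic and version are assumed values.
raw = FIXED.pack(0x2E736666, b"\x00\x00\x00\x01", 0, 0, 1, 48, 4, 8, 1)
raw += b"TACGTACG" + b"TCAG"
raw += bytes(48 - len(raw))  # eight_byte_padding

header = parse_common_header(raw)
print(header["flow_chars"], header["key_sequence"], header["header_length"])
```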

Read Header Section

The rest of the file contains the information about the reads, namely number_of_reads entries consisting of read header and read data sections. The read header section consists of the following fields:

 read_header_length         uint16_t
 name_length                uint16_t
 number_of_bases            uint32_t
 clip_qual_left             uint16_t
 clip_qual_right            uint16_t
 clip_adapter_left          uint16_t
 clip_adapter_right         uint16_t
 name                       char[name_length]
 eight_byte_padding         uint8_t[*]

where these fields have the following properties:
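The read header layout listed above can be decoded in the same way. The sketch below assumes big-endian byte order and assumes that read_header_length counts the whole section, including the padding bytes. The helper name parse_read_header and the read name "read1" are hypothetical.

```python
import struct

# Fixed-size fields, in table order: read_header_length, name_length,
# number_of_bases, clip_qual_left, clip_qual_right,
# clip_adapter_left, clip_adapter_right.
READ_FIXED = struct.Struct(">HHIHHHH")  # 2+2+4+2+2+2+2 = 16 bytes


def parse_read_header(buf: bytes, offset: int = 0):
    """Return (fields, offset of the read data section that follows)."""
    (read_header_length, name_length, number_of_bases,
     clip_qual_left, clip_qual_right,
     clip_adapter_left, clip_adapter_right) = READ_FIXED.unpack_from(buf, offset)
    start = offset + READ_FIXED.size
    fields = {
        "name": buf[start:start + name_length].decode("ascii"),
        "number_of_bases": number_of_bases,
        "clip_qual": (clip_qual_left, clip_qual_right),
        "clip_adapter": (clip_adapter_left, clip_adapter_right),
    }
    # Assumption: read_header_length already includes eight_byte_padding.
    return fields, offset + read_header_length


# Demo: a 5-character name gives 16 + 5 = 21 bytes, padded to 24.
name = b"read1"
unpadded = READ_FIXED.size + len(name)
padded = unpadded + (-unpadded) % 8
raw = READ_FIXED.pack(padded, len(name), 6, 1, 6, 0, 0) + name
raw = raw.ljust(padded, b"\x00")

fields, data_offset = parse_read_header(raw)
print(fields["name"], fields["number_of_bases"], data_offset)  # read1 6 24
```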

Read Data Section

The read data section consists of the following fields:

 flowgram_values            uint*_t[number_of_flows]
 flow_index_per_base        uint8_t[number_of_bases]
 bases                      char[number_of_bases]
 quality_scores             uint8_t[number_of_bases]
 eight_byte_padding         uint8_t[*]

where the fields have the following properties:
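A read data section can be unpacked once number_of_flows_per_read (from the common header) and number_of_bases (from the read header) are known. The sketch below assumes that each flowgram value is a big-endian uint16_t and that the padding brings the section length to a multiple of 8. The helper name parse_read_data and all demo values are made up for illustration.

```python
import struct


def parse_read_data(buf: bytes, offset: int, n_flows: int, n_bases: int):
    """Return (fields, offset of the section that follows)."""
    # Assumption: one big-endian uint16 per flow.
    flowgram_values = struct.unpack_from(f">{n_flows}H", buf, offset)
    pos = offset + 2 * n_flows
    flow_index_per_base = list(buf[pos:pos + n_bases])
    pos += n_bases
    bases = buf[pos:pos + n_bases].decode("ascii")
    pos += n_bases
    quality_scores = list(buf[pos:pos + n_bases])
    pos += n_bases
    length = pos - offset
    length += (-length) % 8  # skip eight_byte_padding
    fields = {
        "flowgram_values": flowgram_values,
        "flow_index_per_base": flow_index_per_base,
        "bases": bases,
        "quality_scores": quality_scores,
    }
    return fields, offset + length


# Demo values only: 8 flows and 6 bases give 16 + 3 * 6 = 34 bytes,
# padded with zero bytes to 40.
flows = [0, 300, 0, 0, 100, 0, 0, 200]
raw = struct.pack(">8H", *flows)
raw += bytes([2, 0, 0, 3, 0, 3]) + b"AAATGG" + bytes([30] * 6)
raw += bytes(-len(raw) % 8)

data, next_offset = parse_read_data(raw, 0, n_flows=8, n_bases=6)
print(data["bases"], next_offset)  # AAATGG 40
```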

Index Section (optional)

If an index is included in the file, the index_offset and index_length values in the common header should point to the section of the file containing the index. Note that the index section can be placed after the common header section, or anywhere among the reads, or at the end of the file. To support different indexing methods, the index section should begin with the following two fields:

 index_magic_number         uint32_t
 index_version              char[4]

followed by the actual index data (n bytes) and an eight_byte_padding field, so that the total length of the index section is divisible by 8. The format of the rest of the index data is specific to the indexing method used. Currently, there are no officially supported indexing formats. The index_length given in the common header should include the bytes of these fields and the padding.
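The arithmetic behind index_length can be written out as a small Python sketch. The function name index_section_length is hypothetical; the computation only restates the rule above (two 4-byte fields, n bytes of index data, then padding to a multiple of 8).

```python
def index_section_length(n_data_bytes: int) -> int:
    """Total index section length for n bytes of index data."""
    unpadded = 4 + 4 + n_data_bytes       # index_magic_number + index_version + data
    return unpadded + (-unpadded) % 8     # add eight_byte_padding


for n in (0, 1, 8, 13):
    print(n, index_section_length(n))
# 0 8
# 1 16
# 8 16
# 13 24
```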

Computing Lengths and Scanning the File

The length of each read’s section varies, because accession numbers and nucleotide sequences differ in length. However, the flow, name, and base lengths given in the common and read headers can be used to scan the file, either accessing each read’s information or skipping read sections. The following pseudocode gives an example method for scanning the file and accessing each read’s data:
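One way to turn this scanning idea into runnable form is the Python sketch below. It is not the original pseudocode. It assumes big-endian fields, 2-byte flowgram values, section lengths rounded up to a multiple of 8, and that an index located among the reads can be recognised by comparing the current position with index_offset. The names pad8, scan_reads and make_read, and the in-memory demo file, are hypothetical.

```python
import struct


def pad8(n: int) -> int:
    """Round n up to the next multiple of 8."""
    return n + (-n) % 8


def scan_reads(buf: bytes):
    """Yield (name, number_of_bases, data_offset) for each read in buf."""
    # Skip magic_number and version (8 bytes), then read index_offset,
    # index_length, number_of_reads, header_length, key_length and
    # number_of_flows_per_read from the common header.
    (index_offset, index_length, number_of_reads,
     header_length, _key_length, n_flows) = struct.unpack_from(">QIIHHH", buf, 8)
    pos = header_length
    for _ in range(number_of_reads):
        if index_length and pos == index_offset:
            pos += index_length  # skip an index stored among the reads
        read_header_length, name_length, n_bases = struct.unpack_from(">HHI", buf, pos)
        name = buf[pos + 16:pos + 16 + name_length].decode("ascii")
        data_offset = pos + read_header_length
        yield name, n_bases, data_offset
        # 2 bytes per flow value (assumed) plus three 1-byte arrays per base.
        pos = data_offset + pad8(2 * n_flows + 3 * n_bases)


def make_read(name: bytes, n_flows: int, n_bases: int) -> bytes:
    """Build a zero-filled read (header + data) for the demo."""
    header_len = pad8(16 + len(name))
    header = struct.pack(">HHIHHHH", header_len, len(name), n_bases, 0, 0, 0, 0)
    header = (header + name).ljust(header_len, b"\x00")
    return header + bytes(pad8(2 * n_flows + 3 * n_bases))


# Demo file: no index, 2 reads, 4 flows per read, 4-base key.
# Common header: 31 + 4 + 4 = 39 bytes, padded to 40.
common = struct.pack(">I4sQIIHHHB", 0x2E736666, b"\x00\x00\x00\x01",
                     0, 0, 2, 40, 4, 4, 1) + b"TACG" + b"TCAG"
sff = common.ljust(40, b"\x00") + make_read(b"r1", 4, 5) + make_read(b"read2", 4, 10)

for entry in scan_reads(sff):
    print(entry)
# ('r1', 5, 64)
# ('read2', 10, 112)
```

Because only the length fields are decoded, a reader can skip any read it does not need without touching the flowgram, base, or quality arrays.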

SEE ALSO

vsearch-sff_convert(1), vsearch-fasta(5), vsearch-fastq(5)

CITATION

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584 doi: 10.7717/peerj.2584

REPORTING BUGS

Submit suggestions and bug reports at https://github.com/torognes/vsearch/issues, send a pull request on https://github.com/torognes/vsearch, or compose a friendly or curmudgeonly e-mail to Torbjørn Rognes (torognes@ifi.uio.no).

AVAILABILITY

Source code and binaries are available at https://github.com/torognes/vsearch.

COPYRIGHT

Copyright (C) 2014-2026, Torbjørn Rognes, Frédéric Mahé and Tomás Flouri

All rights reserved.

Contact: Torbjørn Rognes torognes@ifi.uio.no, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway

This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.

GNU General Public License version 3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

The BSD 2-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ACKNOWLEDGMENTS

We would like to thank the authors of the following projects for making their source code available: