NAME
sam — a text-based format for representing sequence alignments
DESCRIPTION
SAM (Sequence Alignment/Map) is a tab-delimited text format for storing the alignment of one or more query sequences against one or more reference sequences. It was designed by the 1000 Genomes Project and is maintained by the hts-specs working group:
https://github.com/samtools/hts-specs
The format is defined in the SAMv1 specification, and optional alignment tags are listed in the SAMtags specification. vsearch produces SAM version 1.0 and writes each record on a single line with fields separated by a single horizontal tab character (ASCII 9). Lines are terminated by a single line feed character (ASCII 10).
A SAM file produced by vsearch consists of an optional header followed
by zero or more alignment records. The --samout filename option
selects SAM as the output format for the --allpairs_global,
--cluster_fast, --cluster_size, --cluster_smallmem,
--cluster_unoise, --search_exact, and --usearch_global commands.
The --samheader option requests that the optional header be written;
by default no header is produced.
vsearch does not read SAM files and does not produce the binary BAM or
CRAM formats. Conversion between SAM, BAM, and CRAM can be performed
with external tools such as samtools(1).
Header
When --samheader is given, vsearch writes a header consisting of one
@HD line, one @SQ line per reference sequence, and one @PG line,
in that order. All header lines start with the character @ immediately
followed by a two-letter record type code, and contain tab-separated
TAG:value fields:
@HD VN:1.0 SO:unsorted GO:query
@SQ SN:name LN:length M5:md5 UR:file:dbname
@PG ID:vsearch VN:version CL:command-line
@HD
File-level metadata. VN is the SAM format version (1.0). SO is the
sort order of alignment records (unsorted). GO is the grouping of
alignment records (query: records for the same query are consecutive).
@SQ
Reference sequence dictionary. One @SQ line is emitted per sequence in
the reference database, in database order. SN is the reference name,
taken from the fasta header up to the first white space (or the entire
header line with --notrunclabels). LN is the reference length in
residues. M5 is the MD5 digest of the reference sequence, as 32
lowercase hexadecimal digits. UR is the location of the reference,
written as file: followed by the database filename as given on the
command line.
@PG
Program record. ID is the program name (vsearch). VN is the
vsearch version. CL is the full command line used to invoke vsearch.
vsearch does not emit @RG (read group) or @CO (comment) header
lines.
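The header layout described above can be read with a few lines of Python. This is a minimal sketch, not part of vsearch: the function names `parse_header_line` and `reference_lengths` are hypothetical, and the code assumes header lines are tab-separated as described in this manual.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {TAG: value})."""
    if not line.startswith("@"):
        raise ValueError("not a header line: %r" % line)
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # "HD", "SQ", or "PG"
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only: values such as
        # UR:file:refs.fasta contain colons themselves.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


def reference_lengths(lines):
    """Collect {SN: LN} from the @SQ lines that precede the records."""
    lengths = {}
    for line in lines:
        if not line.startswith("@"):
            break  # first alignment record: the header is over
        record_type, tags = parse_header_line(line)
        if record_type == "SQ":
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths
```

Splitting on the first colon only keeps the CL and UR values intact, since both may contain further colons.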
Alignment records
Each alignment record occupies a single line with eleven mandatory tab-separated fields followed by zero or more optional tags:
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL [TAG:TYPE:VALUE ...]
QNAME
Query template name. The query label as it appears in the input fasta or fastq file.
FLAG
Bitwise flag, written as a decimal integer. vsearch only sets the following bits:
- 0x004 (4): the query is unmapped (no alignment was produced). Set when the query has no hit and --output_no_hits is in effect.
- 0x010 (16): the query aligned on the reverse strand of the reference. Set for reverse-strand hits produced with --strand both.
- 0x100 (256): the record is a secondary alignment. Set for the second and following hits reported for a given query when --maxaccepts and --top_hits_only allow multiple hits.
All other bits are always 0; in particular, vsearch never sets the paired-end bits (0x001, 0x002, 0x040, 0x080), the mate bits (0x008, 0x020), the quality-control bit (0x200), the duplicate bit (0x400), or the supplementary bit (0x800).
RNAME
Reference sequence name. The target label as it appears in the reference database, matching the SN value of the corresponding @SQ header line. * for unmapped queries.
POS
1-based leftmost mapping position of the query on the reference. Always 1 for mapped queries (vsearch performs global alignment and reports the alignment relative to the start of the reference). 0 for unmapped queries.
MAPQ
Mapping quality. Always 255, the value defined by the SAM specification to mean mapping quality not available.
CIGAR
Alignment operations in CIGAR format, with the operation letters M (match or mismatch), I (insertion in the query relative to the reference), and D (deletion in the query relative to the reference). vsearch always emits explicit run-lengths and does not use the = and X operations. * for unmapped queries. See vsearch-cigar(5).
RNEXT
Reference sequence name of the mate. Always * (vsearch does not output paired-end information).
PNEXT
Position of the mate. Always 0.
TLEN
Observed template length. Always 0.
SEQ
Query segment. For forward-strand hits, the original query sequence. For reverse-strand hits (flag bit 0x010 set), the reverse complement of the original query sequence. For unmapped queries, the original query sequence. vsearch never uses * in this field.
QUAL
ASCII-encoded Phred quality scores (base 33) for each residue in SEQ. Always * (not available), even when the input is a fastq file.
For each mapped query, vsearch emits one alignment record per reported
hit. The number and ordering of hits are controlled by --maxaccepts,
--maxrejects, --top_hits_only, and other search-related options. The
first record of a group has flag bit 0x100 cleared and represents the
primary alignment; subsequent records for the same query have flag bit
0x100 set and represent secondary alignments.
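As an illustration of the record layout and of the three flag bits that vsearch sets, the following sketch splits one alignment line into its eleven mandatory fields and decodes FLAG. The names `SamRecord`, `parse_record`, and `describe_flag` are hypothetical helpers introduced for this example; they are not provided by vsearch or samtools.

```python
from typing import NamedTuple

# Flag bits that vsearch sets, as listed above.
UNMAPPED = 0x004
REVERSE = 0x010
SECONDARY = 0x100


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list  # raw optional fields, e.g. ["AS:i:92", "YT:Z:UU"]


def parse_record(line):
    """Split one tab-separated alignment line into a SamRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 fields, got %d" % len(f))
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def describe_flag(flag):
    """Decode the three flag bits used by vsearch into booleans."""
    return {
        "unmapped": bool(flag & UNMAPPED),
        "reverse": bool(flag & REVERSE),
        "secondary": bool(flag & SECONDARY),
    }
```

For a record with FLAG 16, `describe_flag` reports only the reverse-strand bit as set; a primary forward-strand hit has FLAG 0 and all three values false.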
Optional tags
After the eleven mandatory fields, vsearch appends the following
optional tags, in order, for each mapped query. Each tag is written as
TAG:TYPE:VALUE and separated from the previous field by a tab.
Unmapped queries carry no optional tags.
AS:i:integer
Alignment score. vsearch stores the percent identity of the alignment,
rounded to the nearest integer. This differs from the SAMtags
recommendation, which describes AS as an aligner-specific alignment
score; vsearch follows the convention established by usearch.
XN:i:0
Next best alignment score. Always 0 (not computed by vsearch).
XM:i:integer
Number of mismatches in the alignment, excluding terminal gaps.
XO:i:integer
Number of gap opens in the alignment, excluding terminal gaps.
XG:i:integer
Total number of gap extensions in the alignment, excluding terminal
gaps. Equivalent to the total length of internal gaps.
NM:i:integer
Edit distance to the reference, defined as the sum of XM and XG
(mismatches plus total internal gap length).
MD:Z:string
Mismatching positions and deleted reference bases, encoded as described
in the SAMtags specification. The MD string allows the reference
sequence within the aligned region to be reconstructed from SEQ and
CIGAR. It consists of a sequence of decimal numbers (counts of
consecutive matching bases), uppercase letters (mismatched reference
bases), and ^-prefixed uppercase letter runs (reference bases deleted
from the query). Insertions in the query are not represented in MD;
they are recorded only in CIGAR.
YT:Z:UU
Alignment type in bowtie2 notation. Always UU (unpaired read aligned
to a single unpaired reference). vsearch does not align paired reads.
The AS, NM, and MD tags are predefined in the SAMtags specification.
The XN, XM, XO, XG, and YT tags are not: tags starting with X, Y,
or Z are reserved for local use, and these ones originate from bowtie2
and usearch.
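The tag descriptions above translate into a short parsing sketch. It assumes only the two value types that appear in this manual (i for integers, Z for strings); `parse_tags` and `tokenize_md` are hypothetical names chosen for the example.

```python
import re


def parse_tags(fields):
    """Turn ["AS:i:92", "MD:Z:11", ...] into {"AS": 92, "MD": "11", ...}."""
    tags = {}
    for field in fields:
        # Split into at most three parts so that Z values may contain colons.
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


# One MD token is a match count, a ^-prefixed deletion, or a mismatched base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def tokenize_md(md):
    """Break an MD string into (kind, value) pairs."""
    tokens = []
    for m in MD_TOKEN.finditer(md):
        if m.group(1) is not None:
            tokens.append(("match", int(m.group(1))))
        elif m.group(2) is not None:
            tokens.append(("deletion", m.group(2)))
        else:
            tokens.append(("mismatch", m.group(3)))
    return tokens
```

With the tags parsed, the relation stated for NM can be checked directly as `tags["NM"] == tags["XM"] + tags["XG"]`.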
Unmapped queries
By default, a query that produces no hit does not generate a SAM record.
When --output_no_hits is given, vsearch emits a single record for each
such query with FLAG set to 4 (unmapped), RNAME set to *, POS
set to 0, CIGAR set to *, and no optional tags. The SEQ field
contains the original query sequence.
Character set and line endings
vsearch writes ASCII characters only. Lines are terminated by a single line feed character (ASCII 10); no carriage return is emitted. Fields and tags are separated by a single tab character (ASCII 9). vsearch does not pad fields and does not emit trailing tabs.
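A downstream script can verify these byte-level properties before parsing. The following is a minimal sketch; `check_line` is a hypothetical helper, and it expects each line as raw bytes, including its terminator.

```python
def check_line(raw):
    """Return a list of problems found in one raw SAM line (bytes)."""
    problems = []
    if not raw.endswith(b"\n"):
        problems.append("missing line feed")
    if b"\r" in raw:
        problems.append("carriage return present")
    body = raw[:-1] if raw.endswith(b"\n") else raw
    if body.endswith(b"\t"):
        problems.append("trailing tab")
    if b"\t\t" in body:
        problems.append("empty field")
    if any(byte > 127 for byte in raw):
        problems.append("non-ASCII byte")
    return problems
```

Reading the file in binary mode (for example with `open(path, "rb")`) keeps Python from translating line endings before the check runs.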
EXAMPLES
A minimal SAM file with header, one forward-strand hit, one
reverse-strand hit for a second query, and one unmapped query
(--output_no_hits in effect):
@HD VN:1.0 SO:unsorted GO:query
@SQ SN:ref1 LN:10 M5:0123456789abcdef0123456789abcdef UR:file:refs.fasta
@SQ SN:ref2 LN:12 M5:fedcba9876543210fedcba9876543210 UR:file:refs.fasta
@PG ID:vsearch VN:2.30.6 CL:vsearch --usearch_global queries.fasta --db refs.fasta --id 0.9 --samout out.sam --samheader
q1 0 ref1 1 255 10M * 0 0 ACGTACGTAC * AS:i:100 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:10 YT:Z:UU
q2 16 ref2 1 255 5M1I6M * 0 0 ACGTAGCGTACG * AS:i:92 XN:i:0 XM:i:0 XO:i:1 XG:i:1 NM:i:1 MD:Z:11 YT:Z:UU
q3 4 * 0 255 * * 0 0 TTTTTTTT *
The second record has the reverse-strand bit 0x010 (decimal 16) set,
meaning that SEQ is the reverse complement of the original query and
the alignment lies on the reverse strand of ref2. The third record is
unmapped.
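The CIGAR and SEQ fields of the mapped records can be cross-checked: the SAM specification requires the lengths of the operations that consume the query (M and I, among those vsearch emits) to add up to the length of SEQ. The sketch below assumes CIGAR strings restricted to M, I, and D, as described in this manual; `cigar_lengths` is a hypothetical helper.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MID])")


def cigar_lengths(cigar):
    """Return (query_length, reference_length) consumed by a CIGAR string."""
    if not re.fullmatch(r"(?:\d+[MID])+", cigar):
        raise ValueError("unsupported CIGAR string: %r" % cigar)
    query_len = ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MI":  # M and I consume query residues
            query_len += n
        if op in "MD":  # M and D consume reference residues
            ref_len += n
    return query_len, ref_len


# Check the two mapped records from the example above.
for seq, cigar in [("ACGTACGTAC", "10M"), ("ACGTAGCGTACG", "5M1I6M")]:
    query_len, _ = cigar_lengths(cigar)
    assert query_len == len(seq)
```

Unmapped records carry * in the CIGAR field and should be skipped before calling the function, which rejects anything other than M, I, and D runs.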
SEE ALSO
vsearch-allpairs_global(1),
vsearch-cluster_fast(1),
vsearch-cluster_size(1),
vsearch-cluster_smallmem(1),
vsearch-cluster_unoise(1),
vsearch-search_exact(1),
vsearch-usearch_global(1),
vsearch-cigar(5),
vsearch-fasta(5),
vsearch-fastq(5)
The SAM and SAMtags specifications are maintained at https://github.com/samtools/hts-specs.
CITATION
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584 doi: 10.7717/peerj.2584
REPORTING BUGS
Submit suggestions and bug reports at https://github.com/torognes/vsearch/issues, send a pull request on https://github.com/torognes/vsearch, or compose a friendly or curmudgeonly e-mail to Torbjørn Rognes (torognes@ifi.uio.no).
AVAILABILITY
Source code and binaries are available at https://github.com/torognes/vsearch.
COPYRIGHT
Copyright (C) 2014-2026, Torbjørn Rognes, Frédéric Mahé and Tomás Flouri
All rights reserved.
Contact: Torbjørn Rognes torognes@ifi.uio.no, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway
This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
The BSD 2-Clause License
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
ACKNOWLEDGMENTS
We would like to thank the authors of the following projects for making their source code available:
- vsearch includes code from Google’s CityHash project by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under an MIT license.
- vsearch includes code derived from Tatusov and Lipman’s DUST program that is in the public domain.
- vsearch includes public domain code written by Alexander Peslyak for the MD5 message digest algorithm.
- vsearch includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm.
- vsearch binaries may include code from the zlib library, copyright Jean-Loup Gailly and Mark Adler.
- vsearch binaries may include code from the bzip2 library, copyright Julian R. Seward.