NAME

vsearch — a versatile open-source tool for metabarcoding and metagenomics

SYNOPSIS

vsearch <command> [file] [options]

(see below for a list of all available commands)

DESCRIPTION

vsearch is a versatile open-source tool for microbiome analysis, including chimera detection, clustering, dereplication and rereplication, extraction, FASTA/FASTQ/SFF file processing, masking, orienting, pairwise alignment, restriction site cutting, searching, shuffling, sorting, subsampling, and taxonomic classification of amplicon sequences for metabarcoding, metagenomics, genomics, and population genetics.

Each command is described in a dedicated manpage. For example, type man vsearch-usearch_global to read about the --usearch_global command. Command manpages belong to section 1 (executable programs). Format and reference manpages belong to sections 5 and 7 respectively; for example, type man 5 vsearch-fastq to read about the fastq format as used by vsearch.

VSEARCH COMMANDS

General

vsearch-help(1)
List available commands and options.

vsearch-version(1)
Write version information, citation, and compression support status.

Chimera detection

vsearch-uchime_denovo(1)
Detect chimeras de novo using the UCHIME algorithm.

vsearch-uchime2_denovo(1)
Detect chimeras de novo using the UCHIME2 algorithm.

vsearch-uchime3_denovo(1)
Detect chimeras de novo using the UCHIME2 algorithm with a stricter abundance skew.

vsearch-uchime_ref(1)
Detect chimeras using a reference database.

vsearch-chimeras_denovo(1)
Detect chimeras de novo in long exact sequences.

Clustering

vsearch-cluster_fast(1)
Cluster sequences sorted by decreasing length.

vsearch-cluster_size(1)
Cluster sequences sorted by decreasing abundance.

vsearch-cluster_smallmem(1)
Cluster pre-sorted sequences using minimal memory.

vsearch-cluster_unoise(1)
Denoise amplicon sequences using the UNOISE3 algorithm.

Dereplication and rereplication

vsearch-fastx_uniques(1)
Merge strictly identical fasta or fastq sequences.

vsearch-derep_fulllength(1)
Merge strictly identical fasta or fastq sequences (fasta-only output).

vsearch-derep_id(1)
Merge identical fasta or fastq sequences sharing the same label.

vsearch-derep_prefix(1)
Merge fasta or fastq sequences with identical prefixes.

vsearch-derep_smallmem(1)
Merge strictly identical sequences using minimal memory.

vsearch-rereplicate(1)
Use abundance values to rereplicate fasta sequences.

Extraction of sequences

vsearch-fastx_getseq(1)
Extract a sequence from a fasta or fastq file by label.

vsearch-fastx_getseqs(1)
Extract sequences from a fasta or fastq file by label.

vsearch-fastx_getsubseq(1)
Extract a subsequence from a fasta or fastq file by label.

FASTA/FASTQ/SFF file processing

vsearch-fasta2fastq(1)
Convert fasta entries into fastq entries with fake quality scores.

vsearch-fastq_chars(1)
Analyze a fastq file to identify the quality encoding and range of quality score values used.

vsearch-fastq_convert(1)
Convert between fastq encoding variants.

vsearch-fastq_eestats(1)
Report per-position quality and expected error statistics.

vsearch-fastq_eestats2(1)
Report read retention across combinations of length and expected error cutoffs.

vsearch-fastq_filter(1)
Trim and filter fastq sequences.

vsearch-fastq_join(1)
Join paired-end reads into one sequence with a gap.

vsearch-fastq_mergepairs(1)
Merge paired-end reads by aligning overlapping regions.

vsearch-fastq_stats(1)
Analyze fastq sequences and output detailed statistics.

vsearch-fastx_filter(1)
Trim and filter fasta or fastq sequences.

vsearch-fastx_revcomp(1)
Reverse-complement fasta or fastq sequences.

vsearch-fastx_syncpairs(1)
Reorder paired reads so mates match across both files.

vsearch-sff_convert(1)
Convert an SFF file to fastq.

Masking

vsearch-fastx_mask(1)
Mask low-complexity regions in fasta or fastq sequences.

vsearch-maskfasta(1)
Mask low-complexity regions in fasta sequences (deprecated; use --fastx_mask).

Orienting

vsearch-orient(1)
Use a reference database to orient fasta or fastq sequences.

Pairwise alignment

vsearch-allpairs_global(1)
Perform global pairwise alignments of all sequence pairs.

Restriction site cutting

vsearch-cut(1)
Use a restriction pattern to cut fasta sequences.

Searching

vsearch-search_exact(1)
Search for exact full-length matches against a database.

vsearch-usearch_global(1)
Search sequences against a reference database using global alignment.

Shuffling and sorting

vsearch-shuffle(1)
Randomize the order of fasta or fastq entries.

vsearch-sortbylength(1)
Sort fasta or fastq sequences by decreasing length.

vsearch-sortbysize(1)
Sort fasta or fastq sequences by decreasing abundance.

Subsampling

vsearch-fastx_subsample(1)
Randomly subsample fasta or fastq sequences.

Taxonomic classification

vsearch-sintax(1)
Classify sequences using the SINTAX algorithm.

UDB database handling

vsearch-makeudb_usearch(1)
Create a UDB database file from a fasta file.

vsearch-udb2fasta(1)
Extract sequences from a UDB database file into a fasta file.

vsearch-udbinfo(1)
Display information about a UDB database file.

vsearch-udbstats(1)
Report statistics about indexed words in a UDB database file.

FILE FORMATS

vsearch-cigar(5)
The CIGAR (Compact Idiosyncratic Gapped Alignment Report) format, used by vsearch to encode pairwise alignments.

vsearch-fasta(5)
The fasta format, as used by vsearch.

vsearch-fastq(5)
The fastq format, as used by vsearch.

vsearch-sff(5)
The Standard Flowgram Format (SFF), used by Roche 454 and early Ion Torrent PGM sequencing platforms.

vsearch-udb(5)
The UDB (USEARCH database) binary format, containing fasta sequences and a pre-computed k-mer index.

REFERENCE PAGES

vsearch-expected_error(7)
A quality summary metric for fastq sequences.

vsearch-nucleotides(7)
The IUPAC nucleotide symbols accepted by vsearch.

vsearch-pairwise_alignment_parameters(7)
The pairwise alignment model implemented in vsearch.

vsearch-userfields(7)
The output fields available with the --userout option.
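
As an illustration of the expected error metric documented in vsearch-expected_error(7), the sketch below sums the per-base error probabilities implied by the quality scores of one fastq entry (p = 10^(-Q/10)). It is not vsearch's implementation: the function name expected_error is hypothetical, and a quality offset of 33 is assumed.

```python
def expected_error(quality, offset=33):
    """Sum of per-base error probabilities for one fastq quality string.

    Assumption: quality symbols are Phred scores encoded with the given
    ASCII offset (33 here), so Q = ord(symbol) - offset.
    """
    return sum(10 ** (-(ord(symbol) - offset) / 10) for symbol in quality)


# 'I' encodes Q40 with offset 33, so each base contributes 0.0001.
print(expected_error("IIII"))
```

A read with a low expected error is unlikely to contain any sequencing error, which is why the metric is used as a filtering threshold.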

CITATION

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584 doi: 10.7717/peerj.2584

REPORTING BUGS

Submit suggestions and bug reports at https://github.com/torognes/vsearch/issues, send a pull request on https://github.com/torognes/vsearch, or compose a friendly or curmudgeonly e-mail to Torbjørn Rognes (torognes@ifi.uio.no).

AVAILABILITY

Source code and binaries are available at https://github.com/torognes/vsearch.

COPYRIGHT

Contact: Torbjørn Rognes torognes@ifi.uio.no, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway

This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.

GNU General Public License version 3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

The BSD 2-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ACKNOWLEDGMENTS

We would like to thank the authors of the following projects for making their source code available:

vsearch includes code from Google’s CityHash project by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under an MIT license.
vsearch includes code derived from Tatusov and Lipman’s DUST program that is in the public domain.
vsearch includes public domain code written by Alexander Peslyak for the MD5 message digest algorithm.
vsearch includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm.
vsearch binaries may include code from the zlib library, copyright Jean-Loup Gailly and Mark Adler.
vsearch binaries may include code from the bzip2 library, copyright Julian R. Seward.

VERSION HISTORY

New features and important modifications of vsearch (short-lived or minor bug releases may not be mentioned):

v1.0.0 released November 28th, 2014
First public release.

v1.0.1 released December 1st, 2014
Bug fixes (sortbysize, semicolon after size annotation in headers) and minor changes (labels as secondary sort key for most sorts, treat T and U as identical for dereplication, only output size in --dbmatched file if --sizeout specified).

v1.0.2 released December 6th, 2014
Bug fixes (ssse3/sse4.1 requirement, memory leak).

v1.0.3 released December 6th, 2014
Bug fix (now writes help to stdout instead of stderr).

v1.0.4 released December 8th, 2014
Added the --allpairs_global option. Reduced memory requirements slightly and eliminated memory leaks.

v1.0.5 released December 9th, 2014
Fixes a minor bug with --allpairs_global and --acceptall options.

v1.0.6 released December 14th, 2014
Fixes a memory allocation bug in chimera detection (--uchime_ref option).

v1.0.7 released December 19th, 2014
Fixes a bug in the output from chimera detection with the --uchimeout option.

v1.0.8 released January 22nd, 2015
Introduces several changes and bug fixes:

a new linear memory aligner for alignment of sequences longer than 5,000 nucleotides,
a new --cluster_size command that sorts sequences by decreasing abundance before clustering,
meaning of userfields qlo, qhi, tlo, thi changed for compatibility with usearch,
new userfields qilo, qihi, tilo, tihi give alignment coordinates ignoring terminal gaps,
in --uc output files, a perfect alignment is indicated with a = sign,
the option --cluster_fast now sorts sequences by decreasing length, then by decreasing abundance and finally by sequence identifier,
default --maxseqlength value set to 50,000 nucleotides,
fix for bug in alignment in rare cases,
fix for lack of detection of under- or overflow in SIMD aligner.
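
The sort order described above for --cluster_fast (decreasing length, then decreasing abundance, then sequence identifier) can be sketched in Python as follows. This is only an illustration, not vsearch's code: the Record type and the cluster_fast_order function are hypothetical names, and ascending order for the identifier tie-break is an assumption.

```python
from typing import NamedTuple


class Record(NamedTuple):
    label: str       # sequence identifier
    sequence: str
    abundance: int


def cluster_fast_order(records):
    """Sort by decreasing length, then decreasing abundance, then label.

    Assumption: labels are compared in ascending lexicographic order.
    """
    return sorted(
        records,
        key=lambda r: (-len(r.sequence), -r.abundance, r.label),
    )


records = [
    Record("s3", "ACGT", 5),
    Record("s1", "ACGTAC", 1),
    Record("s2", "ACGT", 5),
    Record("s4", "ACGTAC", 7),
]
ordered = [r.label for r in cluster_fast_order(records)]
print(ordered)  # ['s4', 's1', 's2', 's3']
```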

v1.0.9 released January 22nd, 2015
Fixes a bug in the function sorting sequences by decreasing abundance (--sortbysize).

v1.0.10 released January 23rd, 2015
Fixes a bug where the --sizein option was ignored and always treated as on, affecting clustering and dereplication commands.

v1.0.11 released February 5th, 2015
Introduces the possibility to output results in SAM format (for clustering, pairwise alignment and searching).

v1.0.12 released February 6th, 2015
Temporarily fixes a problem with long headers in FASTA files.

v1.0.13 released February 17th, 2015
Fix a memory allocation problem when computing multiple sequence alignments with the --msaout and --consout options, as well as a memory leak. Also increased line buffer for reading FASTA files to 4MB.

v1.0.14 released February 17th, 2015
Fix a bug where the multiple alignment and consensus sequence computed after clustering ignored the strand of the sequences. Also decreased size of line buffer for reading FASTA files to 1MB again due to excessive stack memory usage.

v1.0.15 released February 18th, 2015
Fix bug in calculation of identity metric between sequences when using the MBL definition (--iddef 3).

v1.0.16 released February 19th, 2015
Integrated patches from Debian for increased compatibility with various architectures.

v1.1.0 released February 20th, 2015
Added the --quiet option to suppress all output to stdout and stderr except for warnings and fatal errors. Added the --log option to write messages to a log file.

v1.1.1 released February 20th, 2015
Added info about --log and --quiet options to help text.

v1.1.2 released March 18th, 2015
Fix bug with large datasets. Fix format of help info.

v1.1.3 released March 18th, 2015
Fix more bugs with large datasets.

v1.2.0–v1.2.19 released July 6th to September 8th, 2015
Several new commands and options added. Bugs fixed. Documentation updated.

v1.3.0 released September 9th, 2015
Changed to autotools build system.

v1.3.1 released September 14th, 2015
Several new commands and options. Bug fixes.

v1.3.2 released September 15th, 2015
Fixed memory leaks. Added -h shortcut for help. Removed extra v in version number.

v1.3.3 released September 15th, 2015
Fixed bug in hexadecimal digits of MD5 and SHA1 digests. Added --samheader option.

v1.3.4 released September 16th, 2015
Fixed compilation problems with zlib and bzip2lib.

v1.3.5 released September 17th, 2015
Minor configuration/makefile changes to compile to native CPU and simplify makefile.

v1.4.0 released September 25th, 2015
Added --sizeorder option.

v1.4.1 released September 29th, 2015
Inserted public domain MD5 and SHA1 code to eliminate dependency on crypto and openssl libraries and their licensing issues.

v1.4.2 released October 2nd, 2015
Dynamic loading of libraries for reading gzip and bzip2 compressed files if available. Circumvention of missing gzoffset function in zlib 1.2.3 and earlier.

v1.4.3 released October 3rd, 2015
Fix a bug with determining amount of memory on some versions of Apple OS X.

v1.4.4 released October 3rd, 2015
Remove debug message.

v1.4.5 released October 6th, 2015
Fix memory allocation bug when reading long FASTA sequences.

v1.4.6 released October 6th, 2015
Fix subtle bug in SIMD alignment code that reduced accuracy.

v1.4.7 released October 7th, 2015
Fixes a problem with searching for or clustering sequences with repeats. In this new version, vsearch looks at all words occurring at least once in the sequences in the initial step; previously, only words occurring exactly once were considered. In addition, vsearch now requires the sequences to share at least 10 words, where previously only 6 were required. If the query contains fewer than 10 words, all words must be present for a match. This change seems to lead to slightly reduced recall but somewhat increased precision, for a slightly improved overall accuracy.
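
The word-sharing rule introduced in this version can be sketched as below. This is a simplified illustration and not the vsearch implementation: the function names are hypothetical, a word length of 8 is assumed, and rejecting queries too short to contain any word is an arbitrary choice made for the example.

```python
def unique_words(sequence, k=8):
    """Return the set of distinct words (k-mers) occurring at least once."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def passes_word_filter(query, target, k=8, min_shared=10):
    """Apply the rule: share at least min_shared words, or all query
    words when the query has fewer than min_shared distinct words."""
    query_words = unique_words(query, k)
    if not query_words:
        return False  # assumption: a query without any word never passes
    shared = len(query_words & unique_words(target, k))
    required = min(min_shared, len(query_words))
    return shared >= required


# An 18-nt query holds 11 distinct 8-mers, so 10 must be shared.
print(passes_word_filter("ACGTACGGTCAATGCCTA", "ACGTACGGTCAATGCCTA"))
# An 11-nt query holds only 4 words, so all 4 must be present.
print(passes_word_filter("ACGTACGGTCA", "GGACGTACGGTCATT"))
```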

v1.5.0 released October 7th, 2015
Introduces the new option --minwordmatches that allows the user to specify the minimum number of matching unique words before a sequence is considered further. New default values for different word lengths are also set. The minimum word length is increased to 7.

v1.6.0 released October 9th, 2015
Adds the relabeling options (--relabel, --relabel_md5 and --relabel_sha1) to the shuffle command. Also adds the --xsize option to the clustering, dereplication, shuffling and sorting commands.

v1.6.1 released October 14th, 2015
Fix bugs and update manual and help text regarding relabelling. Add all relabelling options to the subsampling command. Add the --xsize option to chimera detection, dereplication and fastq filtering commands. Refactoring of code.

v1.7.0 released October 14th, 2015
Add --relabel_keep option.

v1.8.0 released October 19th, 2015
Added --search_exact, --fastx_mask and --fastq_convert commands. Changed most commands to read FASTQ input files as well as FASTA files. Modified --fastx_revcomp and --fastx_subsample to write FASTQ files.

v1.8.1 released November 2nd, 2015
Fixes for compatibility with QIIME and older OS X versions.

v1.9.0 released November 12th, 2015
Added the --fastq_mergepairs command and associated options. This command has not been tested well yet. Included additional files to avoid dependency on autoconf for compilation. Fixed an error where identifiers in fasta headers were not truncated at tabs, just spaces. Fixed a bug in detection of the file format (FASTA/FASTQ) of a gzip compressed input file.

v1.9.1 released November 13th, 2015
Fixed memory leak and a bug in score computation in --fastq_mergepairs, and improved speed.

v1.9.2 released November 17th, 2015
Fixed a bug in the computation of some values with --fastq_stats.

v1.9.3 released November 19th, 2015
Workaround for missing x86intrin.h with old compilers.

v1.9.4 released December 3rd, 2015
Fixed incrementation of counter when relabeling dereplicated sequences.

v1.9.5 released December 3rd, 2015
Fixed bug resulting in inferior chimera detection performance.

v1.9.6 released January 8th, 2016
Fixed bug in aligned sequences produced with --fastapairs and --userout (qrow, trow) options.

v1.9.7 released January 12th, 2016
Masking behaviour is changed somewhat to keep the letter case of the input sequences unchanged when no masking is performed. Masking is now performed also during chimera detection. Documentation updated.

v1.9.8 released January 22nd, 2016
Fixed bug causing segfault when chimera detection is performed on extremely short sequences.

v1.9.9 released January 22nd, 2016
Adjusted default minimum number of word matches during searches for improved performance.

v1.9.10 released January 25th, 2016
Fixed bug related to masking and lower case database sequences.

v1.10.0 released February 11th, 2016
Parallelized and improved merging of paired-end reads and adjusted some defaults. Removed progress indicator when stderr is not a terminal. Added --fasta_score option to report chimera scores in FASTA files. Added --rereplicate and --fastq_eestats commands. Fixed typos. Added relabelling to files produced with --consout and --profile options.

v1.10.1 released February 23rd, 2016
Fixed a bug affecting the --fastq_mergepairs command causing FASTQ headers to be truncated at first space (despite the bug fix release 1.9.0 of November 12th, 2015). Full headers are now included in the output (no matter if --notrunclabels is in effect or not).

v1.10.2 released March 18th, 2016
Fixed a bug causing a segmentation fault when running --usearch_global with an empty query sequence. Also fixed a bug causing imperfect alignments to be reported with an alignment string of = in uc output files. Fixed typos in man file. Fixed fasta/fastq processing code regarding presence or absence of compression library header files.

v1.11.1 released April 13th, 2016
Added strand information in UC file for --derep_fulllength and --derep_prefix. Added expected errors (ee) to header of FASTA files specified with --fastaout and --fastaout_discarded when --eeout or --fastq_eeout option is in effect for fastq_filter and fastq_mergepairs. The options --eeout and --fastq_eeout are now equivalent.

v1.11.2 released June 21st, 2016
Two bugs were fixed. The first issue was related to the --query_cov option that used a different coverage definition than the qcov userfield. The coverage is now defined as the fraction of the whole query sequence length that is aligned with matching or mismatching residues in the target. All gaps are ignored. The other issue was related to the consensus sequences produced during clustering when only N’s were present in some positions. Previously these would be converted to A’s in the consensus. The behaviour is changed so that N’s are produced in the consensus, and it should now be more compatible with usearch.
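
The coverage definition given above can be illustrated with a short sketch that counts alignment columns where a query residue faces a target residue (matching or mismatching) and divides by the full query length. The function name query_coverage and the use of '-' as the gap symbol in the aligned rows are assumptions made for this example; this is not vsearch's code.

```python
def query_coverage(query_row, target_row, query_length):
    """Fraction of the whole query aligned to target residues.

    query_row and target_row are the two rows of a pairwise alignment,
    with '-' marking gaps (an assumption of this sketch). Columns
    containing a gap in either row are ignored.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    aligned = sum(
        1 for q, t in zip(query_row, target_row) if q != "-" and t != "-"
    )
    return aligned / query_length


# Nine columns, two of them with a gap: 7 aligned residues out of 8.
coverage = query_coverage("ACGT-ACGT", "AC-TTACCT", query_length=8)
print(coverage)  # 0.875
```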

v2.0.0 released June 24th, 2016
This major new version supports reading from pipes. Two new options are added: --gzip_decompress and --bzip2_decompress. One of these options must be specified when reading compressed input from a pipe, but neither is required when reading from ordinary files. The vsearch header that was previously written to stdout is now written to stderr. This enables piping of results for further processing. The file name - now represents standard input (/dev/stdin) or standard output (/dev/stdout) when reading or writing files, respectively. Code for reading FASTA and FASTQ files has been refactored.
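
The convention of using - as a file name is common to many command-line tools. A minimal sketch of how a program might resolve it is shown below; open_input and open_output are hypothetical helper names, and the code only illustrates the convention, not how vsearch handles files internally.

```python
import sys


def open_input(filename):
    """Return standard input for '-', otherwise open the named file."""
    if filename == "-":
        return sys.stdin
    return open(filename, "r")


def open_output(filename):
    """Return standard output for '-', otherwise open the named file."""
    if filename == "-":
        return sys.stdout
    return open(filename, "w")


# Writing to '-' sends results to stdout, ready to be piped onward.
out = open_output("-")
out.write(">seq1\nACGT\n")
```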

v2.0.1 released June 30th, 2016
Avoid segmentation fault when masking very long sequences.

v2.0.2 released July 5th, 2016
Avoid warnings when compiling with GCC 6.

v2.0.3 released August 2nd, 2016
Fixed bad compiler options resulting in Illegal instruction errors when running precompiled binaries.

v2.0.4 released September 1st, 2016
Improved error message for bad FASTQ quality values. Improved manual.

v2.0.5 released September 9th, 2016
Add options --fastaout_discarded and --fastqout_discarded to output discarded sequences from subsampling to separate files. Updated manual.

v2.1.0 released September 16th, 2016
New command: --fastx_filter. New options: --fastq_maxlen, --fastq_truncee. Allow --minwordmatches down to 3.

v2.1.1 released September 23rd, 2016
Fixed bugs in output to UC-files. Improved help text and manual.

v2.1.2 released September 28th, 2016
Fixed incorrect abundance output from fastx_filter and fastq_filter when relabelling.

v2.2.0 released October 7th, 2016
Added OTU table generation options --biomout, --mothur_shared_out and --otutabout to the clustering and searching commands.

v2.3.0 released October 10th, 2016
Allowed zero-length sequences in FASTA and FASTQ files. Added --fastq_trunclen_keep option. Fixed bug with output of OTU tables to pipes.

v2.3.1 released November 16th, 2016
Fixed bug where --minwordmatches 0 was interpreted as the default minimum word matches for the given word length instead of zero. When used in combination with --maxaccepts 0 and --maxrejects 0 it will allow complete bypass of kmer-based heuristics.

v2.3.2 released November 18th, 2016
Fixed bug where vsearch reported the ordinal number of the target sequence instead of the cluster number in column 2 on H-lines in the uc output file after clustering. For search and alignment commands, both usearch and vsearch report the target sequence number here.

v2.3.3 released December 5th, 2016
A minor speed improvement.

v2.3.4 released December 9th, 2016
Fixed bug in output of sequence profiles and updated documentation.

v2.4.0 released February 8th, 2017
Added support for Linux on Power8 systems (ppc64le) and Windows on x86_64. Improved detection of pipes when reading FASTA and FASTQ files. Corrected option for specifying output from fastq_eestats command in help text.

v2.4.1 released March 1st, 2017
Fixed an overflow bug in fastq_stats and fastq_eestats affecting analysis of very large FASTQ files. Fixed maximum memory usage reporting on Windows.

v2.4.2 released March 10th, 2017
Default value for fastq_minovlen increased to 16 in accordance with help text and for compatibility with usearch. Minor changes for improved accuracy of paired-end read merging.

v2.4.3 released April 6th, 2017
Fixed bug with progress bar for shuffling. Fixed missing N-lines in UC files with usearch_global, search_exact and allpairs_global when the output_no_hits option was not specified.

v2.4.4 released August 28th, 2017
Fixed a few minor bugs, improved error messages and updated documentation.

v2.5.0 released October 5th, 2017
Support for UDB database files. New commands: fastq_stripright, fastq_eestats2, makeudb_usearch, udb2fasta, udbinfo, and udbstats. New general option: no_progress. New options minsize and maxsize to fastx_filter. Minor bug fixes, error message improvements and documentation updates.

v2.5.1 released October 25th, 2017
Fixed bug with bad default value of 1 instead of 32 for minseqlength when using the makeudb_usearch command.

v2.5.2 released October 30th, 2017
Fixed bug where - as an argument to the fastq_eestats2 option was treated literally instead of equivalent to stdin.

v2.6.0 released November 10th, 2017
Rewritten paired-end reads merger with improved accuracy. Decreased default value for fastq_minovlen option from 16 to 10. The default value for the fastq_maxdiffs option is increased from 5 to 10. There are now other more important restrictions that will avoid merging reads that cannot be reliably aligned.

v2.6.1 released December 8th, 2017
Improved parallelisation of paired-end read merging.

v2.6.2 released December 18th, 2017
Fixed option xsize that was partially inactive for commands uchime_denovo, uchime_ref, and fastx_filter.

v2.7.0 released February 13th, 2018
Added commands cluster_unoise, uchime2_denovo and uchime3_denovo contributed by Davide Albanese based on Robert Edgar’s papers. Refactored fasta and fastq print functions as well as code for extraction of abundance and other attributes from the headers.

v2.7.1 released February 16th, 2018
Fix several bugs on Windows related to large files, use of - as a file name to mean stdin or stdout, alignment errors, missed kmers and corrupted UDB files. Added documentation of UDB-related commands.

v2.7.2 released April 20th, 2018
Added the sintax command for taxonomic classification. Fixed a bug with incorrect FASTA headers of consensus sequences after clustering.

v2.8.0 released April 24th, 2018
Added the fastq_maxdiffpct option to the fastq_mergepairs command.

v2.8.1 released June 22nd, 2018
Fixes for compilation warnings with GCC 8.

v2.8.2 released August 21st, 2018
Fix for wrong placement of semicolons in header lines in some cases when using the sizeout or xsize options. Reduced memory requirements for full-length dereplication in cases with many duplicate sequences. Improved wording of fastq_mergepairs report. Updated manual regarding use of sizein and sizeout with dereplication. Changed a compiler option.

v2.8.3 released August 31st, 2018
Fix for segmentation fault for --derep_fulllength with --uc.

v2.8.4 released September 3rd, 2018
Further reduce memory requirements for dereplication when not using the uc option. Fix output during subsampling when quiet or log options are in effect.

v2.8.5 released September 26th, 2018
Fixed a bug in fastq_eestats2 that caused the values for large lengths to be much too high when the input sequences had varying lengths.

v2.8.6 released October 9th, 2018
Fixed a bug introduced in version 2.8.2 that caused derep_fulllength to include the full FASTA header in its output instead of stopping at the first space (unless the notrunclabels option is in effect).

v2.9.0 released October 10th, 2018
Added the fastq_join command.

v2.9.1 released October 29th, 2018
Changed compiler options that select the target cpu and tuning to allow the software to run on any 64-bit x86 system, while tuning for more modern variants. Avoid illegal instruction error on some architectures. Update documentation of rereplicate command.

v2.10.0 released December 6th, 2018
Added the sff_convert command to convert SFF files to FASTQ. Added some additional option argument checks. Fixed segmentation fault bug after some fatal errors when a log file was specified.

v2.10.1 released December 7th, 2018
Improved sff_convert command. It will now read several variants of the SFF format. It is also able to read from a pipe. Warnings are given if there are minor problems. Error messages have been improved. Minor speed and memory usage improvements.

v2.10.2 released December 10th, 2018
Fixed bug in sintax with reversed order of domain and kingdom.

v2.10.3 released December 19th, 2018
Ported to Linux on ARMv8 (aarch64). Fixed a compilation warning with GCC versions 8.1.0 and 8.2.0.

v2.10.4 released January 4th, 2019
Fixed serious bug in x86_64 SIMD alignment code introduced in version 2.10.3. Added link to BioConda in README. Fixed bug in fastq_stats with sequence length 1. Fixed use of equals symbol in UC files for identical sequences with cluster_fast.

v2.11.0 released February 13th, 2019
Added ability to trim and filter paired-end reads using the reverse option with the fastx_filter and fastq_filter commands. Added --xee option to remove ee attributes from FASTA headers. Minor invisible improvement to the progress indicator.

v2.11.1 released February 28th, 2019
Minor change to the handling of the weak_id and id options when using cluster_unoise.

v2.12.0 released March 19th, 2019
Take sequence abundance into account when computing consensus sequences or profiles after clustering. Warn when rereplicating sequences without abundance info. Guess offset 33 in more cases with fastq_chars. Stricter checking of option arguments and option combinations.

v2.13.0 released April 11th, 2019
Added the --fastx_getseq, --fastx_getseqs and --fastx_getsubseq commands to extract sequences from a FASTA or FASTQ file based on their labels. Improved handling of ambiguous nucleotide symbols. Corrected behaviour of --uchime_ref command with options --self and --selfid. Strict detection of illegal options for each command.

v2.13.1 released April 26th, 2019
Minor changes to the allowed options for each command. All commands now allow the log, quiet and threads options. If more than 1 thread is specified for commands that are not multi-threaded, a warning will be issued. Minor changes to the manual.

v2.13.2 released April 30th, 2019
Fixed bug related to improper handling of newlines on Windows. Allowed option strand plus to uchime_ref for compatibility.

v2.13.3 released April 30th, 2019
Fixed bug in FASTQ parsing introduced in version 2.13.2.

v2.13.4 released May 10th, 2019
Added information about support for gzip- and bzip2-compressed input files to the output of the version command. Adapted source code for compilation on FreeBSD and NetBSD systems.

v2.13.5 released July 2nd, 2019
Added cut command to fragment sequences at restriction sites. Silenced output from the fastq_stats command if quiet option was given. Updated manual.

v2.13.6 released July 2nd, 2019
Added info about cut command to output of help command.

v2.13.7 released September 2nd, 2019
Fixed bug in consensus sequence introduced in version 2.13.0.

v2.14.0 released September 11th, 2019
Added relabel_self option. Made fasta_width, sizein, sizeout and relabelling options valid for certain commands.

v2.14.1 released September 18th, 2019
Fixed bug with sequences written to file specified with fastaout_rev for commands fastx_filter and fastq_filter.

v2.14.2 released January 28th, 2020
Fixed some issues with the cut, fastx_revcomp, fastq_convert, fastq_mergepairs, and makeudb_usearch commands. Updated manual.

v2.15.0 released June 19th, 2020
Update manual and documentation. Turn on notrunclabels option for sintax command by default. Change maxhits 0 to mean unlimited hits, like the default. Allow non-ASCII characters in headers, with a warning. Sort centroids and uc too when clusterout_sort specified. Add cluster id to centroids output when clusterout_id specified. Improve error messages when parsing FASTQ files. Add missing fastq_qminout option and fix label_suffix option for fastq_mergepairs. Add derep_id command that dereplicates based on both label and sequence. Remove compilation warnings.

v2.15.1 released October 28th, 2020
Fix for dereplication when including reverse complement sequences and headers. Make some extra checks when loading compression libraries and add more diagnostic output about them to the output of the version command. Report an error when fastx_filter is used with FASTA input and options that require FASTQ input. Update manual.

v2.15.2 released January 26th, 2021
No real functional changes, but some code and compilation changes. Compiles successfully on macOS running on Apple Silicon (ARMv8). Binaries available. Code updated for C++11. Minor adaptations for Windows compatibility, including the use of the C++ standard library for regular expressions. Minor changes for compatibility with Power8. Switch to C++ header files.

v2.16.0 released March 22nd, 2021
Adds the orient command. Handles empty input files properly. Documentation has been updated.

v2.17.0 released March 29th, 2021
The fastq_mergepairs command has been changed. It now allows merging of sequences with overlaps as short as 5 bp if the --fastq_minovlen option has been adjusted down from the default 10. In addition, far fewer pairs of reads should now be rejected with the reason ‘multiple potential alignments’, as the algorithm for detecting those has been changed.

v2.17.1 released June 14th, 2021
Modernized code. Minor changes to help info.

v2.18.0 released August 27th, 2021
Added the fasta2fastq command. Fixed search bug on ppc64le. Fixed bug with removal of size and ee info in uc files. Fixed compilation errors in some cases. Made some general code improvements. Updated manual.

v2.19.0 released December 21st, 2021
Added the lcaout and lca_cutoff options to enable the output of last common ancestor (LCA) information about hits when searching. The randseed option was added as a valid option to the sintax command. Code improvements.

v2.20.0 released January 10th, 2022
Added the fastx_uniques command and the fastq_qout_max option for dereplication of FASTQ files. Some code cleaning.

v2.20.1 released January 11th, 2022
Fixes a bug in fastq_mergepairs that caused an occasional hang at the end when using multiple threads.

v2.21.0 released January 12th, 2022
Adds the sample, qsegout and tsegout options. Enables the use of UDB databases with uchime_ref.

v2.21.1 released January 18th, 2022
Fix a problem with dereplication of empty input files. Update Altivec code on ppc64le for improved compiler compatibility (vector->__vector).

v2.21.2 released September 12th, 2022
Fix problems with the lcaout option when using maxaccepts above 1 and either lca_cutoff below 1 or with top_hits_only enabled. Update documentation. Update code to avoid compiler warnings.

v2.22.0 released September 19th, 2022
Add the derep_smallmem command for dereplication using little memory.

v2.22.1 released September 19th, 2022
Fix compiler warning.

v2.23.0 released July 7th, 2023
Update documentation. Add citation file. Modernize and improve code. Fix several minor bugs. Fix compilation with GCC 13. Print stats after fastq_mergepairs to log file instead of stderr. Handle sizein option correctly with dbmatched option for usearch_global. Allow maxseqlength option for makeudb_usearch. Fix memory allocation problem with chimera detection. Add lengthout and xlength options. Increase precision for eeout option. Add warning about sintax algorithm, random seed and multiple threads. Refactor chimera detection code. Add undocumented experimental long_chimeras_denovo command. Fix segfault with clustering. Add more references.

v2.24.0 released October 26th, 2023
Update documentation. Improve code. Allow up to 20 parents for the undocumented and experimental chimeras_denovo command. Fix compilation warnings for sha1.c. Compile for release (not debug) by default.

v2.25.0 released November 10th, 2023
Allow a given percentage of mismatches between chimeras and parents for the experimental chimeras_denovo command.

v2.26.0 released November 24th, 2023
Enable the maxseqlength and minseqlength options for the chimera detection commands. When the usearch_global or search_exact commands are used, OTU tables will include samples and OTUs with no matches.

v2.26.1 released November 25th, 2023
No real changes, but the previous version was released without proper updates to the source code.

v2.27.0 released January 19th, 2024
The usearch_global and search_exact commands now support FASTQ files as well as FASTA files as input. This version of vsearch includes clarifications and updates to the manual. Some code has been refactored. Generic Dockerfiles for major Linux distributions have been included. Some warnings from compilers and other tools have been eliminated. The release for Windows will also include DLLs for the two compression libraries.

v2.27.1 released April 6th, 2024
Fixes the weak_id option and makes searches report weak hits in some cases. Updates the names of the compression libraries to libz.so.1 and libbz2.so.1 on Linux to make them work on common Linux distributions without installing additional packages. README.md has been updated with information about compression libraries on Windows.

v2.28.0 released April 26th, 2024
The sintax command has been improved in several ways in this version of vsearch. Please note that several details of this algorithm are not clearly described in the preprint, and the implementation in vsearch differs from that in usearch. The former vsearch version did not always choose the most common taxonomic entity over the 100 bootstraps among the database sequences with the highest amount of word similarity to the query. Instead, if several sequences had an equal similarity with the query, the sequence encountered in the earliest bootstrap was chosen. The confidence level was calculated based on this sequence compared to the selected sequences from the other 99 bootstraps. This could lead to a suboptimal choice with a low confidence. In the new version, the most common of the sequences with the highest amount of word similarity across the 100 bootstraps will be selected, and ties will be broken randomly. Another problem with the old implementation was that if several sequences had the same amount of word similarity, the shortest one in the reference database would be chosen, and if they were equally long, the earliest in the database file would be chosen. A new option called sintax_random has now been introduced. This option will randomly select one of the sequences with the highest number of shared words with the query, without considering their length or position. This avoids a bias towards shorter reference sequences. This option is strongly recommended and will probably soon be the default. Furthermore, a ninth taxonomic rank, strain (letter t), is now recognized. The speed of the sintax command has also been significantly improved at least in some cases. Run vsearch with the randseed option and 1 thread to ensure reproducibility of the random choices in the algorithm.
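
The selection rule described above (pick the most common candidate across the bootstraps, break ties at random) can be sketched in Python. This is only an illustration under stated assumptions, not the vsearch implementation: the function name pick_most_common, the list-of-labels input, and the returned share are hypothetical, and the share is not the confidence value that vsearch reports.

```python
import random
from collections import Counter

def pick_most_common(bootstrap_picks, rng):
    """Return the most frequent label and its share of the bootstraps.

    bootstrap_picks holds one reference label per bootstrap replicate
    (a hypothetical input format). Labels that are equally frequent
    are resolved by a random draw from rng.
    """
    counts = Counter(bootstrap_picks)
    top = max(counts.values())
    # Sort the tied labels so that a fixed seed gives a reproducible draw.
    tied = sorted(label for label, n in counts.items() if n == top)
    choice = rng.choice(tied)
    return choice, top / len(bootstrap_picks)

rng = random.Random(1)  # fixed seed, in the spirit of the randseed option
picks = ["refA"] * 60 + ["refB"] * 30 + ["refC"] * 10
print(pick_most_common(picks, rng))  # ('refA', 0.6)
```

With a fixed seed and a single consumer of the random generator, the tie-breaking draws are repeatable, which is the reason the entry above recommends the randseed option together with 1 thread.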

v2.28.1 released April 26th, 2024
Fix a segmentation fault that could occur with the blast6out and output_no_hits options.

v2.29.0 released September 26th, 2024
Fixes seven bugs (see changelog below), adds initial support for RISC-V architectures, and improves code quality and code testing (1,210 new tests):

add: experimental support for RISCV64 and other 64-bit little-endian architectures, thanks to Michael R. Crusoe and his fellow Debian developers (issue #566),
add: official support for clang-19 and gcc 14,
add: beta support for clang-20,
remove: unused --output option for command --fastq_stats (issue #572),
fix: bug in --sintax when selecting the best lineage (only low confidence values below 0.5 were affected) (issue #573),
fix: out-of-bounds error in --fastq_stats when processing empty reads (issue #571),
fix: bug in --cut, patterns with multiple cutting sites were not detected (commit 4c4f9fa),
fix: memory error (segmentation fault) when using --derep_id and --strand (issue #565),
fix: --fastq_join now obeys --quiet and --log options (commit 87f968b),
fix: --fastq_join quality padding is now also set to Q40 when quality offset is 64 (commit be0bf9b),
fix: (partial) --fastq_join’s handling of abundance annotations (commit f2bbcb4),
improve: additional safeguards to validate input values and to make sure that they are within acceptable limits. Changes concern options --abskew (commit a530dd8) and --fastq_maxdiffs (commit 4b254db),
improve: code quality (1.3k+ commits, 6k+ clang-tidy warnings eliminated),
improve: documentation and help messages (issue #568),
improve: complete refactoring and modernization of a subset of commands (--sortbylength, --sortbysize, --shuffle, --rereplicate, --cut, --fastq_join, --fasta2fastq, --fastq_chars),
improve: code coverage of our test-suite for the above-mentioned commands (1,210 new tests, 4,753 in total).
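
As a small illustration of the --fastq_join padding fix listed above, the character that encodes a quality of Q40 depends on the quality offset. The Python sketch below only shows that arithmetic; the function name padding_quality_char is hypothetical and is not part of vsearch.

```python
def padding_quality_char(offset, q=40):
    """Return the FASTQ character that encodes quality score q at this offset."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    # A FASTQ quality character is the score plus the offset, as an ASCII code.
    return chr(offset + q)

print(padding_quality_char(33))  # I
print(padding_quality_char(64))  # h
```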

v2.29.1 released October 24th, 2024
Fix a segmentation fault that could occur during alignment in version 2.29.0, for example with --uchime_ref. Some improvements to code and documentation.

v2.29.2 released December 20th, 2024
Fix a segmentation fault during clustering when the set of clusters is empty. Initial documentation in markdown format available on GitHub Pages.

v2.29.3 released February 3rd, 2025
Released to mitigate a bug that occurs when compiling the align_simd.cc file on x86_64 systems with the GNU C++ compiler version 9 or later with the -O3 optimization option. It results in incorrect code that may cause bad alignments in some circumstances. We are investigating this issue further, but for now we recommend compiling with the -O2 flag. The README.md file and the Dockerfiles have been updated to reflect this. The binaries released with this version will include this fix.

v2.29.4 released February 14th, 2025
Adjust the window size used for chimera detection down from 64 to 32. The window size was accidentally increased from 32 to 64 in version 2.23.0, leading to somewhat fewer chimeras being predicted. In addition, a compiler pragma has been added to align_simd.cc as a further safeguard against the compiler generating incorrect code.

v2.30.0 released February 27th, 2025
Add options --n_mismatch, --fastq_minqual, and --fastq_truncee_rate. The --n_mismatch option will count N’s as mismatches in alignments, which may be useful to get sensible alignments for sequences with lots of N’s. By default N’s are counted as matches. Both the scoring and the counting of matches are affected. The new --fastq_minqual option for the --fastq_filter and --fastx_filter commands will discard sequences with any bases with a quality score below the given value. The new --fastq_truncee_rate option for the same commands will truncate sequences at the first position where the number of expected errors per base is above the given value.
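
The two filtering options can be sketched in Python as follows. This is a minimal illustration, not the vsearch code, and it rests on assumptions that the text above does not settle: the expected error of a base is taken as 10^(-Q/10), the rate is the cumulative expected error divided by the number of bases read so far, and the first offending position is excluded from the truncated read. All function names are hypothetical.

```python
def error_probability(q):
    """Error probability implied by Phred quality score q."""
    return 10 ** (-q / 10)

def passes_minqual(quals, minqual):
    """True if no base has a quality score below minqual."""
    return all(q >= minqual for q in quals)

def truncate_by_ee_rate(quals, max_rate):
    """Return the number of leading bases to keep.

    Walk along the read and stop at the first position where the
    cumulative expected errors per base exceed max_rate (assumed
    definition); that position and everything after it are dropped.
    """
    total = 0.0
    for i, q in enumerate(quals, start=1):
        total += error_probability(q)
        if total / i > max_rate:
            return i - 1
    return len(quals)

quals = [40, 40, 30, 10, 2]
print(passes_minqual(quals, 20))         # False
print(truncate_by_ee_rate(quals, 0.01))  # 3
```

In this example the fourth base (Q10) pushes the cumulative rate to about 0.025, so only the first three bases are kept.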

v2.30.1 released October 3rd, 2025
Incorporates many code improvements, more extensive testing, better documentation and some minor bug fixes:

fix: use-after-free introduced in commit de6c1d8 (Jun 13, 2024),
fix: (harmless) out-of-bounds memory issue in --derep_prefix (commit 8a0a508b),
fix: (harmless) memory leak in --fastx_getseqs --label_field (commit a9c42713),
fix: (harmless) memory leak when using option --userfields (--allpairs_global commit 03b95bcf; --cluster_* commit 2fde5472; --search_exact commit 45cd56d6; --usearch_global commit d83bfee9),
fix: (harmless) valgrind error, use of uninitialized values (commit 8bab2444), also eliminates a pesky compilation warning,
change: passing a negative value to --fastq_truncee_rate is now an error (commit a120f371),
change: passing a negative value to --fastq_minqual is now an error (commit ff5b0c99),
change: passing a non-ASCII symbol to --join_padgap or --join_padgapq is now an error (commit a708f5b3),
change: when using --gapopen "*" to forbid gap opening, the penalty is now set to INT_MAX (rather than 1,000). This might change alignment results for users who relied on the old behavior (thanks to Denis Filloux, issue #602, commit 96e9cf9e),
change: when using command --chimeras_denovo, --tabbedout or --alnout can be the only output files specified (commit d51f0a4),
change: when using command --chimeras_denovo, option --lengthout is now accepted (commit 3f55cc6b),
change: when using command --chimeras_denovo, option --xlength is now accepted (commit 0fd346cd),
add: compilation option GLIBCXX_DEBUG when compiling for debugging (commit 5cf4a6c1),
add: official support for clang 20,
add: initial support for clang 21, initial support for GCC 15,
add: experimental support for clang 22,
improve: more accurate line number when reporting illegal characters in fastq headers (commit 539084e9),
improve: more accurate line number when reporting non-ASCII characters in fastq headers (commit 98a851ed),
improve: remove checks for unneeded libraries during compilation (commit 249bb5d5),
improve: code quality (8,536 clang-tidy warnings eliminated),
improve: documentation and help messages (issue #604),
improve: complete refactoring and modernization of the command --fastq_stats,
improve: command --fastq_stats is now up to twice as fast (tested on x86-64),
improve: extensive test-suites for --fastq_stats and --sff_convert,
improve: code coverage of our test-suite.

v2.30.2 released December 12th, 2025
Fixes two minor issues. Allow a UDB file to be written to a pipe or stdout (issue #599). Correct computation of median cluster size after dereplication (issue #611).

v2.30.3 released January 12th, 2026
Fixes memory allocation bugs in the chimera detection code that caused a segmentation fault in rare cases (issue #615).

v2.30.4 released January 19th, 2026
Fixes issue #617. Due to a bug, incorrect scores were reported in the FASTA file headers of some non-chimeric sequences during chimera detection. The non-chimeric status was correct.

v2.30.5 released March 10th, 2026
Includes the following changes:

fix: out-of-bounds look-ups when printing alignment rows (commit 2f3d1ef). Thanks to user @gbbio for reporting this issue (#618). This is a regression, introduced in v2.30.1,
fix: options --xlength and --xee were not removing length and ee attributes as expected (commit 69fccdbc). This is a regression, introduced in v2.30.1,
fix: range checking for the option --max_unmasked_pct (commit e1a5611505),
add: individual manpages for the commands --shuffle and --cut,
improve: eliminated 50 clang-tidy warnings,
improve: code coverage of our test-suite.

v2.30.6 released March 27th, 2026
Includes the following changes:

fix: out-of-bounds look-ups when printing empty last common ancestor (LCA) results with the --lcaout option. Thanks to user AntoninLCH for reporting this issue,
improve: code refactoring,
improve: code coverage of our test-suite.

v2.31.0 released April 29th, 2026
Includes the following changes:

fix: --fastx_getseqs --label_words out-of-bounds memory issue
fix: --fastx_getsubseq now writes non-matching sequences in full (no trimming)
fix: sintax’s tie-breaking mechanism
fix: limit the scope of decompress options (--gzip_decompress and --bzip2_decompress) to stdin
fix: option --sample now properly truncates at first ‘;’ or blank
fix: option --wordlength now checks for values outside the range 3-15
fix: option --fasta_width now checks for negative values
fix: option --maxseqlength now checks for null and negative values
change: restore progress indicator when stderr is not a terminal. One line per progress step. This changes behavior present since v1.10.0.
change: --fastq_mergepairs now honors --sizeout and appends ;size=1 to merged and not-merged output headers
change: --fastq_filter and --fastx_filter: now reject out-of-domain values for --fastq_maxee, --fastq_maxee_rate, --fastq_truncee, --fastq_maxlen, --fastq_maxns, --fastq_minlen, --fastq_trunclen, --fastq_trunclen_keep, --fastq_truncqual, --fastq_stripleft, and --fastq_stripright
change: option --eetabbedout now prints EE values with more digits (more precise values)
change: --fastx_getseqs --label_words now silently skips empty lines
add: online documentation for all vsearch commands (https://torognes.github.io/vsearch/). Note to users: this is still a draft; the manpages are not yet available from the command line, only online
add: command --orient now supports option --hardmask
add: compilation flag -fno-exceptions
add: static library API for vsearch (contributed by @wasade)
improve: code refactoring
improve: code coverage of our test-suite (8,930 active tests)
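
Two of the header-related changes listed above (truncating the --sample value at the first ‘;’ or blank, and appending ;size=1 to merged reads) can be illustrated with a short Python sketch. It is not the vsearch code: the function names are hypothetical, a blank is assumed to mean a space or a tab, and the handling of headers that already end in ‘;’ is not covered.

```python
import re

def truncate_sample_name(value):
    """Keep the part of value before the first ';' or blank (space or tab)."""
    return re.split(r"[; \t]", value, maxsplit=1)[0]

def add_size(header, size=1):
    """Append an abundance annotation of the form ;size=N to a header."""
    return f"{header};size={size}"

print(truncate_sample_name("s1;extra"))  # s1
print(truncate_sample_name("s1 extra"))  # s1
print(add_size("read1"))                 # read1;size=1
```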
