Welcome to kvasirHGT’s documentation!

Identification of horizontal gene transfer between sequenced microbial genomes

Kvasir is a Norse god associated with fermented beverages. According to Wikipedia,

Extremely wise, Kvasir traveled far and wide, teaching and spreading knowledge. This continued until the dwarfs Fjalar and Galar killed Kvasir and drained him of his blood. The two mixed his blood with honey, resulting in the Mead of Poetry, a mead which imbues the drinker with skaldship and wisdom, and the spread of which eventually resulted in the introduction of poetry to mankind.

Installation

The easiest way to use Kvasir is to install it from pypi:

$ pip install kvasirHGT

This should add the necessary scripts to your PATH.

You will also need install the other python Dependencies (biopython, pandas, pymongo), and mongodb

Usage

Mongod

Kvasir uses a MongoDB database to store genome and HGT information, so you’ll need to install mongodb and PyMongo before doing anything.

MongoDB will build a database and index on your local system, so you should make a folder to put them in. You’ll never have to interact with the folder manually, so I just put it in a hidden folder in my home directory.

$ mkdir ~/.mongo_databases

Before you run any commands with kvasir, you need to launch a local MongoDB server using the mongod command:

$ mongod --dbpath ~/.mongo_databases

Adding Genomes

Kvasir can import genomes in Genbank format. You can pass either a single file, or a folder with many genbank files. Only files with extensions .gb or .gbk will be imported.

Each genome set should have a unique name - this will be the first argument for every script. For the following examples, I’ll be looking at cheese genomes and the set will have the name cheese1

$ kv_import.py cheese1 -i path/to/genbank/files

I haven’t yet built a way to check if you’re trying to import the same genome multiple times, so use with caution! If you’re not sure if you’ve already imported a file, you can query your MongoDB with the species name (see below for info on what is used as the species name). If you’re not sure if you’ve imported Awesomeus speciesus strain A, launch python, and try the following:

import pymongo
db = pymongo.MongoClient()["cheese1"] # substitute "cheese1" for the name of your genome set
db["genes"].find_one({"species":"Awesomeus speciesus strain A"})

If nothing is returned, then you haven’t imported it yet. The text has to be exact though. If you’re not sure, you can also get a list of all the species with any genes in the database:

db["genes"].distinct("species")

A note on Genbank formats

If your files aren’t formatted correctly, you ~~may~~ will have issues. At one point, I tried to write in a way to accommodate all kinds of messed up formats, but it’s just not worth it. So check your genomes with care.

In particular, be sure that each contig in the genome has a unique id (that’s the first field in the LOCUS line). Kvasir uses this to determine which genes are located near each other, if there are duplicates, it may think that genes on separate contigs are next to each other. Each gene should also have a unique locus_tag field.

Most genbank files have a SOURCE identifier, which contains the genus, species and strain that a particular contig comes from. This is what kvasir will use to identify genes from the same species. If your genbank file doesn’t have this, kvasir will use the name of the genbank file as the name of the species - but be careful! A single genbank file can contain contigs from multiple species - if your files are organized this way, make sure that they have a SOURCE field.

At some point, I’ll write a convenience script to check through your genomes to make sure that they’ll work. But until then - use with caution.

BLAST Searching

To make the next part work you need one more thing - the BLAST+ command line tools. This needs to be installed and accessible from your $PATH. In other words, you should be able to run which blastn and get a path spit back at you. If not - do some googling. I can’t help you.

Assuming this is working (you’ve still got mongod running right?), it’s time to start blasting. There are a couple of commands in the blast.py script. First, you’ve gotta make a blast database with all of the the genes in your database using makedb. This command will make 3 files, in the current directory by default, or in the directory you specify with -b

$ kv_blast.py cheese1 makedb -v -b ~/blastdbs
INFO:root:making BLAST database with protein coding sequences from cheese1 at /Users/ksb/blastdbs
INFO:root:BLAST db created at /Users/ksb/blastdbs

Next, blastall will go through each species in your database and blast it against that database. Each hit will then be stored in your MongoDB with some metadata. This can take a long time - use the -v (verbose) flag to get status updates.

$ kv_blast.py cheese1 blastall -v -b ~/blastdbs
INFO:root:blasting Awesomeus speciesus strain A
INFO:root:Blasting all by all
INFO:root:Getting Blast Records
INFO:root:---> 500 blast records retrieved
...

In this case - I’ve put in a check so if you run this a second time, it will skip blasting any species that already have hits found in the database. You can use the -f flag to override this, but I wouldn’t do that, because then you’ll have a bunch of duplicates. If you made a mistake and want to re-run it, I suggest clearing out the blast_results collection of your database. In python:

import pymongo
db = pymongo.MongoClient()["cheese1"] # substitute "cheese1" for the name of your genome set
db["blast_results"].delete_many({"query_species":"Awesomeus speciesus strain A"})

… or if you want to remove all blast results and start over:

db["blast_results"].drop()

Eventually, I will add a tool so that you can just add a single species or a group of new species and blast them separately, but for now, you’ll have to start from scratch each time.

Analysis

If your genome set contains only one example of any given genus, you’re in luck! At this step, you can run the HGT analysis, which will look for all protein coding genes that are >99% identical, and put them into groups (see below). If you have groups of genomes that are in the same genus, however, it’s worth verifying that they are not too closely related.

Dealing with closely related genomes

There are a few ways to approach this, and in kvasir they’re unified under the idea of “species distance” and there are some useful functions bin/kv_distance.py.

If the genbank files you imported have a SOURCE line for the name of your organism, and it’s formatted as <genus> <species> <strain>, you can use the script to calculate the distance based on Average Nucleotide Identity (ANI). This is calculated using a ruby script from enveomics. Measuring ANI using this method is only appropriate for closely related genomes, and as it stands, the script grabs all species in your MongoDB and calls the ANI script on pairs that have the same genus.

$ kv_distance.py cheese1 ani

This script pulls species from the MongoDB and then looks for all pairs of names to see if they’re the same genus. It does that a bit naively, using split()[0].split("_")[0], which will split the species name at spaces or _, and then take the first thing. So if you’re looking at “Awesomeus speciesus strain A”, this command will check if you have anything else with “Awesomeus” as the genus, like “Awesomeus_speciesus_strain_B”, and will calculate the ANI between them and add them to a “species_distance” database in the MongoDB.

But be careful! If instead your species are “strain_1” and “strain_2”, it will think they both have the genus “strain”, whether or not they’re related at all.

Once you’ve done this, you can export a distance matrix using these ANI calculations. Identical species will have an ANI distance = 0, and anything that is not the same genus will have a distance of 1. I highly recommend doing this to make sure the output is what you expect.

$ kv_distance.py cheese1 distance_matrix -o ~/Desktop/dm.csv

If ANI is not appropriate, or you have a different way to measure species distance (eg 16S similarity), you can make your own distance_matrix and import it. You have to make sure that the names of columns and rows are identical to the names from the SOURCE line of your genbank file (of if there’s no SOURCE line, the file names). The table should have the structure:

  Awesomeus speciesus strain A Awesomeus_speciesus_strain_B
Awesomeus speciesus strain A 0 0.1
Awesomeus_speciesus_strain_B 0.1 0

Assuming this is saved in a file ~/Desktop/ssu_dm.csv, you can do:

$ kv_distance.py cheese1 distance_matrix -i ~/Desktop/ssu_dm.csv -t ssu

The -t parameter is the “distance type”, which can be anything you want. The ANI script above uses -t ani by default. If you want to get your ssu distance matrix out at a different time, use:

$ kv_distance.py cheese1 distance_matrix -o ~/Desktop/returned_ssu_dm.csv -t ssu

Be sure your distance is actually a distance, and is between 0:1. So if two species have a 16S gene that is 85% identical, the distance should be 0.15.

In the next section, the default is to not use the species distance parameter or more precisely, the minimum distance (min_dist) parameter is set to 0. If you have closely related species, but don’t set this, it will look like you have a lot more HGT than you actually do. In the paper, for ANI we used min_dist = 0.11.

Getting HGT groups

Now the fun stuff!

$ kv_analysis.py cheese1 groups -o ~/Desktop/groups.csv

This could take some time, depending on the number of species and how much HGT there is.

You can also tweak some parameters like minimum size of a protein coding sequence, distance between genes to be considered the same group, minimum identity to be considered HGT, and minimum species distance with various flags. Try using kv_analysis.py -h to see what options you have.

Eventually, there will be more options for analysis, but that’s all for the moment.

Indices and tables