This is information about the actual presentation and some information will be made available
at the links that you see at the bottom.
I will give an introduction to the pathogen detection isolates browser.
Some of you may not be familiar with it. I will give you background of the project. It
started five or six years ago. It came about because the food safety agencies in the U.S.,
the FDA and the CDC, were thinking about switching from pulsed-field gel electrophoresis to whole
genome sequencing. The FDA started that by forming what they call the GenomeTrakr
network, and you see on the left-hand side the number of labs that are part of the GenomeTrakr
network. They wanted to focus on Salmonella, but the CDC in the summer of 2013
said we should focus on Listeria as the pilot project. All Listeria collected in the U.S.
from clinical, food and environmental samples would be sequenced and submitted in
real time. We had a meeting where we agreed to contribute to that project.
The reason to focus on Listeria is it has a very low incidence in the population so
it is a tractable problem and all the isolates could be sequenced in real time and it has
a high morbidity and mortality rate. Sequencing these would have a major benefit to the health
of Americans.
This is a slide from one of our collaborators at FDA showing how the network functions now.
Samples from clinical/human, food, animal and environmental are taken by these agencies
as part of the network. That includes the FDA's GenomeTrakr, CDC's PulseNet, State clinical
health labs, and the USDA Food Safety and Inspection Service, which takes samples as well. They submit
the raw genomic sequence data to NCBI, and these are typically from Illumina instruments,
either 2x250 or 2x150, and they supply minimal metadata. All the data is publicly available.
Of course they have highly protected metadata that they store locally that we do not see,
such as what facility an isolate was taken from and information about the patient.
We built an analysis pipeline that aims to answer two questions at this point in time. First, are
these isolates clonally related? Is there a point source for a food outbreak, for example?
And the second thing, which I will touch on briefly: what is the set of genes that encode antibiotic
resistance in these isolates?
I won't go into a lot of details of the analysis pipeline. We are making changes to it and
we have publications coming out at some point describing this in detail, but the basic idea
is that the data coming from the surveillance network goes into SRA, and we assemble it and do some quality
filtering. We also pull in assemblies from GenBank for the same organism, let's say Salmonella,
and we cluster them together using single linkage clustering. Right now we use a 50
SNP max breakpoint. We will be switching to whole genome multilocus sequence typing in
April. The idea is we want to form clonal clusters of closely related isolates, and we
are not intending to build the full phylogenetic tree of all Salmonella this way. So for Salmonella
we have several thousand clusters, from size 2 through size several thousand. Within each cluster
we do a phylogenetic reconstruction and we make that available and I will show you examples
in a few minutes.
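To make the single linkage idea concrete, here is a minimal Python sketch. It is not the NCBI pipeline; the isolate names and distances are invented for illustration, and only the 50 SNP threshold comes from the talk. It shows that two isolates can end up in the same cluster even when they are more than 50 SNPs apart, as long as a chain of closer isolates links them.

```python
from itertools import combinations

def single_linkage_clusters(distances, isolates, max_snps=50):
    """Group isolates so that any two connected by a chain of pairwise
    distances <= max_snps fall into the same cluster (single linkage)."""
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(isolates, 2):
        d = distances.get(frozenset((a, b)))
        if d is not None and d <= max_snps:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for iso in isolates:
        clusters.setdefault(find(iso), set()).add(iso)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

# Hypothetical isolates and SNP distances, for illustration only.
isolates = ["iso_A", "iso_B", "iso_C", "iso_D"]
distances = {
    frozenset(("iso_A", "iso_B")): 12,
    frozenset(("iso_B", "iso_C")): 40,
    frozenset(("iso_A", "iso_C")): 52,   # above the threshold on its own
    frozenset(("iso_A", "iso_D")): 300,
    frozenset(("iso_B", "iso_D")): 310,
    frozenset(("iso_C", "iso_D")): 290,
}

clusters = single_linkage_clusters(distances, isolates)
print(clusters)
```

Here iso_A and iso_C are 52 SNPs apart but share a cluster through iso_B, while iso_D stays on its own. In the browser a singleton like iso_D would not be reported as a cluster, since clusters start at size 2.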
For the antimicrobial resistance, we first put together a reference database of acquired
antimicrobial resistance genes, and we have created software called AMRFinder to identify those genes;
we are working on a manuscript to describe it. We also attach antibiograms
to BioSample submissions. The antibiograms are tabular formats of the antibiotic susceptibility
tests, either MICs or disc diffusion, and we attach those to the BioSample database if
the submitter is willing to supply the data.
We integrate that all into the isolates browser. So the Pathogen Detection Isolates Browser
is using new technologies at NCBI, including things like Solr, but what that means is that because
we are trying to do this quickly and get it out to our collaborators as soon as possible,
it's not fully integrated into other resources.
So some of you will be familiar with the drop-down database menu option on the left and you won't
see Pathogen there or in the global database list. It is kind of hidden under Health, which
I'm sure you are all familiar with.
This is a screenshot and then after this I will give you a demo. This is the Pathogen
Detection Isolates Browser in Beta which means it's a work in progress. We are continually
evolving the capabilities, and it's not something we would have done in the past, where we would
have spent two years building something and then released it. We released in 2016 and have been
continually adding features to it. That means there is no help documentation yet but there
is a fact sheet that you see in the upper right menu under learn more. There is a PDF
that gives you basics of how to use the resources. It's similar to what I cover in this webinar.
We will work on help documentation in the future. Part of the reason for that is that
the search capabilities are things we are playing with. We want to make sure the search
is as streamlined as possible for people to use and not document it in an incomplete form
now. What that does mean is that the search syntax is not the same that you would typically
see in Entrez databases like Nucleotide or Genome.
So at this point I will give you a demo.
It's available at the base URL/ pathogens. This is the webinar we are giving now. This
gives you a brief description of what is going on. A couple of example searches and then
I will cover this in a minute. And then you have some basic information including the
fact sheet I told you about and I will open this for a second. It covers a couple basic
examples of how to do searches. Information on antimicrobial resistance and some of the
reference database we put together and how to submit data. If you are interested in submitting
data to us you can contact us and follow these links. For exploration of data we have some
options.
We have all the data available on FTP so if you want to do batch downloads you go there
but I won't cover FTP today. And we have Find Isolates that takes you directly to the browser
and we also break it down by organism. Here we have the top four foodborne pathogens
in the U.S.: Salmonella, E. coli and Shigella, Listeria, and Campylobacter. And
we see the total number of isolates. We have over a hundred thousand Salmonella in the system.
And we have the new isolates. If you recall, I said we make single linkage clusters at
50 SNPs. We attempt to do that every 24 hours. Sometimes it doesn't work. You see
that more fully here on the full list. These are all the organisms we are currently clustering.
Many are not foodborne pathogens so I won't cover those in much detail. There are some big examples
you won't find here right now, such as Staphylococcus aureus. We plan to add those later
this year when we make the switch to whole genome MLST clustering.
Let's look at the top row, salmonella. This is the version that was released on March
16. The latest isolate that was included in this release was from March
14. We have a two-day delay and are trying to get that down. Basically it means that from the
previous release which is version 1147 we had 78 new isolates added to the system. The
breakdown of that is 51 clinical isolates and 27 environmental isolates. All these links
go into the pathogen browser. As well as this link here.
The browser is basically unlike most of the databases at NCBI, it presents information
without having to do a search, in sort of a tabular format.
For the isolates browser, every row in the table is an assembled genome, either assembled through
our system of assembly and annotation or one we collected from GenBank. You have a search
box, which I will go into in detail. We have some default organism groups and those match the
ones in the table I showed you. We have columns that include metadata supplied by the submitter.
This particular isolate was collected in New York. Some things that we calculate and I
will cover a couple of those, but there are additional columns and you can choose which
columns you want to show with this tool here.
There are a couple of critical columns I want to cover in more detail and that is the minimum
diff. We make clusters of 50 SNPs and we also categorize isolates by two types, environmental
or clinical.
So this column is the minimum SNP distance from this isolate to one of the opposite type.
The first row is not in a SNP cluster and is not related to anything in our system, so it could
be 51 SNPs away or 500. We don't know at this point in time. We are trying to identify clinical
isolates that may be linked to a point source of a foodborne outbreak. The second row is an isolate also
from New York, and it is 12 SNPs away from something of the opposite type. It's an
environmental isolate, so it is 12 SNPs away from a clinical isolate.
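As a rough illustration of how a minimum-distance-to-opposite-type value could be computed, here is a small Python sketch. The isolate names, types and distances are hypothetical, and the function is an assumption about the logic described in the talk, not the browser's actual code.

```python
def min_diff(isolate, isolate_types, distances):
    """Smallest SNP distance from `isolate` to any isolate of the opposite
    type (clinical vs. environmental). Returns None when no such distance
    is known, e.g. for an isolate that is not in any SNP cluster."""
    own_type = isolate_types[isolate]
    candidates = [
        d
        for pair, d in distances.items()
        if isolate in pair
        for other in (pair - {isolate})
        if isolate_types[other] != own_type
    ]
    return min(candidates) if candidates else None

# Hypothetical data, for illustration only.
isolate_types = {
    "env_NY": "environmental",
    "env_2": "environmental",
    "clin_1": "clinical",
    "clin_2": "clinical",
    "unclustered": "clinical",
}
distances = {
    frozenset(("env_NY", "clin_1")): 12,
    frozenset(("env_NY", "clin_2")): 30,
    frozenset(("env_NY", "env_2")): 3,    # same type, so it is ignored
    frozenset(("clin_1", "clin_2")): 5,   # same type, so it is ignored
}

print(min_diff("env_NY", isolate_types, distances))
print(min_diff("unclustered", isolate_types, distances))
```

Sorting a table on this value brings the isolates closest to something of the opposite type to the top, which is how the column is used in the demo.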
If we scroll down here, in row 16 we see another isolate, another Listeria, that is 33 SNPs away
from something of the opposite type. It's clinical, and that means it's 33 SNPs away
from something that's a food or environmental source. So if you sort on this column you
can identify things that are clonally related. So I will give you an example. I will do a
search for the new isolates.
And if I sort that and focus on Listeria, there are 22 new isolates for Listeria. Here
is an isolate that is 12 SNPs away, and the distances grow from there. These SNP distances might be sufficient
for someone in public-health labs to look at this and determine they do not need to
do further investigation.
If I switch to something like salmonella, now we get down to two SNPs and this is an
isolate from Virginia that is two SNPs away from a clinical isolate. So not only do we give these
SNP distances, but we also provide the SNP phylogenetic tree, and I will open that up.
This is what we call the SNP tree viewer. We have three panels in this view. We have
a navigation panel, and I'll show you how that works in a few minutes. We have a table with
all the metadata, similar to the table on the front page, and then we have the tree view down
here. This is like Google Maps: you can pick it up and drag it and zoom in or out.
You see this isolate is in a SNP tree of 978 salmonella isolates.
This is the section for that. This obviously is a pretty large SNP cluster, and you can see
parts of the tree are very closely related and others are more distant. The interesting
thing is you can make selections both on the table and the tree. From the search I had
identified this isolate as a new isolate so it came in on the 14th. I can actually highlight
another isolate.
What happens when I have more than one isolate selection is I get the SNP distance measurement
up on the navigation panel. You see there are only two isolates, so the minimum, maximum
and average are all the same: 4 SNPs separate these two isolates from each other, from August
last year to March this year. The other interesting thing, and I will go back to the search view,
is that we intersect the searches with the number of isolates.
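The minimum, maximum and average SNP distance shown for a selection can be sketched in a few lines of Python. This is only an illustration under the assumption that the summary is taken over all pairs in the selection; the isolate names and distances are hypothetical.

```python
from itertools import combinations
from statistics import mean

def selection_summary(selected, distances):
    """Summarize pairwise SNP distances among the selected isolates.
    Returns None when fewer than two isolates are selected."""
    pair_distances = [distances[frozenset(pair)] for pair in combinations(selected, 2)]
    if not pair_distances:
        return None
    return {
        "min": min(pair_distances),
        "max": max(pair_distances),
        "average": mean(pair_distances),
    }

# Hypothetical pairwise SNP distances, for illustration only.
distances = {
    frozenset(("iso_1", "iso_2")): 4,
    frozenset(("iso_1", "iso_3")): 10,
    frozenset(("iso_2", "iso_3")): 7,
}

two_selected = selection_summary(["iso_1", "iso_2"], distances)
three_selected = selection_summary(["iso_1", "iso_2", "iso_3"], distances)
print(two_selected)
print(three_selected)
```

With only two isolates selected there is a single pair, so all three numbers coincide, which matches what the navigation panel showed in the demo.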
Here we see when I search for the new isolates, five isolates are in the tree that I was just
looking at. Now this navigation panel becomes useful because not only does it tell you SNP
distances and the breakdown, you can see these are environmental isolates from Maryland,
Virginia and New York and a clinical isolate. When you click on this, it jumps immediately
to where that isolate is on the tree.
We built this part because these large trees were increasingly much more difficult to drag
around like in Google maps. You see how long it takes to get to the next isolate. Whereas
this, you just zoom around landmarks in a map.
We also have a filter, and this only filters the metadata that's here, so I could
say I only want the ones from Maryland and those come to the top, and you can clear that
filter.
We also have a search box. I can do a search for chicken. Now it will highlight everything
in the tree that is from chicken. I can add that to the selection set. Now you can see
almost everything in this tree comes from chicken. That's a reminder to properly cook
the chicken you are eating. Please.
We also get this breakdown by year. This navigation panel is something we are evolving now and
it will probably change in a few months or sometime in the summer. This search box allows
you to both highlight items in the tree and also to add to the selection set on the left-hand
panel. It will also do subtractions. I see things from Maryland, and I think this will
work. I will do a subtraction, and the Maryland isolates disappear. I think there was
only one, so the isolates drop from 539 to 538.
You can clear this and make a selection directly on the tree itself for multiple isolates by
selecting all leaves. So here I have highlighted a small set of isolates, looks like they are
all from the Boston or the U.S. from two different years and I can make a subtree.
If I click on this button, now I just have the isolates that I selected by this action
node. And you can collapse it as well and expand it if you want. This allows you to
do an export. We have a warning if you try to export large trees, but exporting small
trees is easy and you can dump it in a PNG file. That's not a very good viewer but you
get the idea. You can export in PDF and as a Newick tree
if you want to put it in your own tree software to do additional work with that. You can share
this view. If you hit this button, you now have a box that you click on and you can copy
and paste that into another viewer or send it by email if you find something interesting
to send to colleagues.
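If you export a subtree as Newick and just want the list of isolates in it, something like the following Python sketch would do for simple cases. It assumes unquoted labels without commas, colons or parentheses, and the tree string and labels here are invented; for real work a proper tree library is the safer choice.

```python
import re

def newick_leaf_labels(newick):
    """Return leaf labels from a simple, unquoted Newick string.
    A leaf label is a name that directly follows '(' or ','; labels on
    internal nodes follow ')' and are therefore skipped."""
    matches = re.findall(r"[(,]\s*([^():,;\s][^():,;]*)", newick)
    return [label.strip() for label in matches]

# Hypothetical exported subtree, for illustration only.
example_tree = "((isolate_one:2,isolate_two:2)inner:1,isolate_three:4);"

leaves = newick_leaf_labels(example_tree)
print(leaves)
```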
You can just use the share button. When you are in the subtree view it tells you you're
in the subtree view up here and you can go back to the full tree here. What it does is
causes a collapsing view. This is a feature we were testing and it collapses nodes not
selected and when you highlight those nodes you can see the breakdown. You see there are
224 isolates in the subtree, a mix of various sorts of clinical and environmental.
This is something still being worked on and not fully worked out yet. If you want to
go back to the full tree view, just hit the whole tree view button and you get back to the original
view we came in on.
For this isolate, you can hit the information button and you get additional information
on metadata, but that's similar to what you see when you hover over the tree. Some people
might find this annoying, so you can turn this off. Then that does not happen. You can do
things like control the spacing of the tree, so you can separate it for better visibility.
You can collapse that if you want. There are a few other options. You can expand and contract
the tree branches. This is the SNP distances, to make it more
spread out if you want.
I should also point out the selections allow you to control what is selected in the table
view. You can see we selected seven isolates out of the many we have, and you can download
this. There is a download button here that dumps a TSV file that is just the rows for the
isolates selected. You can choose what columns go in the table, similar to the one on the
main page, and the selection controls which isolates are selected for that tabular download.
You can make selections on the tree, in the table to control the download for whatever
selection of isolates you are interested in.
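Once you have a TSV download, it is easy to work with in a script. The following is a minimal Python sketch; the column names and rows are hypothetical stand-ins, since the actual columns depend on what you chose to show in the table.

```python
import csv
import io
from collections import Counter

# Hypothetical TSV content standing in for a downloaded file.
example_tsv = (
    "isolate_id\tlocation\tisolation_type\n"
    "example_1\tUSA:NY\tenvironmental\n"
    "example_2\tUSA:VA\tclinical\n"
    "example_3\tUSA:MD\tenvironmental\n"
)

def count_by_column(tsv_text, column):
    """Count how many rows carry each value of the given column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)

counts = count_by_column(example_tsv, "isolation_type")
print(counts)
```

For a real file you would pass the text read from disk instead of the inline string.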
The only other thing I want to touch on is the antimicrobial resistance. Let's look at
something that is highly antibiotic resistant, Klebsiella. And we make two columns available.
One is the resistance phenotypes as supplied by the submitter, and the genotypes. And you
can highlight that in the filter tab. There are a number of filters and two of them are
the phenotypes and genotypes, so I will select the phenotypes. There are 516 of the total
Klebsiella that have submitter-supplied phenotypic data. Now you see this column has information
and you can expand that and you can see the breakdown. Basically in this panel we put
resistance calls, the SIR calls, from whatever interpretation criteria we used. You see that
colistin doesn't have interpretation criteria so it's in the category of other. We also
put the genotype information.
First I want to cover the AST briefly and this is what is stored in the BioSample database.
Now you have this tabular format and it gives you the actual MICs, the interpretations and
you see that many are done by the CLSI standard, actual measurements and some information is
optional and some not supplied. You see things like colistin is not defined because it's
missing and there is no CLSI standard at this point in time.
By adding it to this you can do some interesting searches. I won't cover those in much detail.
I will go back to the main page and see that we have a search for isolates encoding a mobile
colistin resistance gene and a KPC beta-lactamase.
If I click on this I get six isolates that have both mcr and a blaKPC allele. They
have Assembly links, so they have all been submitted to GenBank and not assembled by
our pipeline. You see they are from Brazil, Portugal and Italy. You see the list of genes
here: we have KPC-3 and mcr-1.1, KPC-2 and mcr-1, etc.
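The logic of that search, isolates carrying both an mcr gene and a KPC allele, can be sketched in Python over a downloaded table. This is not the browser's search syntax; the record names and gene lists are hypothetical, and matching on a name prefix is a simplifying assumption.

```python
def has_gene_family(genes, prefix):
    """True if any gene symbol starts with the prefix (case-insensitive)."""
    return any(g.lower().startswith(prefix.lower()) for g in genes)

def isolates_with_all(records, prefixes):
    """Names of isolates whose gene list covers every requested prefix."""
    return [
        name
        for name, genes in records.items()
        if all(has_gene_family(genes, p) for p in prefixes)
    ]

# Hypothetical isolates and AMR genotypes, for illustration only.
records = {
    "example_1": ["blaKPC-3", "mcr-1.1", "aph(3'')-Ib"],
    "example_2": ["blaKPC-2"],
    "example_3": ["mcr-1"],
    "example_4": ["blaKPC-2", "mcr-1"],
}

hits = isolates_with_all(records, ["blaKPC", "mcr"])
print(hits)
```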
You can do searches and there are examples in the fact sheet I pointed out. For
those of you who know things about aminoglycoside-modifying enzymes, you see that the encoding
of that genetic information is troublesome to many computer search programs, so we are
working to make the searches for things like aph(3'')-Ib easier so you could just
plunk it into the search box and do the search. It doesn't happen now. It's a new feature
we will add in the summer.
You may ask, "I'm not a public health lab, so why is this useful for me?" Not only can
you get antibiotic resistance or sets of isolates with antibiotic resistance, but we are planning
to expand the sorts of genes we make available in the system. That includes virulence
genes, metal resistance and biocide resistance. We are expanding this tool to incorporate
those other genes of interest to allow you to subset and select from the full data set.
Right now we have over 107,000 Salmonella, and we hope this interface will allow you
to more effectively subset the data rather than doing, let's say, a BLAST search with a
gene of interest. You can imagine searching across all 107,000 would be time-consuming.
So we hope these interfaces make large-scale data more effective to use at NCBI in general.
I think I will stop there and mention these are the people that worked on the project.
You see this email address highlighted in yellow,
pd-help@ncbi.nlm.nih.gov. If you have questions after the seminar,
this address is linked on our Pathogen page, so just send us an email and we will answer
your questions. I think at this point in time I will take any questions, Peter.
[Peter] I'm afraid I lost power at the beginning of the webinar so I cannot participate in
the question part on the phone.
[Bill] Sure. For those of you connecting remotely, Washington DC is in a massive snowstorm so
many are losing power. I will go through a couple of questions. What wgMLST scheme
is used? It was developed in-house at NCBI. We will be making them available at some point.
Basically we developed them for the four foodborne pathogens, for TB and I think for C. difficile,
and we have been trying to coordinate with CDC on some schemes. We have an upcoming
meeting after the ASM general meeting in Atlanta in June where we are supposed to talk about
that as well.
Another question, urinary tract infection or foodborne disease?
Organisms that cause UTIs, like E. coli, are often found as foodborne disease as well. So we
do see a mix of both of those in our system. It's not strictly related to just foodborne.
S. Typhi is an obligate human pathogen not typically associated with food, but we also
see those, so we basically pull in all the isolates under a particular species from our
surveillance network plus GenBank.
Is there any correlation between SNP clustering and classic ST/CC?
So the question is, are there correlations between SNP clusters and classic sequence types
and clonal complexes? I think the answer is they will highly correlate. Simply based
on the seven or however many gene sequence types that were classically defined before, you
would see a high correlation, but you can't guarantee that, because of course a small change
in a gene would give you a different allele and possibly a different sequence type and a different
clonal complex, even though it might be part of the same SNP cluster.
So it is not guaranteed, and it is something we have thought about adding as a feature: doing
the classical sequence typing and adding that as information so you can see ST258 isolates
for example.
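To see why a sequence type can differ inside one SNP cluster, here is a toy Python sketch of classical typing as a lookup on an allele profile. The seven-locus profiles, allele numbers and ST labels are all invented for illustration and do not come from any real scheme.

```python
# Hypothetical seven-locus scheme: allele profile -> sequence type label.
st_table = {
    (1, 4, 2, 7, 3, 1, 5): "ST-example-1",
    (1, 4, 2, 7, 3, 9, 5): "ST-example-2",
}

def sequence_type(profile, table):
    """Look up the sequence type for an allele profile."""
    return table.get(tuple(profile), "unassigned")

# Two isolates that differ at a single locus, possibly by a single SNP.
isolate_x = [1, 4, 2, 7, 3, 1, 5]
isolate_y = [1, 4, 2, 7, 3, 9, 5]

loci_differing = sum(a != b for a, b in zip(isolate_x, isolate_y))
print(loci_differing, sequence_type(isolate_x, st_table), sequence_type(isolate_y, st_table))
```

One changed allele is enough to give a different sequence type, even though by SNP distance the two isolates could easily sit in the same cluster.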
Another question, is the number of SNP differences between two isolates the absolute number without
filters, the SNPs that passed filtering, or the number of compatible SNPs from the Compat
program? The SNP distances now are from the Compat program. I didn't go into details on
this, but one of our colleagues at NCBI improved a method that is 30 or 40 years old called
maximum compatibility for the phylogenetic reconstruction. It's now published and that
software is available. It is very useful for highly clonal isolates, which is what our system
was developed for. It's not very useful for highly divergent isolates. The maximum compatibility
system looks for columns that are compatible with the phylogenetic reconstruction, which
means sometimes it throws away some columns. Why would you want to do that? We found sometimes
when we look at GenBank genomes they have incorrect SNPs with respect to the tree and
we suspect many times it's because of assembly problems. So the system helps filter out some
of the bad data.
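The notion of compatible columns can be illustrated with the classic pairwise test for two-state characters: two columns can sit on the same tree without extra changes only if the four combinations 00, 01, 10 and 11 do not all occur. The Python sketch below shows just that pairwise test on made-up columns; it is not the Compat program and does not perform the full maximum compatibility search.

```python
from itertools import combinations

def columns_compatible(col_a, col_b):
    """Pairwise compatibility of two binary characters: compatible
    unless all four state combinations appear across the isolates."""
    return len(set(zip(col_a, col_b))) < 4

# Hypothetical SNP columns; each character is one isolate's state (0 or 1).
columns = {
    "site_1": "00011",
    "site_2": "00111",
    "site_3": "01010",
}

# Count, for each column, how many other columns it conflicts with.
conflicts = {name: 0 for name in columns}
for a, b in combinations(columns, 2):
    if not columns_compatible(columns[a], columns[b]):
        conflicts[a] += 1
        conflicts[b] += 1

print(conflicts)
```

In this toy example site_3 conflicts with both other sites, so it is the kind of column a compatibility-based method would be inclined to drop.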
Next question, how do we decide if it's a new isolate? We do the SNP clustering every 24
hours if new data is submitted. So New is a recency check on whether something has
been submitted since the last time we did the calculation.
Can we use the pipeline to analyze our own gene sets? I didn't touch on this, but we built
the system for public health with the idea that colleagues would submit data
to us and make it publicly available. That is something we are pushing: that people who
want to integrate their isolates into the system, or are publishing them as part of papers
or research or surveillance networks, make the data publicly available. Right now the
pipeline is not available for download. We will make certain parts of the pipeline available.
I didn't touch on this but we are making a new assembler available. I think the paper
will be submitted by the end of the month. We will put links on the main page at some
point showcasing some features that will be made available.
Is there a publication related to this? As I said, we will describe that at some point in
the future.
How can I add this as a project to my undergraduate students?
I'm not sure if you are saying how you can get your undergraduate students to use this
project? I'd be happy to touch base with you after this.
Is this connected to PATRIC? PATRIC is a NIAID-funded system that is part of the Bioinformatics
Resource Centers. We are not directly connected to PATRIC, but we certainly coordinate with
them on things like antibiotic resistance.
Would it be possible to add a data column with classic MLST sequence typing? Yes, that's
something we are thinking of doing in the future.
What are the future plans of NCBI for GenBank for such disease-causing pathogens? Besides
providing a system like this that allows you to easily interrogate for interesting features
it's something we need to talk about because we have 100,000+ salmonella and that will
quickly be 1 million salmonella in a few years so we need to think how to deliver effectively
to researchers when we have such large volumes of data. We cannot expect someone to download
and do this analysis themselves. We are interested in hearing from people on use cases for what
they would use this data for or things their research is interested in investigating across
such large volumes.
Are there better ways of submitting data on a weekly basis other than the SRA wizard? SRA
has multiple ways to submit. We have the web-based wizard in the submission portal, but there is
also an XML-based submission format that they've developed, and that is
completely automated. So I urge you to contact SRA about that. I'm not an expert
on those. If you are interested in submitting data to us, please contact them first.
How can I see Salmonella serotype on the tree view? We make the serotype, serovar, available as
a field, but that is based on what the submitter sent to us. As you can imagine, well, I will
give you an example. Let's get rid of the search and switch to Salmonella.
Here is the serovar here, and you see it's Saintpaul, Bareilly, etc. If you're talking
about adding the label of serovar onto the tree, that is something not yet built into
the system. I want to point out that we find very often that this serotype is incorrectly
assigned, and we are thinking of adding an additional in silico calculation of the serotype based
on available tools. That will be later this year.
That's a follow-up on the SRA. If you have problems with SRA submissions, Peter, I think
they can contact the helpdesk and you can help guide them through that stuff. I don't
work for SRA.
Yes, I'm here. If you are having trouble with submissions, write to the info address, which
is info at ncbi dot nlm dot nih dot gov. We will get that to someone who can help you
with that. And Bill we are out of time so we need to wrap this up.
If there are any other questions we will endeavor to answer the remaining questions in writing
and I will send that to everyone when the recording is available on YouTube and that
is written up.
I think there is just one more question.
What's the difference and the relationship between the following resources, the national sequence
database of resistant pathogens.
Basically the first two are integrated into this system. Rather than saying it's
just a national database of antibiotic-resistant pathogens, we say that we are making a database
of pathogens and reporting on the antibiotic resistance genotypes found within those pathogens.
The third one is the resistance gene database, and those are the reference set
of genes and alleles we use to make the calls of the genotypes. I didn't have time to go
into details; I think we can have a separate webinar on antibiotic resistance. I suggest
anyone who has questions send us an email. And the last question was on submission, and again
you can always write to the submit-help or the info address for help on submissions.
So I'd like to thank everyone for their time. I will stop the recording now and you can
always send us emails and we will follow up with you individually.
Thank you, Bill.