Good afternoon everyone and welcome to the webinar about API keys for getting faster
access to NCBI data.
My name is Eric Sayers and I'll be taking you through a brief overview regarding keys.
These materials, the PowerPoint and the PDF version of the PowerPoint, will be available
on our FTP site.
There is a link there at the bottom of the slide that will take you there.
https://go.usa.gov/xnj9m.
Peter Cooper, who is my colleague assisting with the webinar, will be sending that out
to everyone.
We will go ahead and get started.
We would like to talk to you today about something new with the E-utilities API.
The API keys.
We are doing this to allow a variety of things to help with the API in general.
One of the things that we hope it will do for you is give you more stable and faster
access to the API in general.
So a quick review of what we will talk about regarding what's happening and what we're
doing.
I will quickly review what the E-utilities are just to make sure we're on the same page.
These are the APIs that will be affected by this.
What is a key and what is it for?
Who needs one?
Do you need to get one or not?
If you do get one, what benefits do you get?
If you need a key and want those benefits, how do you get one?
And then once you have it, how do you use it and when do you need to have it?
So I'm going to try and answer those questions and go through that today.
What we're doing is introducing a key for the E-Utilities and after May 1, 2018, the
day to remember, after May 1 of next year we will begin rate limiting every IP to three
requests per second.
So if you are not using the API up to that limit, this may not apply to you.
If you are using the API in a way that makes any series of requests at more than three per second,
this does apply to you.
After May 1, any IP that exceeds that rate limit will start receiving error messages.
Getting an API key will grant a higher rate limit.
So 10 requests per second by default.
Even higher rates are possible and are available by request.
You can contact us and we can negotiate with you.
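As a rough illustration of the limits just described, a client can pace itself so that it stays under three requests per second without a key, or ten with one. This is only a sketch: the class and function names are invented here, and it spaces out the moments when requests are sent, whereas the servers count requests as they arrive.

```python
import time


class RequestThrottle:
    """Spaces out calls so that at most `max_per_second` are started each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request may be sent, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def throttle_for(api_key=None):
    # Default rates from the talk: 3 per second without a key, 10 with one.
    return RequestThrottle(10 if api_key else 3)
```

Calling `wait()` immediately before each request keeps a single process under the chosen rate. Several processes sharing one IP or one key would each need a share of that rate.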
One of the big reasons we are doing this is to really ensure the availability and sustainability
of the APIs over time so we can continue to provide a service that is responsive 24/7
as much as we can.
This also protects these APIs from various kinds of abuse, intentional or otherwise and
various kinds of attacks.
Those kinds of attacks primarily hurt everyone else.
They prevent other users from having the access that we would expect them to be able to have.
So having these keys protects you and NCBI as well.
It should allow more stable and sustainable access to the APIs over time.
So the E-utilities are a set of server-side CGIs.
You're probably familiar with them if you are attending the webinar.
This does include esearch, esummary, efetch, elink, epost.
There's a few others, egquery that some people use, espell that a few people use for spelling
corrections.
But these are the utilities we are talking about.
These API keys are not going to affect BLAST or any other APIs at the present time.
We're only rolling this out initially for these E-utilities.
When we talk about three requests per second, we mean any combination of these APIs.
It is not 3 for search and 3 for summary, it is 3 for everything coming from a single IP.
So what is a key?
It's a unique string about 30 characters or so long.
It will identify you or your process to our servers.
And that API key must be included in every request that is made.
So regardless of the kinds of requests you need to have the key in every URL or any other
way you're posting the request to us.
That NCBI API key is attached to an NCBI account. So for those of you who are familiar, and
we'll go over this in a minute, for those of you familiar with My NCBI, these are NCBI
accounts. The same type of account structures.
If you already have a login at NCBI, whether you're a submitter or use My NCBI or SciENcv
or My Bibliography or any of those tools, you already have an NCBI account and you
can use that NCBI account to generate a key and have that key attached to your account.
If you don't have a NCBI account, it's very easy to get one.
That's how it works.
So the key is a string and the string is passed to NCBI through your requests and it is attached
to your NCBI account.
So who needs one?
If you're going to be posting more than three requests per second, you need one.
If you're not going to be doing that, you don't need one.
There is no penalty.
But after May 1, if for whatever reason the process you're doing does go over the three
requests per second limit, you will get hit with an error.
We'd love you to get a key anyway, because it protects you and it allows you to have
better access, but you don't need one to access the E-utilities at a very casual level.
Again the primary benefit for you, from the key is that you have faster requests and more
access.
You get 10 requests per second if you have the key.
If you want more or need more, then let us know and we can work with you.
And again, this is per IP.
If there are developers out there thinking about how they want to engineer this, this
is per IP.
Not for your institution or software package.
The IP that NCBI will observe. So that's what you need to be thinking about.
You get a key by going to your NCBI account.
At the top, in the top header of any NCBI page, there is a sign in to NCBI on the right. That
is something you can go to and create an account.
If you do not have a NCBI account, go to the top of the page and click that and you can
create an account.
If you already have an account, log in.
Once you are logged in on your settings page, there will be a region called API key management.
It will appear to you as you see on the screen.
Create an API key.
You click that button and you will get a key.
It is as simple as that.
You just copy the key, copy the string and start putting it into your requests.
So here is what it looks like.
Let's say I have a simple einfo request up at the top and I want to get statistics about
the protein database.
So I do einfo.fcgi?db=protein&api_key=ABCD1234.
And here is the new thing, API underscore key is the parameter.
So you need to include that parameter and assign the value of that parameter to be your
actual key that you got from the NCBI account page.
So let's say my key is ABCD1234.
If I was going to do another search, say in PubMed, I would put the key there also.
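The einfo request above can be sketched in code as follows. The base URL is the standard public E-utilities address, which the talk does not spell out, and the helper name `eutils_url` is invented for this illustration. ABCD1234 is the placeholder key from the slide, not a real key.

```python
from urllib.parse import urlencode

# Standard E-utilities base URL (not stated in the talk itself).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, api_key=None, **params):
    """Build a request URL, appending api_key only when a key is supplied."""
    query = dict(params)
    if api_key:
        query["api_key"] = api_key
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(query)}"


# The einfo example from the slide, then a PubMed search using the same key.
print(eutils_url("einfo", api_key="ABCD1234", db="protein"))
print(eutils_url("esearch", api_key="ABCD1234", db="pubmed", term="asthma"))
```

The same key goes into every URL, whichever utility is being called.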
For those of you who are familiar with EDirect, and EDirect is a Unix-based or Linux-based
interface to the E-utilities, and if you don't know about EDirect, it's beyond the scope
of this webinar, we have other webinars about EDirect, and chapter 6 of our documentation
for the E-utilities is all about EDirect.
It's really great if you are using this from the Unix environment.
What you can do is simply set an environmental variable.
For those of you who are in the Linux world and know what I'm talking about, just setenv
NCBI_API_KEY ABCD1234.
EDirect will include the key everywhere.
It is very straightforward.
Just set the environmental variable to your key and then you are done as long as you're
working in that same environment.
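A script of your own could follow the same convention and pick the key up from the NCBI_API_KEY environment variable. The sketch below is an assumption about how one might do that, not EDirect's actual code, and the function names are invented. Note that setenv is csh/tcsh syntax; in bash the equivalent is export NCBI_API_KEY=ABCD1234.

```python
import os


def key_from_environment(environ=os.environ):
    """Return the key stored in NCBI_API_KEY, or None if it is unset or blank."""
    key = environ.get("NCBI_API_KEY", "").strip()
    return key or None


def with_api_key(params, environ=os.environ):
    """Copy the request parameters, adding api_key when the environment has one."""
    merged = dict(params)
    key = key_from_environment(environ)
    if key:
        merged.setdefault("api_key", key)
    return merged
```

Because the key lives in the environment rather than in the script, replacing it later only means changing the variable.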
Let's say at one point you want to have another key.
Maybe your key got stolen.
Maybe you forgot it.
You want another key.
You can replace your key at any time.
Be aware that you can only have one key per NCBI account.
If you replace the key, the old one is immediately dead.
So if you have a software package using the old key, you want to be careful, it will die immediately.
You will get a new key and you can replace it as many times as you want.
That is fine.
Keep in mind that you only have one key per NCBI account. Once you replace the key, it
is gone and it will not function.
When do you need to do this?
You can do it today.
It's available now.
Go to your NCBI page and create an account or log in to your NCBI account, create a key
and start using it.
It won't do anything for you until May 1.
It's just a dummy variable right now.
We would encourage anyone out there to start doing that.
Get the key and start putting it into your code so that you're ready.
May 1 is when you will start getting error messages if you do not have a key and
you're going above the three requests per second limit.
If you have questions about software, we would be happy to talk to you.
Our simple recommendation is to allow your users to input their own key as a setting.
Their IP is what NCBI will see and that is what will be the basis of them getting error
messages if they exceed the rate.
So if you know your software is likely to produce rates higher than three per second,
each person using that software on their own IP address will need to have a key.
The software will need a way of getting that key and supplying it to us through the requests.
So that's basically it.
So we've got our blog which we have announced this on.
We have been announcing it on our social media sites.
As we go forward, if there are any changes or updates, if we learn more about this and
have some interesting tricks and tips to provide we will certainly do that.
We would like to do that on those social media platforms.
We may have some other additional webinars on this as time goes on.
If we discover things the community would like to know more about, we could very well
be doing more webinars.
If you would like updates, please subscribe to the list on the screen, utilities-announce@ncbi.nlm.nih.gov.
That is a great way to keep abreast of these changes as they happen.
And we would love to have input from you.
You can write to info@ncbi.nlm.nih.gov.
So that's basically it.
I will open this up for questions.
This is Peter Cooper.
There's a lot of questions so far.
They circle around the topic of IP addresses.
I'm a little confused about this myself, Eric.
It seems the API key is assigned by the NCBI account.
But you are saying we're checking the IP address for the rate.
A lot of people have asked, our company presents itself as a single IP but multiple individuals
behind that.
Presumably they would have their own NCBI account. How is that going to work for the
rates?
That is a good question.
It is something that we may continue to come back to you about.
Our understanding right now is that you are going to get an API key that is assigned to
a particular NCBI account. And the rate is set based on that.
What I will do, because this question has come up before about the relationship between
the IP and the NCBI account, I will send out an answer to that question in more detail
once we are done with the webinar.
I think that's the best thing for me to do with that.
Another question that's an interesting one, when we are talking about requests, my understanding
of that is that one request is basically one URL submitted to the server.
The question is, suppose I submit 10 requests per second but it takes a while for the responses
to come back.
There's going to be some kind of delay.
10 URLs per second is what we are throttling on.
It's not based on how long you're waiting for the results.
So, it is based on the number of requests that our servers see.
So let's say that you have something like efetch requests that are taking one second
apiece to come back.
If you were sending those sequentially, we're not going to block because we're not seeing
those at three per second.
When we receive the URL is what we're talking about.
If you are posting those requests non-sequentially, so you are not waiting on the return before you
post the next request, then you could get over that three requests limit pretty easily.
It's just the amount of frequency we see when the request comes to our servers.
So that's basically what you want to think about.
So if you are hitting the servers with more than three requests per second, it doesn't matter
how long they take to return.
But if you're waiting, we are not going to see those requests until the prior one is
complete.
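The difference between sequential and non-sequential posting can be checked against a list of send times. The sketch below counts requests in a sliding one-second window. That window is an assumption made for illustration, since the talk does not say exactly how the servers measure the rate, and `exceeds_rate` is a made-up helper.

```python
from collections import deque


def exceeds_rate(send_times, limit):
    """Return True if more than `limit` requests fall inside any one-second window."""
    window = deque()
    for t in sorted(send_times):
        window.append(t)
        # Drop requests that are a full second or more older than this one.
        while t - window[0] >= 1.0:
            window.popleft()
        if len(window) > limit:
            return True
    return False


# Ten efetch calls sent one at a time, each waiting on a one-second response.
sequential = [i * 1.0 for i in range(10)]
# The same ten calls fired 50 ms apart without waiting for any response.
burst = [i * 0.05 for i in range(10)]

print(exceeds_rate(sequential, 3))  # False
print(exceeds_rate(burst, 3))       # True
```

How long each response takes never enters the calculation; only the send times do.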
Two questions that are variations on this idea.
Can I submit requests from multiple IP addresses with the same key?
That is related to another question: I use my laptop at work and I would have one IP
address but if I go home, it's a different IP address from the same machine and it would
be the same NCBI account.
Will that work?
That is my understanding.
That's why I want to get back about that earlier question, so I make sure I'm telling everybody
the right thing.
The key itself, there is only one key attached to the one NCBI account.
So you can have any number of requests, but the way I understand things, we will be looking
at that particular key.
So if you have more than 10 requests per second with that key, there will be a problem, unless
we negotiate a higher rate with you.
Another point, the key itself is associated with your NCBI account, so you don't need
any additional information to identify yourself.
A couple of people have asked this question.
If you do exceed the rate, what will that error message look like?
Do we have an example of that?
We do have an example of that in the documentation.
Let me get out of the PowerPoint and pull that up for you.
So the links in the social media are taking you to a page that looks like this.
We have an announcement here with a link to this chapter about the keys.
There is a paragraph here with an example right there, if that's visible to people.
The error will essentially be something that simply says, API rate limit exceeded, and
with the count based on the number of requests.
That's about it.
You're going to get a return, you are going to get a formatted return, but it will simply
contain that error text.
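A client might watch for that error text in each response, along these lines. The exact layout of the formatted return is not given in the talk, so this sketch searches for the phrase and only treats the body as JSON with a "count" field when it happens to parse that way. The sample body in the usage lines is hypothetical.

```python
import json

RATE_LIMIT_TEXT = "API rate limit exceeded"


def rate_limit_info(body):
    """Return None for a normal response, or a dict describing the rate-limit error."""
    if RATE_LIMIT_TEXT.lower() not in body.lower():
        return None
    count = None
    try:
        parsed = json.loads(body)
        if isinstance(parsed, dict):
            count = parsed.get("count")  # assumed field name, may be absent
    except ValueError:
        pass  # not JSON; the error text alone is enough to flag the response
    return {"error": RATE_LIMIT_TEXT, "count": count}


# Hypothetical error body, shaped after the description in the talk.
sample = '{"error": "API rate limit exceeded", "count": "11"}'
print(rate_limit_info(sample))
print(rate_limit_info("<eInfoResult></eInfoResult>"))
```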
Another question, we're presuming that people are submitting URLs which I think uses GET.
Is there any difference if someone uses HTTP POST?
A request is a request.
So if you send in a GET request or a POST request it counts as one request.
So again, you can certainly submit a lot more information to us if you use POST, that is
fine.
That all counts as one request.
So one HTTP request.
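To make the GET and POST point concrete, here is a sketch that prepares the same parameters either way with Python's standard library. Nothing is sent; the function only builds the request object. The helper name and the epost example values are illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params, use_post=False):
    """Prepare one HTTP request; the key travels with it whether GET or POST."""
    encoded = urlencode(params)
    if use_post:
        # POST carries the parameters, including api_key, in the request body.
        return Request(url, data=encoded.encode("ascii"), method="POST")
    # GET carries them in the query string.
    return Request(f"{url}?{encoded}", method="GET")


url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"
params = {"db": "pubmed", "id": "1,2,3", "api_key": "ABCD1234"}

as_get = build_request(url, params)
as_post = build_request(url, params, use_post=True)
print(as_get.get_method(), as_get.full_url)
print(as_post.get_method(), as_post.data)
```

Either object represents exactly one request against the rate limit, however many IDs it carries.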
Another complicated question, it's an important one.
It has to do with distributed software.
If an organization licenses software where multiple users will be calling E-utilities,
should the key fundamentally be tied to the organization or to each user at the organization.
What our thoughts are right now, we want to start engaging with those of you who are in
that position, we would advise that the keys be set up for each individual customer.
So that the customers would have an account with a key and they would supply that key
to the software in some kind of setting.
The software would be able for that individual customer to include the key in the request.
That allows that individual account not to exceed the rates that might be a summation
of all the other activity going on in the software package.
That is currently what we are recommending.
The basic idea is that we want the individual customer to only be penalized if they exceed
certain rates.
And so really, it will be very case dependent depending on the type of software that you
are creating and trying to support as to how many requests will be produced, and what is
that request stream going to look like for a particular user.
If you are making software and you can confidently say an individual user is never, or
at least almost never, going to send more than three requests per second, maybe an app that
does searching.
Someone has to type in a query, hit go and it does an ESearch API call and retrieves
the information, I don't want to say ever because all kinds of things could happen,
but it would be unusual for a person to do more than three of those per second.
So if it's that case, maybe your customers do not need the keys.
If the software is designed to do a lot of things behind the scenes and call multiple
utilities, then maybe so.
If your software is something where you, the data provider, the software provider, access
NCBI data independently or asynchronously from the customer's request, so every once
in a while or every day you come to NCBI and update your own internal databases with NCBI
data, and you're using the E-utilities to get those updates, the customer is never involved
in the interaction directly.
The customer will not need a key.
You will.
You are doing the direct requests.
You can work with us to get the rates that you need to do the data transfers that you
want.
So it depends on how the user is interacting with the software and how the software is
creating E-utility requests.
I do not know if that helps, but there are a number of different scenarios.
We would like to hear from people if you haven't thought about a scenario or you have questions
on other scenarios.
A couple of things that are interesting questions.
Previously email and tool parameters were to be specified.
Am I right in thinking we don't need those anymore?
What I would say about that is, I think we would still encourage people to include them.
It is particularly helpful for software providers when you are making a tool because the name
of that tool will then be something that we can see.
That has been very helpful in the past when we work with vendors that are having particular
issues or we need to answer questions about abusive activity that somehow seems to be
involved with their product and then they can find out what the problem is and better
diagnose that.
If we don't have the value of the tool, and we only have the key, the key is not going
to tell us what the tool is.
The key will just tell us who the user was.
And that may not tell us very much information.
I would strongly recommend that software vendors continue to provide the tool value.
The email address is also helpful particularly in the case where you are distributing software
and the key is not going to be your key, but the customer's key.
We might need to contact you to find out what some problem is to help you diagnose something.
We are not going to have your contact information if the email field is not complete.
So while our policy would no longer insist on the tool and email values as the way we
register users, because the key will be doing that function of registration and that kind of
thing,
it is still very valuable to us and I think to you that we have the tool and email values
so we can continue to understand those activity patterns and contact you when needed.
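For distributed software, one way to picture this is a request that carries the customer's key together with the vendor's tool and email values. The parameter names api_key, tool and email come from the discussion; the helper name and the sample values are hypothetical.

```python
def request_params(base_params, api_key=None, tool=None, email=None):
    """Add the customer's key and the vendor's tool and email to a request."""
    params = dict(base_params)
    for name, value in (("api_key", api_key), ("tool", tool), ("email", email)):
        if value:  # leave out anything that was not supplied
            params[name] = value
    return params


# Hypothetical values: the key belongs to the customer, tool and email to the vendor.
print(request_params(
    {"db": "pubmed", "term": "asthma"},
    api_key="ABCD1234",
    tool="example_search_app",
    email="support@example.org",
))
```

The key identifies the user, while tool and email tell NCBI which product made the call and whom to contact about it.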
Obviously this is an interesting topic and we have a lot of questions.
Another question that I don't know the answer to.
It has to do with when a key is part of the URL it is not encrypted.
Do we have concerns that these are being transmitted as plaintext?
I don't know if HTTPS has any impact on that.
Would you like to talk about that?
So all of the -- good question and something we can follow up with.
All of our traffic is now over HTTPS.
From that point of view there is that level of security in the transactions.
Now, what I would not be able -- so I would think, once you have actually submitted the
requests, the likelihood of someone sniffing the key during the request itself is hopefully
prevented by HTTPS.
But, there are other ways in which the keys will be funneled into those requests and maybe
those processes are not as secure.
So how does the user input the key into the software?
Where is that stored?
Is the key stored in a config file or something not encrypted?
A number of questions.
I think the only thing I can confidently say now is that during the transmission of the
request, the key should be encrypted by HTTPS.
But before that and how an external piece of software handles the key, that would be
a separate question.
You would need to think about that as an independent problem.
That is my first pass answer at that.
I can confirm a bit about the HTTPS and reply back with our question and answers.
This will be the last question because we're out of time.
If you have additional questions, send them directly to me at peter.cooper@nih.gov.
Or to Eric, I think.
Let me just put it up here.
And this is the last question.
Is it acceptable for us to intentionally use error messages that we have exceeded the rate
as our throttle?
I'm thinking of a situation where it's difficult for me to coordinate with others at my IP
to collectively set our rate.
We might be able to address this in a more constructive way when we think about the difference
between IPs versus the individual accounts.
If the question is that you're unsure about what your rates are, -- because I can certainly
imagine there are lots of scenarios where it's a complex process of creating requests.
It may not be straightforward to estimate exactly how many requests are being produced
given certain user activity.
We can assist you with that.
If you have concerns about that, please contact me or info@ncbi.nlm.nih.gov and we can work
through that with you.
We can have you do some tests and we can look and see what we see and if that rate is something
that corresponds to what you expect.
You can certainly look at the error and confirm, if our rate detection is working, you have
exceeded the rate.
That is one approach.
I would like to be able to offer the ability to work with us a little bit more proactively
to investigate what your rates are ahead of time.
Then we would be able to properly set a better rate for you before May.
Of course whatever the process that you are invoking will not work until your rate drops
below that level.
So I don't know if that's a good way to limit your excess.
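If a client does lean on the error as a signal, it would at least need to pause before trying again, since requests keep failing until the rate drops. The following is a sketch of that fallback only, with the delays chosen arbitrarily. Here `fetch` stands for any caller-supplied function that performs one request and returns the response text.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, first_delay=0.5, sleep=time.sleep):
    """Call fetch() until the response is not a rate-limit error, doubling the pause each time."""
    delay = first_delay
    for attempt in range(max_attempts):
        body = fetch()
        if "API rate limit exceeded" not in body:
            return body
        if attempt < max_attempts - 1:
            sleep(delay)  # give the observed rate time to fall below the limit
            delay *= 2
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

Every rejected request still counts as a request, so pacing requests ahead of time remains the safer design; this is a safety net rather than a throttle.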
So I think we need to conclude now.
Thank you, everyone for coming.
If there are any questions that we did not answer, we will put this on the FTP site and
I will send it out as a link to everybody.
We will get more information about the differences between the IPs versus the NCBI account associated
with that particular user.
I will confirm, like Peter was saying, my sense is that the rate is tied to the key
itself.
Again, we are going to have a rate that is attached to each API key.
So it may not be as attached to the IP address and that's what I want to confirm for everyone.
I want to make sure the understanding is very clear.
My understanding, and I am almost certain of this, is that it's tied to the key and not the IP.
For example, if you had two different machines and you were requesting things with the same
API key from those two machines, the sum of that request activity is what will trigger the limit.
Again, I'm pretty sure that is true, but I want to confirm that and we will update as
we need to to clarify that.
Thank you for the questions.
Thanks, Eric, we are going to sign off now.
Thanks for coming.