♫ THWACKcamp
[upbeat music]
♫ Yeah, yeah, yeah, yeah, yeah, yeah
♫ THWACKcamp
>>Hello, and welcome to our session, Orion in the Cloud:
Hybrid IT User Stories and Best Practices.
I'm Head Geek, Patrick Hubbard.
And joining me today is a good friend
and Product Manager of Network Performance Monitor,
Chris O'Brien.
>>Hey, it's great to be here,
as the network guy talking about the cloud.
Does that even make sense?
I think you just like having me co-present the sessions
you're most excited about.
>>That's actually true,
and it's also true that I wanted to present this session
of THWACKcamp since 2015. And this is the first year
that customers in surveys, and when we talk to you live
at conventions, like Cisco Live, Microsoft Ignite
and even DevOpsDays here in Austin,
you were confirming that a decent percentage of you
have either already moved your Orion servers to the cloud
or you're planning to, or at least thinking about it.
So, in this session, we're going to do a deep dive
into how SolarWinds customers are migrating their systems
to AWS, Azure and Google.
And Chris, you're an expert on the two technologies
that they're going to need--
that's the Orion platform,
and believe it or not, networking.
>>Yeah, it's easy to forget,
in all the conversations about the cloud,
that cloud really is all about networks.
Of course, cloud providers try to abstract the network,
so you don't have to worry about the details,
but while that's great in theory,
in practice, even if you have no traditional
on-premises data center, you're still dealing with
physical access, VPN tunnels, delivery infrastructure,
critical WAN failures-- all of that still applies.
>>That's all true, and you're also going to end up learning,
as a part of this process,
more about VPN performance troubleshooting
than you ever wanted to know.
And then there's the monitoring itself.
Application performance monitoring,
whether it's in the cloud or otherwise,
is really all about interfaces and protocols,
and all kinds of out-of-band communication.
And that's true regardless of which of the two
fundamental approaches you take to APM.
>>It's amazing how, even with cloud-native apps,
monitoring still seems to come late in the development
cycle.
It's a little better with more Dev in DevOps,
because you have to understand what's going on
to react to it.
But I'm still surprised how often we're still monitoring
packaged applications or SaaS
that smells like repackaged, third-party apps
with protocols that aren't clean APIs.
Assuring all that works when the platform hides access
to the network takes some getting used to.
Two fundamental approaches to APM--
tell me you're not going off into cloud-native monitoring.
>>I'm turning into a little bit of a cloud-native guy,
but you could approach it this way.
Some would say that you could either monitor applications
by watching all the elements of the infrastructure;
others suggest that dedicated tracing is the only way
to assure that users are happy.
All we know is, you have to learn both.
>>I know you're trying to be funny,
but you really are turning into Jeremy Clarkson.
>>And that would make you the Stig,
and I'm okay with that.
Okay, so let's get on with it.
We're going to be going in and out of a lot of UI
in this session, whiteboarding topologies
and covering setup and config and best practices.
>>We know it's going to take a lot of information
so, of course, this session will be available for replay.
We'll have links to some of the tools and how-to guides,
all of that stuff you'll want to review when you're doing
your migration planning.
>>Awesome.
So, you want to start with the topology
that we're going to be using,
and then get in the best practices?
>>Yeah, I like topology.
>>All right, so let's take a look at this.
What I've got here is the environment
that I set up for this demo;
so there's maybe 80ish different components
that are configured. And I've got three main pieces, right?
I've got an AWS environment that's running
in the Northeast region in Virginia.
I've got Google Cloud Platform
that's actually running in Australia.
I thought, you know, latency would be something
that would be fun to experiment with--
and Sydney should generate plenty of it for us.
And then I've got Azure running in California.
And I don't know if you've seen this many icons and gifs
all in one place.
>>Yeah, you just keep adding people's logos
until you look competent--that works.
>>Yes, that's kind of it.
But the other thing too is, remember,
documentation is going to be a big part of migrating
anything to the cloud.
It doesn't matter whether it's Orion or anything else.
So one of the great things is that all three
of these providers encourage you
to download these templates.
So, if you just google Amazon PowerPoint--
>>Yeah.
>>You will literally get a--
I think it's 35 slides with usage guidelines.
And they're clear about what you can and can't do,
but they want you to use these icons.
Same thing for Google, same thing for Microsoft.
And Microsoft actually throws in all kinds of other icons,
because there's a little bit more kind of lift and shift.
And so if you're running something else,
they want to be able to show that to you.
But it's also handy because many of you,
according to the last survey, you're actually multi-cloud.
I didn't throw IBM Bluemix in here, you know,
just because, you know, that would be too much.
But when you have multiple environments,
it's handy in your diagrams to use the vendor diagram,
the vendor iconography that actually maps
to that environment, so that you can, at a glance, remember
where you are. >>Okay.
>>So, like you know, if you're going like DNS debugging,
it's kind of handy when you kind of go in and out
of the different colors and icon sets
to know immediately what you're looking at.
>>Yeah, that makes sense.
>>So, we've got here-- just sort of in general,
I'm treating AWS the way that most of us do,
and we were sort of early and first with AWS,
and so there's a little bit more kind of lift and shift.
Something that maybe would have been running
in a VMware VM is now running as an EC2 instance--
and then a few services, so Route 53, RDS,
and a little bit of Lambda.
>>And you have Docker?
>>I've got Docker.
And this is a build it yourself Docker Swarm, right?
>>Okay.
>>Because if you work early with Docker,
you would have sort of come to Swarm as opposed to using
like the container-management service,
where it's managed for you.
So, in this case, I set up a Swarm. There's a controller,
and there's an API to it that's actually
pretty primitive, and we'll see what that looks like.
And then I've got my Orion Server
and the Orion SQL Server instance.
We're going to talk about that and budget,
and a couple of other things--
deciding, you know, is it actually like a
bring your own license version or RDS.
And then, connecting that to the Google Cloud Platform
through their VPN virtual gateway.
And then there's a virtual gateway over here,
corresponding with AWS.
AWS gives you two tools out-of-the-box,
which is kind of nice. They encourage you to set those up.
And then Azure, of course, the same thing.
We have our virtual gateways.
These are all IPsec links,
and then the other one that's kind of weird
is this one between the Azure instance down here,
and this little strongSwan VPN, that's a--
>>And that's like third party VPN?
>>Yeah, it's like a VPN appliance, like OpenVPN.
There's a whole host of them, I think, in the AMI library
for Amazon.
I think there are about 60 different appliances.
Some of them you build yourself,
and some of them you pay for, but--
>>Why not use this just native, like you had in the others?
>>Because that doesn't seem to be easy.
I don't know why AWS and Google will speak to each other,
Google and Azure will speak to each other,
but Azure and AWS--
oh wait, that's number one and number two,
they don't exactly assimilate. >>Yeah.
>>And maybe I just misconfigured it,
but that seems to be the way that most people are doing it.
And a lot of them will actually use one--
a number of the VPN appliance images that are out there
suggest to me, especially in their documentation,
that they're meant for things like multi-region--
>>Mm-hm.
>>Because remember, each one of these regions
is effectively its own cloud.
So people will say, "Oh, I'm not multi-cloud,
I'm only AWS."
Oh, you don't have any replication?
"Well, we're in two regions, and three different
availability zones."
It's like, okay, you're multi-cloud.
So all of that communication, and when we get into,
we'll be talking about MTU in just a minute,
that's a big part of that.
And latency's a big part of that,
because applications are not necessarily designed
to be holistically monitored in an environment.
>>Yeah, okay.
>>Okay, so that's what we're going to talk about.
And then the only other thing is over here in Google Cloud.
This is a little bit more modern--
so this is a Kubernetes cluster
that's actually running on Google Container Engine.
So, it's a managed service providing access
to all the containers.
>>Okay, and just to orient myself
for the Orion platform stuff,
we've got the Orion Server sitting in AWS.
>>Yep.
>>Orion Scalability Engine,
like additional polling engine?
>>Mm-hm.
>>In Google Cloud.
>>I use the fancy word for it.
>>Okay.
And the Orion Scalability Engine also in Azure West.
>>That's right. Remote polling
is a big part of that.
>>Yep, okay.
>>And number one recommendation to think about here,
is as you transition, you will probably
want to have remote pollers.
>>Yep.
>>You can certainly use agents to do a lot
of what we're doing here, if your environments are small,
and if you haven't taken a look at Network Automation
Manager or Network Operations Manager,
take a look at those
because there's a little bit more flexibility.
There's no base licensing,
and you can have as many remote pollers as you need.
And you're briefly going to have all
of your existing pollers in your infrastructure,
and then kind of phase out as we talk about topology here.
So, let's talk about the main approach here, right.
So, why do you think, why--
the customer's that you've spoken to
who have migrated, whether it's SolarWinds products,
but especially the products that we monitor--
why are they moving to the cloud?
>>Cost is often a big piece of it. Optimizing their costs.
There's an idea, at least, that it will be lower costs
in the cloud.
Performance is a question as well, or is a common driver.
When there's a big variation in the workload,
that sort of spiky workload, it's often easier to do
in the cloud or cheaper--probably both.
>>Right.
>>So, this is some of the common ones.
>>Whose idea is it typically?
Is it IT's idea,
or are they being 'voluntold?'
>>It's usually not, is it?
It's usually voluntold. Like CIOs, those sorts of people,
they know where the world is going,
and we need to get there as well.
>>Right, and the budget, it won't be double the budget
for a period of overlap during that transition, it's just--
>>Well, it's cheaper.
>>It's cheaper, but it's not instantaneously cheaper.
And then, what are the typical sort of access network issues
that you encounter?
Do people tend to use VPNs?
Do they tend to use direct connects?
Do they build something where they can take advantage
of BGP, for example, and actually have what is essentially
a distributed access network that pulls all those together
into a single network? Or is it a lot of, kind of, hub and spoke?
>>We're seeing a lot of VPN, just as you have here.
A lot of VPN tunnels. And then connectivity beyond that
handled by the enterprise,
usually just one, two VPN tunnels for redundancy
to the big hubs they have in the cloud.
>>Right. That was one of the things
that was kind of interesting talking to customers--
is that a number of you are using a direct connect,
where this is sort of the, let's say,
second wave of deployment-- where it's almost, you know,
kind of rogue IT or someone set it up,
and VPN was something that we as network administrators
get asked to then implement,
and then when they finally get big enough
that they say, "This is ridiculous.
My cost for operating the VPN, the bandwidth,
the rest of it, is much higher
than it would be to have direct connect."
But then you use direct connect,
and now you've got a router and probably a switch
sitting in a cage somewhere in colo.
So although that connection from that colo-provider vendor
is all magic and managed for you,
now you've still got more distributed hardware.
>>Yeah.
>>And that's one of those where they're like,
"Oh, I have to have redundant connections to the cage,
so that I can fail over to my primary link?"
>>Yeah, in my cloud deployment.
>>In my cloud deployment, that's right.
So, the basic steps for this are about what they look like
for any Orion redeployment.
I've got a doc here, and I'm just going to use this.
So if you haven't seen this before, some of you,
especially if you've been using the Orion platform
for a few years, you've probably moved it.
>>Yeah.
>>Certainly to get to, you know,
kind of as part of your hardware refresh.
But when you're moving it to a different network,
there's some extra considerations.
So, out on the Customer Success Center,
there's a guide for this, as there is everywhere else.
And we'll put the links to all of these guides
in the description.
So one of them is, okay,
so we're going to migrate to a new IP and host name.
>>Yeah, this is actually a good chance to plug.
There's a whole bunch of different migration guides,
like this one's specific to you,
"I want to migrate to a new IP address, a new hostname."
That's a very specific scenario.
We have several of those migration guides
that make it really easy, and step-by-step.
>>Yeah, what are you trying to do,
and then read just a page or two about
what you're trying to do, instead of a very long document.
>>Right.
>>So this is the guide,
and I will net out what the steps are here first.
One, prepare your new hardware, and in this case,
it's new virtual hardware,
or it's somebody else's hardware, right?
So, this is going to be--take a look at the sizing guides
for a new install, which are also in here,
and it will actually give you specifics for installation
in Azure--recommendations for Azure and AWS.
They are surprisingly similar to the same requirements
that you would have anywhere else,
and you can actually map those to the sizes.
>>Yeah, it's almost like Azure hasn't invented
their own CPU and RAM systems, so it's very compatible.
>>It is. And the other nice thing, too,
is you can just down the instance
and change the instance type and check the performance.
So you can actually do a little bit of performance testing.
So the first thing, set up the new environment,
get it installed, get all the bits where they need to go.
Then you're going to go into your
existing Orion platform install,
and you're going to release all of the product licenses.
>>Yeah.
>>If you haven't seen that,
the way that you get there is you go into 'Settings,'
'All Settings,' 'License,' which is
down here at the bottom.
Okay, so 'License Manager,'
select the module that you want to remove,
and then click--
>>Not NPM.
>>Yeah.
And then click, 'Deactivate.'
Oh, I forgot something.
Before you do that, make sure you come over here,
and grab this license key.
Just throw this into a spreadsheet,
and make sure you've got two copies of it before you start
deactivating this.
You can reactivate it here.
The main thing is, just make sure you keep your key,
because it's a handy way to do it.
You can pull the key off of the Customer Portal,
if you lose it.
It's not a one time deal, but it's just eas--
>>Doesn't it feel better to copy/paste sometimes?
>>It does. Or just grab the whole page,
and that works too.
Okay, so you're going to deactivate all of them,
right? >>Mm-hm.
>>Then, you're going to go back to
your new environment, assuming that you're going to move
the database--there's a separate guide for migrating the database--
and I'm going to talk about database next,
because there are some special considerations with cloud.
But you'll relocate the database,
if that's what you're going to do.
There is one change that you're actually going to want to do--
of course--back up your database.
There's a script for it right here.
Log into the database and actually execute these scripts.
And these are basically taking some
of those really tight bindings of subnet and IP and hostname
out of the engines tables and a couple of the other tables,
so that it's going to have a clean start.
Now, in terms of the existing pollers, your remote pollers,
because chances are you're going to be connecting
to the remote pollers, which used to be adjacent
in your on-premises network.
Now, you're going to be connecting briefly back,
and this is also assuming that you're drawing down
the size of your internal network
and at least your data center.
Your delivery network's still the same,
but the data center's getting smaller,
which is the whole reason that you're now
making this transition.
>>You've got to move something to the cloud.
>>Yeah, you've passed that tipping point now
where you're like, well, more than half of my stuff
is all now operating in the cloud.
I need my monitoring platform to be adjacent,
and that's actually...
It's interesting talking to you all at conferences,
and at SWUGs especially, because I thought the tipping point
would be sort of in the middle of the natural curve, right?
It would be about 50%.
And some of them are actually saying,
once they realize the momentum for change,
once they see that the business
has actually bought in on cloud,
and they can see that it's not going to stop--
a lot of them are actually getting ahead of it.
Because moving the monitoring first,
like we talk about monitoring as a discipline--
but like, moving that first, having the dashboard first,
means that they can more immediately get accurate
performance measurements of relocated applications
instead of, well now, it's behaving differently,
my monitored app is behaving differently than it did.
Is it because I've moved it? Or is it because now
I've introduced latency or complexity
into that monitoring path?
>>Yeah, that makes sense. And one of the key pieces
of information people use to troubleshoot
is the data from your monitoring server;
so having that local makes sense.
>>Yep. So they tend to be sort of at that 30% total transferred;
that seems to be like a third in, once they feel the momentum.
But anyway--so you execute a couple of scripts
that are listed out here.
Those will take care of the moving
the primary engine database records,
and then you're essentially going to rerun
the Configuration Wizard.
It will reconnect to the database,
and then you're going to relicense.
So take the licenses that you cut
and pasted into your spreadsheet,
drop them back into the license manager,
and you should be pretty much good to go.
Again, it's documented; it's pretty straightforward.
So, that's the first step.
And just consider this as any move.
>>Mm-hm.
>>The other piece of it is when you, for example,
if you're doing NetFlow, there's some other considerations,
right? Like you might want to leave your NetFlow collector
where it is in your on- premises network,
if that's where most of your network traffic is.
>>Yeah, so it isn't encrypted, for one reason.
>>Right, well that's true.
And it's also going to keep your network pretty busy.
And because NetFlow is still UDP,
adding additional links and the VPN into that route,
may not be what you want to do.
In the long term it may be,
as things get more simple.
I mean, it may essentially turn into a set of access points
and access networks and that's about it.
In which case, you might want to go ahead and redirect it.
But do think about that in terms of
where you'll relocate that, and the access.
One thing I do want to mention here--
and I'm going to come back to this architecture diagram
for one second--
is does latency have any effect on MTU?
And I asked the father of NetFlow this question.
>>MTU?
I think there's interaction between them, right?
Because MTU has an impact on whether fragmentation is used,
and fragmentation yields more packets
to which latency applies.
>>Mm-hm.
Well, that's exactly it. And what I found was
regular, you know,
kind of default Microsoft Server 1,500-byte MTU links
break down a lot on many of these IPsec tunnels.
And I thought at first it was me.
And then I started thinking about it,
and I realized that we were getting packet fragmentation,
and that the latency then, the retry, takes a lot longer.
So if your retry is a millisecond or less,
or something in that range,
you take the window size down to figure out
what's going to reliably transmit.
That's one thing.
But applications that are already pretty inefficient
with retries, like HTTP and some other ones,
those protocols-- bad.
>>Yeah, and one of the challenges,
any time you're developing applications that work
across the WANs, you have to be very careful about
number of roundtrips.
Because roundtrips, where roundtrip number two
waits on the result from roundtrip number one,
ends up as a multiplier for your latency number.
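[Editor's note: the roundtrip-multiplier effect Chris describes can be put in rough numbers. This is a hypothetical sketch, not from the session; the function name and figures are invented for illustration.]

```python
def transfer_time_ms(rtt_ms: float, serial_roundtrips: int) -> float:
    """Rough lower bound on wall-clock time for an exchange where each
    roundtrip waits on the result of the previous one, so the link's
    latency multiplies with every dependent exchange."""
    return serial_roundtrips * rtt_ms

# A chatty protocol needing 20 dependent roundtrips:
print(transfer_time_ms(1, 20))    # on-prem, ~1 ms RTT -> 20.0 ms
print(transfer_time_ms(170, 20))  # the Sydney link, ~170 ms RTT -> 3400.0 ms
```

The same application that feels instant on a LAN takes seconds over the 170 ms tunnel, with no bandwidth problem at all.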
>>That's right.
So, I did a lot of experimenting. And it's interesting,
this connection between AWS and Google.
Now you would think that this being in Australia,
that this link would at least be more durable
because it's a double link-- it's got some redundancy,
and it's between two pretty large providers.
That one--I could get maybe 1,200 bytes together
in a coherent way--
and with nearly 170 milliseconds of latency,
those retries are taking longer.
>>1,200 bytes inside of one packet.
>>Yeah, but this one, through like
kind of homegrown VPN appliance,
that one I can do 1,490ish.
>>Yeah, so it may well be that the transit between you,
between AWS and Google Cloud versus between AWS and Azure,
just has different minimum MTU, right?
Because MTU is all about what is the point along the way
that has the lowest MTU--
because that's your constraint.
>>Right. Well I thought that too.
And I did a lot of reading, and it does turn out
that certain, whether the strongSwan appliance
or Google's virtual gateway,
they do have a certain minimum MTU.
But I started tracking with a little script
and started watching, and they were wobbling.
>>Hmm.
>>And they wobbled within a small zone.
And I'm still trying to understand that, but--
>>Could be Multipath.
>>It might be Multipath,
and we're going to get into that in a second--
and using NetPath to figure that out.
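[Editor's note: Patrick mentions tracking MTU "with a little script." A common way to do that is to binary-search the largest payload that transits without fragmenting. This is a sketch of that idea, with invented names; the `can_pass` callback would normally shell out to something like `ping -M do -s <size> <peer>`, which is not shown here.]

```python
def find_path_mtu(can_pass, lo: int = 576, hi: int = 1500) -> int:
    """Binary-search the largest payload size (bytes) that the path
    carries unfragmented. `can_pass(size)` must be monotonic: if a size
    passes, every smaller size passes too."""
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if can_pass(mid):
            best = mid      # this size got through; try bigger
            lo = mid + 1
        else:
            hi = mid - 1    # fragmented or dropped; try smaller
    return best

# Simulate a tunnel that fragments anything over 1,200 bytes:
print(find_path_mtu(lambda size: size <= 1200))  # -> 1200
```

Run repeatedly on a schedule, a probe like this is what surfaces the "wobbling" Patrick saw on the inter-cloud tunnels.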
But in this case, obviously, this is three clouds.
And normally, you would have one or two cloud providers,
and then maybe remote offices, or especially
if you're a large environment with multiple data centers,
you're probably consolidating that--
and that cloud is one of those things
that's been sold to you.
Because it's like, oh, we're going to solve this distributed
data-processing environment that you have.
You'll just have regions wherever you need them
in this wonderful holistic environment.
Well, you're just trading latency
from one place to another;
so that doesn't go away.
But where this gets really interesting is DNS.
It's that in your environment now--
especially if you're coming from
all on-prem or mostly on-prem--
you probably have a pretty coherent naming service.
And one of the things that was--I don't know why
it didn't occur to me before,
but each one of these has a proprietary mechanism
for managing their internal DNS.
>>Oh really?
>>Yeah. And so naming,
especially for third-party packaged applications,
whether it's a hostname or an IP,
can get kind of complicated.
For most, like if it's an Oracle database or something else,
they tend to stick to maybe an IP address and a host name--
something that they got from somewhere.
So worst case maybe on an app like that,
you modify the host file, if you have to,
and you do it once, and you move on.
But the point of cloud, like especially up here
at the Docker Swarm--
that's an elastically provisioned resource, right?
So, it gets bigger and smaller depending on the workload.
Well, I can't go in and modify the host configs,
so that it knows where to send its agent logs.
Like it's-- actually, these are all agent-based pollers,
and they're talking to the primary Orion server, right?
Well, if they don't know how to route that,
they're not going to be able to talk to that server,
and the fall-back address may or may not work.
So what I did was I used Route 53.
Route 53, as it turns out, can actually do
not only external geo-routing, but it can also provide
internal VPC routing.
And then you can set up a bind instance,
and then I have both of my Google and Azure
actually exposed to that,
so that provides internal routing
for all of those addresses.
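[Editor's note: with boto3, publishing an internal record into a Route 53 hosted zone comes down to building a ChangeBatch for `change_resource_record_sets()`. A sketch, with a made-up zone ID and hostname:]

```python
def private_record_change(zone_id: str, name: str, ip: str, ttl: int = 60) -> dict:
    """Build the keyword arguments that boto3's
    route53.change_resource_record_sets() expects for an internal A record.
    The zone ID and names used below are placeholders, not real resources."""
    return {
        "HostedZoneId": zone_id,
        "ChangeBatch": {
            "Changes": [{
                "Action": "UPSERT",  # create or update in one call
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    }

req = private_record_change("Z0000EXAMPLE", "orion.internal.example.", "10.0.1.25")
# client = boto3.client("route53"); client.change_resource_record_sets(**req)
```

Keeping the monitored names in one authoritative place like this is what lets elastically provisioned agents find the Orion server without host-file edits.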
And the reason that that's particularly important
when you migrate your Orion platform,
is Orion is a device for discovering multiple hostnames.
That's what it does.
It discovers lots and lots of hostnames.
It discovers lots and lots of IP addresses,
and depending on the protocol,
if you're talking about, not so much NetFlow,
but IP address management,
especially some of the polling for services on Orion,
on Windows services, Linux script monitors,
the rest of it-- narrowing down the number of possible
changes in IPs that you have to, and host mappings
that you have to set up,
will make all of them operate better.
I mean, the Windows protocol that's used between
the remote pollers and Orion itself,
I was counting three or four different hostnames.
So DNS is a great way to make sure
in any cloud environment, but especially monitoring--
regardless of the monitoring tools or the vendor
that makes them--naming services is one of those things
that you need to plan for and don't overlook.
Okay, so one thing that's different here is--
Have you ever heard of the term 'cloud inversion?'
>>No.
>>So the idea being, that point where you
take services that were-- not the kind of really cool
application deconstruction in the cloud-native services
sorts of cloud, but actually just lift and shift.
Like moving workloads and moving package applications.
>>Move that old crusty application into the brand new.
>>Yeah. Where they're still crusty and they still smell,
but at least you can't see them or smell them.
It's almost a way of--it's not just turning it upside down,
it's not just a relocation to a different data center,
because a lot of things become opaque.
And at the same time, you have a challenge of monitoring
a lot of new things. And so that's one where you're going
to start to experiment with new tools.
So what I want to do here is come back over here
to our Orion instance.
And I'm going to go back to the application--
or to the application dashboard, because we haven't been here yet,
and we'll talk about how to use SAM.
SAM is your best friend when monitoring cloud
for a lot of reasons.
And I know that I get on my DevOps soapbox occasionally,
and I did start as a developer.
But I have got to encourage you guys
to start playing with script monitors if you haven't--
because there is not anything that you cannot monitor
with SAM, if you're willing to do a little bit of scripting.
What I'm going to show you here is actually based out of Python
instead of Bash, but that's just
because I like Python better.
So when we talked about strongSwan,
that kind of junky thing that I built,
it's kind of nice to know if that tunnel is up, right?
>>Mm-hmm.
>>And although I have APIs that I can use
for Google and AWS's virtual gateways,
I don't have that here, so it's effectively blind to me.
So the way that I would normally monitor this,
is I'd come over here to PuTTY,
I'd pull up the appliance itself, and then I'll do sudo...
All right. So I've got my configuration information
for the tunnels that are established here.
I just call this one Azure,
and it's also telling me the protocol.
And then down here it's giving me my 'bytes in,'
'packets in' and 'packets out,'
and a little bit of other information,
like how that tunnel is configured,
and what the inside and outside gateways are, right?
>>Yep.
>>That is not a good way to monitor that.
>>Yeah.
>>But what is a great way to monitor that
is what we see here. So here it is monitored inside of SAM.
So, here's the data that we were looking at before, right?
I can see bytes in, bytes out, route--
so that's like your interesting traffic--
Status. So here's your local and public IPs--
for the peer IPs, is that what that is?
>>Yep.
>>And then the tunnel name.
And if there are multiple tunnel names,
it would list those for me.
And then over here, I have got a nice chart,
so I can actually see what my in and out traffic looks like.
That's just much more convenient.
And I can alert and report and do all kinds
of amazing things on that.
And the way that that was built, as you might guess,
is with a custom monitoring template.
We'll take a look at what that looks like.
So, I just created a template
so that I can apply it to multiple machines,
because chances are I'd have lots of them to back up.
This one uses a script monitor.
And a script monitor, as you well know, is a script
and a little bit of other information.
This one, the connections are based on SSH private keys
instead of using username and password.
And then I have a script,
and the script basically executes that command
that we saw before, and it returns it in the standard
format that SAM expects the data to come back in.
So if you look here, there's message bytes in,
and a statistics bytes in,
so one of them is essentially description and then the data--
description, data, description, data.
>>That makes sense.
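[Editor's note: the description/data pairing Patrick describes looks like this in practice. A minimal sketch: the helper name and the sample values are invented, and a real script would parse the `ipsec statusall` output rather than hard-code numbers.]

```python
def sam_output(pairs):
    """Emit results in the SAM script-monitor convention: each metric gets
    a Message.<Name> line (the description) paired with a
    Statistic.<Name> line (the numeric value SAM charts and alerts on)."""
    lines = []
    for name, value, message in pairs:
        lines.append(f"Message.{name}: {message}")
        lines.append(f"Statistic.{name}: {value}")
    return "\n".join(lines)

# Hypothetical tunnel counters; a real script would scrape these via SSH.
print(sam_output([
    ("BytesIn", 482133, "bytes received on tunnel azure"),
    ("BytesOut", 91577, "bytes sent on tunnel azure"),
]))
```

Print that to standard output, exit 0, and SAM treats each Statistic line as a separate monitorable metric.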
>>Yep, and then this one right here,
when you create these-- so like in this case,
that byte counter increases over time--
is basically just a counter.
>>Yeah.
>>There's a check box when you create it,
which is 'Count Statistics as Difference.' Set that to 'true.'
>>Yep, just like we do with interfaces.
>>Just like with interfaces.
Well, what's really cool is, with cloud,
you're going to be monitoring a lot of different technologies
that maybe are new to you, right? Maybe LAMP stack--
a lot of other things that you're going to want
to go ahead and include.
So one of those, for example, might be
Docker.
Now, not just Docker--Docker Swarm in our example,
right? So, I sort of did that thing--
set up that first Docker environment that was monitored
by hand, and then converted it to Swarm and added
a bunch of nodes and made it elastically provisionable.
Well, I'm using that same approach here,
where I'm actually using Docker command line API
and custom monitor, and it's coming back
with number of containers, the number of nodes,
the state that they're in--some state
other than running or shutdown--
and you can see that it's changing over time,
and I can make changes.
Let's actually do that.
We'll come up here to a little portal manager that I've got,
and we'll go to our service and just scale this thing up,
but we'll come back to this and we'll watch the numbers
on the chart actually go up. And I'll scale it up to 10.
And again, the great thing there is that this environment
is going to take care of managing that for me--
so I actually care about that.
But this is an example of one where I had
to kind of build it myself.
>>Right.
>>Because it's an old Docker Swarm that I set up.
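[Editor's note: the node and container counts Patrick charts can be derived from plain Docker CLI output. A sketch under the assumption that the status lines come from commands like `docker node ls --format '{{.Status}}'` and `docker ps -a --format '{{.State}}'`; the function name is invented.]

```python
def swarm_counts(node_status_lines, container_state_lines):
    """Summarize per-node and per-container status lines into the numbers
    fed to a SAM custom monitor: node count, ready nodes, container count,
    and containers in some state other than running or exited."""
    nodes = [s.strip() for s in node_status_lines if s.strip()]
    states = [s.strip().lower() for s in container_state_lines if s.strip()]
    return {
        "nodes": len(nodes),
        "nodes_ready": sum(1 for s in nodes if s == "Ready"),
        "containers": len(states),
        "not_running": sum(1 for s in states if s not in ("running", "exited")),
    }

print(swarm_counts(["Ready", "Ready", "Down"],
                   ["running", "running", "restarting", "exited"]))
```

Pair each of those counts with a Statistic line and the elastic scaling shows up on the chart as the container count moves.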
Smarter would be if I wanted to do something like Kubernetes,
or use the container management service from AWS
to take care of that for me.
So with Kubernetes, the way that that looks
is there's a pretty clean API for monitoring it.
So here is a Kubernetes monitor for the Kubernetes cluster
running in Sydney in that Google Cloud instance.
And so, you don't like my name here?
>>K8S? What does that mean?
>>Kates.
>>Kates?
>>All the cool kids are saying it, 'Kates.'
>>Okay.
>>Okay, well SolarWinds is then 'Sates.'
But yeah. [laughs]
Okay. So this one, I'm using a couple different charts.
So the data that's coming back is, again,
all of those components that we saw. But it also--
this API's a little richer and it gives me memory,
so I can actually walk through each one of the pods,
each one of the containers, and actually roll all that up.
>>So, this is an aggregate.
You have 12 containers using 2.6 cores across all of them
and 744 megabytes of memory.
>>Megabytes of memory. That's right.
Number of nodes. And then which ones are in
a 'not ready' state. And what 'not ready' means
is that when you make changes, it can take a little while
for the reporting to roll back up to the control node.
So that way, I don't panic if I make a change,
and it takes five minutes before I start seeing data.
I know, okay. It's just that it's not reporting,
so that number might temporarily bump up.
And then I just broke this into a couple
of different charts, right? So, this one up here
is the same thing that we saw for Docker--
where I had containers and nodes and status--
but I've also got my memory as a chart,
and I've got my equivalent cores as a chart, too.
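The per-pod rollup described here can be sketched as below. The payload shape mirrors the quantities the Kubernetes metrics API reports (CPU in millicores like '250m', memory in 'Ki'/'Mi'), but the field layout and names are a simplified assumption, not the exact monitor from the demo.

```python
def parse_cpu(s):
    """Convert a Kubernetes CPU quantity: '250m' -> 0.25 cores, '2' -> 2.0."""
    return float(s[:-1]) / 1000 if s.endswith("m") else float(s)

def parse_mem_mb(s):
    """Convert a memory quantity like '744Mi' or '761856Ki' to megabytes."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if s.endswith(suffix):
            return float(s[:-2]) * factor
    return float(s) / (1024 * 1024)  # bare byte count

def rollup(pod_metrics):
    """Walk every container in every pod and sum CPU cores and memory."""
    cores = mem = containers = 0
    for pod in pod_metrics["items"]:
        for c in pod["containers"]:
            containers += 1
            cores += parse_cpu(c["usage"]["cpu"])
            mem += parse_mem_mb(c["usage"]["memory"])
    return {"Containers": containers,
            "Cores": round(cores, 2),
            "MemoryMB": round(mem)}
```

For instance, a container reporting `761856Ki` rolls up to exactly 744 megabytes, the kind of aggregate shown on the chart.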
So, this comes into cost and a couple of other things,
but again, using the same approach for the VPN
and anything that I can get to in a command line,
I can pull into SAM.
One thing that is a little bit different
is you might ask, "Well, you're monitoring memory in Kubernetes,
why aren't you doing that in Docker?"
Well, I could through the API, but remember that
you've got built in monitoring for AWS.
And so I'm just using that, right? So that makes sure--
>>Less to maintain.
This all works out-of-the-box.
>>Less to maintain, works out-of-the-box,
and it's also giving me things like volume information
and events that are related to it
that I otherwise wouldn't get. And it also,
remember it lets me kind of kill bad actors, right?
Because I can come into my management portal right here,
and then go to your 'Cloud' tab,
and then if you can't reboot it through the command line--
or you just don't feel like it--
you can actually stop and start-- in fact,
terminate instances right here, too.
>>So action for a bad actor.
>>For a bad actor.
So yeah. So that combination of custom Linux monitors,
and the built-in for AWS makes that really easy.
Now, you might want to start to think about
some other new tools, right?
Like I use Papertrail a lot.
I don't know whether you've worked with it at all.
>>Not much.
>>So Papertrail is a log aggregation service--
think of it as a giant syslog in the sky.
>>Yep.
>>So all of those workloads, those stress workloads
that I am spitting out in Kubernetes and in Docker--
that's something like 100,000 messages an hour
into Papertrail here, right? And so I don't know,
a lot of times, when I migrate applications to the cloud,
and especially just migrating Orion itself,
how it's going to behave. And I might see novel issues.
And capturing those logs-- well, I don't have to log in
through the opaque interfaces of the VPC
if I have all the logs where I can get at them.
It's really handy.
>>You can aggregate from multiple clouds,
as well as from the machines themselves--
so all of that stuff is in one spot.
>>That's right. But it also lets me alert.
Because one of the things that you need to worry about
is if I now move my primary Orion poller to the cloud,
and I lose my VPN connection to my on-prem network,
how do I know it's still alive?
>>Yeah.
>>I kind of want to know that, right?
So, two things. One, you're going to want to set up a dial-up,
out-of-band VPN connection just through a gateway--
and they charge those by the hour.
Those are pretty handy. And that way, at least
you can use your mobile app to get to it.
But the other thing you can do is use the events of Orion
itself to raise alerts if Orion goes down.
And you can use Papertrail for that.
And you can actually use the free tier if you want.
So, here are the events that are coming from
that Orion poller, right?
And I was trying to think,
like, where could I get a heartbeat?
And it occurred to me that the business-layer engine,
the main service, every five minutes or so
does some backups.
>>Mm-hm.
>>And it sends a message to tell me that
it's doing a backup. And when things break,
this service stops. So I'm sending this one log
using just the regular log forwarder to Papertrail.
Then I set up an alert on it--
which is Orion running a heartbeat.
It runs every 10 minutes, and then it alerts when--
>>No new events match in that 10 minutes.
And then go ahead and send events.
>>Send in events.
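Papertrail's saved-search alerts handle this server-side, so you configure it rather than code it. Still, the dead-man's-switch logic being described is worth seeing plainly; here is a Python sketch of the equivalent decision if you polled a log search yourself. The query string is hypothetical; the 10-minute window matches the alert in the demo.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical search that would match the business-layer backup message.
HEARTBEAT_QUERY = "BusinessLayerHost backup"
WINDOW = timedelta(minutes=10)

def should_alert(heartbeat_times, now=None):
    """Dead-man's switch: alert when no heartbeat landed inside the window.

    heartbeat_times: timestamps of matching log events, newest or oldest
    first-- order doesn't matter.
    """
    now = now or datetime.now(timezone.utc)
    return not any(now - t <= WINDOW for t in heartbeat_times)
```

If the five-minute backup message stops arriving, two polling cycles later this flips to True and the alert fires.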
Now, if I'm going to send events, I can't just send an email.
I probably want to send an email too,
but I can't rely on that, because if my connection's down,
I'm not going to get anything.
So one of the things that you'll want to do, too--
and check out our lab episode of the integration
of Orion alerts with Slack,
because that will show you how to actually use REST
and third-party services to send messages out.
One of the things I really like is Pushover.
That's an app. It's free to use.
And you can actually define apps.
And it's integrated with iOS and Android,
so you'll actually get an alert to your screen,
and push those events out that way.
So again, out-of-band monitoring--the monitor
becomes really important when you push it
out of your building.
>>Sure.
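Pushover's documented messages endpoint takes just an application token, a user key, and a message. Here is a minimal Python sketch of sending an Orion alert through it; the token and key values shown in the usage comment are placeholders for your own registered app.

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"

def build_payload(app_token, user_key, message):
    """URL-encode the three required fields for Pushover's messages endpoint."""
    return urllib.parse.urlencode({
        "token": app_token,   # token for the app you defined in Pushover
        "user": user_key,     # your user (or delivery group) key
        "message": message,
    }).encode()

def push_alert(app_token, user_key, message):
    """POST the alert; Pushover relays it to the iOS/Android apps."""
    req = urllib.request.Request(
        PUSHOVER_URL, data=build_payload(app_token, user_key, message))
    with urllib.request.urlopen(req) as resp:
        return resp.status == 200

# Usage (placeholder credentials):
#   push_alert("your-app-token", "your-user-key", "Orion heartbeat missed")
```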
>>Another thing you may want to look at--
I gave the example of how to monitor Linux.
And with the Linux agent, it's really pretty easy to monitor
just about any type of Linux you can think of with SAM.
But if you only have a little bit of Linux,
what you might want to take a look at, too,
is Pingdom Server Monitor,
because it is a hosted method to be able to monitor those.
And so here's my strongSwan server, right?
This is that appliance again.
And what's cool about this is it's giving me
most of the metrics, and I can also go and add plugins
for things like Docker and EC2 monitoring,
plus a pretty long list of other plugins that I can assign
just out-of-the-box. But it's all push-based.
So you install the agent from the command line over SSH,
or push it as part of maybe your Chef
or Puppet build-and-deploy process,
and then it all pushes the metrics up into the dashboard.
So again, like Papertrail, it's pushed to the cloud
and it's aggregating all of those in one place.
And you don't manage that,
so that's one thing to take a look at.
But the other thing, too, is you will probably start
to manage data from systems where--
especially if you're using elastically provisioned resources--
you may be spawning and killing
thousands of containers a day.
You are not going to do what I did there before
and set up monitors by hand.
You'll want to use monitor discovery or the API--
or, in this case, I'm actually integrating with Chef
to add those new nodes as they come up--
the Docker instances, right?
So that it adds monitoring for them automatically.
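That Chef-driven registration ultimately boils down to a call against the Orion SDK's SWIS REST API (the `Json/Create` verb on port 17778). Here is a minimal Python sketch of that call; the host name is a placeholder, and the node properties shown (ICMP-only, engine 1) are a bare-minimum assumption rather than the full set a production deploy handler would supply.

```python
import base64
import json
import urllib.request

def build_create_request(swis_host, user, password, ip_address, engine_id=1):
    """Build the SWIS REST call that creates an Orion.Nodes entity."""
    url = (f"https://{swis_host}:17778/SolarWinds/InformationService"
           f"/v3/Json/Create/Orion.Nodes")
    body = json.dumps({
        "IPAddress": ip_address,
        "EngineID": engine_id,        # which poller owns the node
        "ObjectSubType": "ICMP",      # ping-only; SNMP needs credentials too
    }).encode()
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth}",
    })

def add_node(swis_host, user, password, ip_address):
    """Fire the request; SWIS answers with the URI of the new node."""
    req = build_create_request(swis_host, user, password, ip_address)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# A Chef handler could call add_node("orion.example.com", user, pw, node_ip)
# for each instance as it converges.
```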
So one of the things that you might want to look at, too,
is check out Librato for sending those metrics to.
This one--these are essentially the same metrics
that we were looking at before in a smaller dashboard.
But when you look at something like monitoring Zookeeper,
for example, where you've got huge numbers
of reporting elements, and you're trying to aggregate
those all together, it's a really handy way
to build dashboards that will take care
of dynamically configured resources--
so you essentially have named paths.
And it also does multiple, tag-based analysis as well--
multidimensional tag analysis.
That's pretty handy.
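As a rough sketch of how metrics land in Librato, here is a helper that builds the JSON body for the classic gauges endpoint (`POST /v1/metrics`, authenticated with your Librato email and API token). The metric names and source value are illustrative, and the newer tag-based measurements go through a different endpoint.

```python
import json

def librato_gauges(measurements, source):
    """Build the JSON body for Librato's classic POST /v1/metrics endpoint.

    measurements: dict of metric name -> numeric value.
    source: the dimension Librato uses to tell reporting hosts apart.
    """
    return json.dumps({
        "gauges": [
            {"name": name, "value": value, "source": source}
            for name, value in sorted(measurements.items())
        ]
    })

# e.g. librato_gauges({"swarm.containers": 12, "swarm.cores": 2.6},
#                     "k8s-sydney") -> body for one POST per poll cycle
```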
The only other thing--
and I know this is going to sound like programmer stuff,
but it's not--is distributed tracing,
which is that second way of doing APM monitoring.
It is something that I think you should all
take a look at and learn.
Whether it's--there are a number of different products
that do it.
I'm going to show you what that looks like in TraceView here.
But the idea is that you need to be able to see
application performance from outside of the cloud--
it's coming from the user's experience--
and trace that back through all the layers
and actually do aggregated analysis.
And that's especially true where applications break,
because there may be data
that's a part of the procedure calls--
like the data itself can cause it to break--
so it's really important.
And it will become increasingly important
over the next few years.
So it's something that you should take a little bit of time
to learn. And what I mean by that is,
like this is a-- basically a hotel booking service.
So it has a number of different components
that are all working together.
Well, a couple of them are microservices--like
pricing and availability, and the credit card piece.
And the booking service that's actually making
the reservation is taking most of the time,
because it's integrating with the most parts.
So you would typically start with a breakdown by layer.
And if you're using WPM, or using Pingdom, for example,
you're used to sort of seeing that waterfall
of how long the transaction takes.
Well, when you start to really look at all the transactions,
you start to think more like,
'What do these transaction periods look like?'
Like, what are the patterns that begin to appear?
>>So not just the averages, but outliers.
How many outliers? How do those contribute
and make the average?
And what is the impact to your application
and your application usage because of that?
>>Right. Because when all of the resources are variable,
you start to get into some really interesting
root-cause analysis.
But being able to trace individual requests
gets to be important too.
Because when you find those outliers that are way outside,
you need to be able to look at that.
And so we're not looking here at just sort of monitoring
from the bottom up in the infrastructure;
this is actually a transaction.
Now, we can go and take a look and see.
So here, this is a Java-based framework, right?
So we've got Tomcat, Spring, MongoDB on the bottom end of it.
And so these are the transaction calls of a single request
that was made to the website.
And we forget how many of them there are.
And back here, we're just hammering that MongoDB, right?
So I might want to talk to the application engineers
about this. Especially if this is homegrown.
But I can look to see where I'm spending most of my time--
so especially where I have interconnected elements
that are maybe--that are each one of those deployed
inside a container, inside of my cloud provider.
Maybe it's the size, the resources that are dedicated to it.
Maybe they're in different zones,
and they need to be able to get closer together.
But also, sometimes it's handy to be able to go and look
at what the actual query was, right?
I can actually look at the value.
So if I have one that I'm constantly getting an error,
I can go back and say, you know what--
>>So it's like infrastructure toward app
versus app toward infrastructure.
>>That's exactly it. That's exactly it.
And you have to be able to monitor both
when you have a hybrid environment.
So that's the other tool to take a look at.
Okay, so there's a couple things to remember here.
First, I didn't have time to go into RDS versus
bring your own license for the SQL Server.
I will put that up as a THWACK post,
and then link it to the description
so that you guys can check it out.
There's cost considerations.
There's performance considerations.
I could not build an environment on my own instances
that ran as fast as the same hardware did in RDS--
but RDS was more expensive.
So if you have a large environment,
and you're already running a lot of RDS,
you won't really notice.
If you're just kind of starting out,
you might want that on a separate machine.
If you're using RDS,
there's a couple of special considerations.
You either have to restore the database into RDS,
or you have to pre-prepare the RDS with a special script
in order to get Orion to install on it.
Because RDS doesn't give you all of the SA-level
management stored procedures that the installer expects.
>>That makes sense.
>>And I'll include a script for that too.
But the main thing to remember here--
and I think the thing that's most interesting
in talking to customers about it--
is the reason that they are moving their entire platform--
not just SAM, which is really important,
but also NPM-- is that they still
have a delivery network, right?
That's never going to go away.
They are always going to have to get applications
that are running off-prem to be made available on-prem,
or there's no point to any of it.
>>You always have the users. What are they going to connect to?
WiFi or wired, you still have to connect.
>>WiFi or wired. And they've got VoIP.
And you've got NetFlow considerations.
You've got firewalls.
Like if you're looking at the new ASA monitoring
that's built into NPM, for example,
that's one of the first things you're going to need to do
as a part of your VPN. Are my VPN tunnels working
the way that I expect?
So, it is very handy to be able to have both of those
in one place. And that was the thing
that I hadn't really expected.
I thought that customers would actually start to migrate
to some of our cloud-based tools,
especially as they start to do more and more DevOps.
And what I'm finding, instead, is that they have been using
the Orion platform for a really long time,
and it runs really well in the cloud,
and it provides capabilities that they're familiar with,
and that take care of that sort of last mile of the
cloud-to-ground part.
So, I think just be open and spend some time
in the Customer Success Center.
Learn how to script a little bit.
The scripts that I showed you here
were actually written in Python,
because I just prefer an actual language to Bash.
I know you prefer Bash, but whatever.
You've got your own thing going on there.
But this is something that you should experiment with.
Set up a lab.
That's the great thing about the environment
that all of this was built out of--
a lot of it is actually free-tier.
And I got started with it and set it up,
and I can build it and tear it down,
and I don't have to worry about it.
And learn them before you go and actually do this.
>>Yeah, that's how they get you addicted--
low price for addiction.
>>Yeah, they get you on the comeback.
All right. So you think we got it all covered?
>>Yeah, I think so.
I'm slightly concerned we were moving too fast
in the session. But you and I know Orion really well
at this point.
You've been running in AWS and Azure a while, I guess.
>>Four years and two years, yeah.
>>And if I was watching this as a customer,
the takeaway might be more anxiety rather than enthusiasm.
>>That's possible.
I hope that's not the case.
And if they experiment,
they'll probably find that that's not--
but that's also why there's replay and lots of links
in the description here.
And of course, we're going to be on live chat.
So we would love to hear your questions
and talk to us about your experiences.
And there's tons of conversations about this
out on THWACK community,
about how and why they're relocating
their monitoring systems to the cloud.
And really, it's not all that hard.
I mean, it parallels most of the rest of the sort
of data center workload migration.
It has some amazing benefits, and I think it's really cool.
>>Yeah, I bet right about now,
you wish you'd actually grabbed some
of the customer interviews at Cisco Live.
>>Yeah, because basically I'd say,
"Hey, check out some of these customer interviews
"from Cisco Live. Roll that tape."
And then you guys would basically hear what Chris and I did,
not just at Cisco Live, but at events for the last year--
which is they basically said,
"Oh, I'm kind of nervous. I feel like I'm being forced."
And then actually, it wasn't that hard.
>>You try it; it's not that hard.
>>And in some cases, it runs better than it even did on-prem.
>>Yeah, well hopefully you've enjoyed our session today.
Please keep your questions and comments coming.
We're in the THWACK forums every day.
And of course, you can always ping us directly.
Patrick, are you sad you're going to have to tear down
all of your pretty infrastructure here?
>>I am, but it was a lot to set up just for one session
and for training. But there's a certain budget
that goes along with this that I don't want to exceed.
But really, I built most of this using scripts
and using the command line tools
from all three of the cloud providers.
So I could probably respawn maybe 90% of it--
with the exception of the VPNs--in about an hour.
>>Yeah, of course.
Developers?
>>I hope to resemble that remark.
Well, thank you all for joining our session today.
And we'll see y'all in THWACK.
[upbeat music] ♫ Yeah
♫ Yeah
♫ Yeah
♫ THWACKcamp!