A healthy data center begins with a wellness self-assessment

by CXOtoday Staff    Jan 18, 2011

In many ways, managing your health is like maintaining a data center. You need to identify the source of ‘pain’ in the data center, and then take steps to alleviate that pain.

What does it mean for a data center to be healthy? You should be able to meet service expectations and commitments. You should be able to maintain control by defining and managing risk. The data center should help your company get the most out of every person, asset, project, and activity. And your data center should enable your IT organization to operate at the speed of the business.

Is your data center in good shape? In this article, Mark Settle, CIO of BMC Software, outlines some questions to help you assess the health of your data center so that you can take the appropriate actions to achieve wellness.

Take Inventory
Gathering the vital statistics of your data center begins with finding the answer to this fundamental question:
• Do I have an accurate asset inventory of my hardware?

Having information about your current equipment gives you leverage with your vendors.
Yet, in many companies, finding the current book value of the existing blades could tie up operations staff for days: manually checking what’s out there, getting into the fixed asset records, researching other people’s spreadsheets, and so on.

Rather than waiting until you need the information, ask yourself these questions today:
• What do I own?
• How much of it do I have?
• Where is it located?
• What is the financial value of the technology I own?
• What’s still on my balance sheet that I’m depreciating?
• What am I leasing?
• How many generations of technology have I fallen behind?
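As a rough illustration of how an inventory record could answer several of these questions at once, here is a minimal Python sketch. The `Asset` fields, the straight-line depreciation with zero salvage value, and the treatment of leased equipment are all assumptions made for the example; your finance team's rules may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Asset:
    """One row of a hypothetical hardware inventory."""
    name: str
    location: str
    purchase_cost: float       # original cost, in your currency
    purchase_date: date
    useful_life_years: int     # depreciation period (assumed set by finance)
    leased: bool = False


def book_value(asset: Asset, as_of: date) -> float:
    """Straight-line book value, assuming zero salvage value."""
    if asset.leased:
        return 0.0  # simplification: leased gear is not counted as owned
    age_years = (as_of - asset.purchase_date).days / 365.25
    remaining = max(0.0, 1.0 - age_years / asset.useful_life_years)
    return round(asset.purchase_cost * remaining, 2)


def summarize(assets: list[Asset], as_of: date) -> dict:
    """Answer: how many, where, what is leased, what is it worth."""
    by_location: dict[str, int] = {}
    for a in assets:
        by_location[a.location] = by_location.get(a.location, 0) + 1
    return {
        "count": len(assets),
        "by_location": by_location,
        "leased": sum(1 for a in assets if a.leased),
        "still_depreciating": sum(1 for a in assets if book_value(a, as_of) > 0),
        "total_book_value": round(sum(book_value(a, as_of) for a in assets), 2),
    }
```

Calling `summarize(inventory, date.today())` on a list of such records gives a quick snapshot; the point is that the answer comes from one maintained record set rather than from several people's spreadsheets.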

Address Capacity Management
Virtualization in the data center and the move to cloud computing have brought renewed interest in capacity management. In the past, as many IT shops moved from the mainframe to a distributed environment, they tended to buy heterogeneous equipment that was running at 20 percent or less capacity. Servers were so inexpensive that they tended to purchase more and more of them, even though the servers might be used only 15 percent of the time. However, the pendulum is swinging back again, and we’re starting to realize that running below capacity is a frivolous waste of money. The servers not only take up space, but they also increase your energy costs.
Ask these questions to help determine the cost of your investment in servers:
• How much heat are my servers generating?
• How much time is my operations staff spending on maintenance and wiring?
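To put a number on under-used capacity, a sketch like the following could flag servers whose average utilization sits at or below the 20 percent mark mentioned above and estimate what each one costs to keep powered. The function names, the utilization samples, the wattage, and the electricity price are hypothetical inputs, not measured values.

```python
def underutilized(servers: dict[str, list[float]], threshold: float = 0.20) -> list[str]:
    """Return names of servers whose mean utilization is at or below threshold.

    `servers` maps a server name to utilization samples between 0.0 and 1.0.
    """
    flagged = []
    for name, samples in servers.items():
        if samples and sum(samples) / len(samples) <= threshold:
            flagged.append(name)
    return sorted(flagged)


def annual_energy_cost(watts: float, price_per_kwh: float, hours: float = 24 * 365) -> float:
    """Energy cost of running one server continuously for a year.

    Ignores cooling overhead, which in practice adds to the bill.
    """
    return round(watts / 1000 * hours * price_per_kwh, 2)
```

Multiplying the per-server energy cost by the number of flagged servers gives a first, conservative estimate of what idle capacity is costing you, before cooling and floor space are counted.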

Control Labor Costs
After you’ve purchased hardware, integrated it with the network, and paid your costs for running the network, you also have to understand the operational cost of labor. In fact, labor costs can represent one of the most significant aspects of your operational budget. Labor efficiency is a key measure of data center efficiency.

Maintain Security
The physical security of data centers is generally well managed. It’s more likely that security could be jeopardized by an electronic invasion. As hackers develop new technologies and look for different kinds of targets, it’s important to develop ways to protect your infrastructure.

To verify that you have the right security safeguards in place, you should know:
• Do I have a third party performing unannounced penetration testing? (The CIO and the VP of infrastructure would likely know about these tests.)
• Do they test my internal defense systems so that problems can get caught and/or reported?
• Am I able to test various applications randomly at different points in time?
• Is IT able to intercept viruses?
• Is my company protected against a hacker trying to get into the systems?

Provide Effective Patch Management
Microsoft puts out patches frequently. It’s a good idea to trickle out the patches so that they consume less network bandwidth than releasing them all at once. If you are not patching effectively, there will be a lot of calls to the service desk relating to the consequences of patch inconsistencies.
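One simple way to trickle out patches is to split the target machines into waves and release one wave at a time. The sketch below shows only the grouping step; the `patch_waves` helper and the wave size are illustrative assumptions, and the actual deployment would be handled by whatever patch tool you use.

```python
def patch_waves(hosts: list[str], wave_size: int) -> list[list[str]]:
    """Split hosts into consecutive waves of at most `wave_size` hosts each."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [hosts[i:i + wave_size] for i in range(0, len(hosts), wave_size)]
```

Releasing the first wave to a small, low-risk group also gives you a chance to catch a bad patch before it reaches everyone.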

Ask yourself these questions about your internal ticket flow:
• Am I doing more reactive work or proactive work?
• Which incidents have disrupted user productivity?
• Which incidents are simply user requests for increased IT capabilities and are not related to patch management?

Track the Right Metrics
According to a research firm, “Through 2015, 80 percent of mission critical outages will be caused by people and process issues, and more than 50 percent of those outages are caused by change/configuration/release integration and handoff issues.” That’s why it’s so important to measure the availability of your tier-one systems, including the outage times. Begin by tracking these metrics, and have reporting information available as well.

Answer the following questions:
• Am I tracking my tier-one systems?
• Am I getting better at tracking?
• What are my resolution times around sub-one problems?

A sub-one problem would be an outage of a component or service that wouldn’t necessarily derail the entire service. A classic example is a high-availability cluster of four servers where you lose just one server, yet the application stays up and running. The resulting problem is more related to a degradation of that service than to a failure. The users might not see a performance issue, but if you track this information, you would know that you’re running a riskier configuration because you lost one of the four servers.
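The distinction between a full outage and a sub-one problem can be expressed as a small classification rule. This is a sketch under stated assumptions: the `classify_cluster` name, the three state labels, and the idea that the service survives as long as a minimum number of nodes is healthy are illustrative, not a standard.

```python
def classify_cluster(total_nodes: int, healthy_nodes: int, min_nodes_for_service: int = 1) -> str:
    """Classify a cluster as 'healthy', 'degraded' (sub-one), or 'outage'.

    min_nodes_for_service is the assumed number of nodes the application
    needs in order to stay up.
    """
    if healthy_nodes < min_nodes_for_service:
        return "outage"      # the service itself is down
    if healthy_nodes < total_nodes:
        return "degraded"    # service is up, but redundancy has been lost
    return "healthy"
```

For the four-server cluster in the example, `classify_cluster(4, 3)` reports the degraded state even though users may notice nothing, which is exactly the information worth recording.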

Address the following questions:
• What is the outage time around my tier-one applications?
• How am I monitoring that outage?
• How robust is the monitoring?
• Am I using external, synthetic, transaction-generator type tools?
• Are my monitoring tools hiding data from other devices?
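Outage time for a tier-one application is often summarized as an availability percentage over a reporting period. A minimal sketch, assuming you already record the length of each outage in minutes (the function name and inputs are hypothetical):

```python
def availability_percent(outage_minutes: list[float], period_minutes: float) -> float:
    """Percentage of the reporting period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    downtime = sum(outage_minutes)
    return round(100.0 * (1 - downtime / period_minutes), 3)
```

The figure is only as trustworthy as the outage records behind it, which is why the robustness of the monitoring matters as much as the number itself.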

This kind of monitoring is similar to wearing a wrist device that measures your blood pressure, pulse rate, and temperature and routinely sends that information off to your health care provider. The program receiving this data looks at static and dynamic thresholds. It recognizes that a rise in your heart rate from soccer practice every afternoon at four o’clock or from time on the treadmill every morning is to be expected. A rise in your heart rate at an unexpected time, however, would signal an alarm. With that in mind, be sure to answer the following question:

• Do your tools function as an external interrogator?

Having an external interrogator is similar to having your doctor do more than merely collect your vital statistics. When the doctor asks you probing questions, your condition becomes clearer. Similarly, when your tools function as an external interrogator, you learn much more about the health of your servers.
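The static and dynamic thresholds from the wrist-device analogy can be sketched in a few lines. Here the static threshold is a fixed ceiling, and the dynamic threshold is derived from past readings taken at the same hour of day. The `is_alarm` function, the three-standard-deviation rule, and the shape of the `history` data are assumptions chosen for illustration; real monitoring products use their own baselining methods.

```python
from statistics import mean, stdev


def is_alarm(value: float, hour: int, history: dict[int, list[float]],
             static_max: float, sigmas: float = 3.0) -> bool:
    """Return True if a reading breaches the static or the dynamic threshold.

    history maps an hour of day (0-23) to past readings taken at that hour.
    """
    if value > static_max:
        return True              # static threshold: never acceptable
    past = history.get(hour, [])
    if len(past) < 2:
        return False             # too little data to build a baseline
    baseline, spread = mean(past), stdev(past)
    # dynamic threshold: unusual for this particular time of day
    return value > baseline + sigmas * spread
```

With this rule, a reading that is normal during the four o'clock peak raises no alarm, while the same reading at three in the morning does.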

Look at the Ultimate Metric
Another principal, and often overlooked, measurement is to keep a tally of all sub-one outages that affect significant numbers of people (five to ten people or more). These should be automatically reported as part of your monitoring strategy. For example, you shouldn’t have to wait for a call to the service desk to learn that the Amsterdam network switch has gone down.

The problem with many monitoring tools is that they generate a tremendous number of false signals. You don’t know which signal is valid and which one is not.

Automation Is the Path to Data Center Efficiency
Automation can keep your data center running smoothly, improve efficiency, and help increase your data center’s overall health. People can only scale so much, and automation helps your organization reach a level of maturity in processes that you can’t easily achieve on your own.

Repeatable processes bring an immense amount of value, ultimately affecting how well you satisfy the needs of the business. By addressing the questions discussed in this article, you can help ensure that your data center is healthy and well equipped to manage physical and virtual environments.