Wednesday, September 18, 2013

OpenManage Essentials Network Connections (TCP/UDP ports)

I was recently engaged to implement the datacenter version of DELL OME (OpenManage Essentials). DELL OME is quite an easy and efficient tool for basic DELL hardware management. In other words, it is a free-of-charge element management system for DELL servers, network devices, and also some storage elements. It allows you to do typical administrator tasks like:
  • Hardware Discovery and Inventory
  • Monitor Hardware Status
  • Send email notifications or forward SNMP traps to another management system
  • Inventory and System Reporting
  • In-band (OMSA) or Out-of-band (DRAC) Server Firmware Management - upgrades and downgrades
It is important to note that DELL OME is not an enterprise management system like Altiris or MS System Center. For customers considering integrating DELL hardware into an enterprise management system, DELL very likely has an integration toolkit or management plugin for that particular system. But that's another story.

DELL OME is straightforward to install in a small environment, but it is usually more complex to implement in a bigger enterprise environment where firewalls with strict policies exist. In such environments you have to cooperate tightly with the network department to create firewall rules allowing communication between the OME server and the hardware elements.

Unfortunately, the OME User Guide doesn't describe the network connections in detail. TCP/UDP ports are listed, but for firewall rules you need to know the detailed network flows and flow directions.

That's the reason I have created the document "Open Manage Essentials Network Connection and useful information for creating firewall rules" and published it on SlideShare here.

Direct link to the document:
http://www.slideshare.net/davidpasek/ome-network-connections-and-firewall-rules-v04

And as always ... any comments are highly appreciated.

Tuesday, September 17, 2013

DELL Force10 configuration for VMware VXLAN transportation

Right now I am working on a vSphere design where network virtualization is leveraged to simplify network management and provide segmentation of multiple tenants. Therefore I tested VXLANs in my lab. I have the equipment listed below:
  • 1x DELL Blade Chassis  M1000e
  • 2x DELL Force10 IOA (IO Aggregators - blade chassis network modules)
  • 2x DELL Force10 S4810 as top of the rack switches
  • 1x DELL Force10 S60 acting as physical router (L3 switch)
  • 1x DELL EqualLogic storage PS-4110 (iSCSI storage module inside Blade Chassis)

Here are the VXLAN physical switch requirements:
  • The minimum MTU size requirement is 1600; however, we will use maximum jumbo frames across the physical network
  • IGMP snooping should be enabled on L2 switches, to which VXLAN participating hosts are attached.
  • IGMP querier must be enabled on the router or L3 switch with connectivity to the multicast-enabled networks.
  • Multicast routing (PIM-SM) must be enabled on routers.

Force10 switches are by default configured to allow jumbo frames. However, physical interfaces, VLAN interfaces, and port-channels have to be configured explicitly.
On Force10 S-series switches the interface MTU can be set up to 12000. In CISCO Nexus environments the maximum MTU is 9216.

Force10 IOA (I/O Aggregator) is by default set to MTU 12000 so it is already prepared for VXLAN and nothing has to be configured.

Let's assume we use VLAN 14 for VXLAN transport.

Router config (Force10 S60)
config
igmp snooping enable
ip multicast-routing
interface vlan 14
  ip pim sparse-mode
  mtu 12000
  tagged gigabitethernet 0/46-47
  exit
! For all interfaces in VLAN 14 we have to set the MTU to at least 1600
interface range gigabitethernet 0/46 - 47
  mtu 12000
  end
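To sanity check the multicast part of the configuration you can use the FTOS show commands sketched below. This is only a sketch; command availability and output format depend on the FTOS version running on the S60.
! Verify PIM adjacencies and IGMP group membership on the multicast router (sketch)
show ip pim neighbor
show ip igmp groups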
Switch config (Force10 S4810)
! IGMP snooping must be enabled
conf
ip igmp snooping enable
interface vlan 14
  mtu 12000
  exit
interface range tengigabitethernet 0/46 , tengigabitethernet 0/48 - 51 , fortyGigE 0/56 , fortyGigE 0/60
  mtu 12000
  exit
interface range port-channel 1 , port-channel 128
  mtu 12000
  exit
end
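A quick way to verify that the MTU change really took effect on the S4810 is to check the interface details. This is just a sketch; the interface names match the example above and output formatting differs between FTOS versions.
! Verify MTU on a physical interface and on a port-channel (sketch)
show interfaces tengigabitethernet 0/48 | grep MTU
show interfaces port-channel 1 | grep MTU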
IO Aggregator (Force10 IOA)
The Force10 IOA already has the maximum MTU in its default factory settings, so it is VXLAN ready and no changes are required.

Here are Force10 IOA default values:
mtu 12000
ip mtu 11982
igmp snooping enabled
Check out these excellent blog articles for more details on VXLAN theory and implementation:

VXLAN requirements
http://www.yellow-bricks.com/2012/10/04/vxlan-requirements/

VXLAN on UCS and vSphere: from L3 to Nexus 1000V
http://vmtrooper.com/vxlan-on-ucs-and-vsphere-from-l3-to-nexus-1000v/

Adjusting MTU and Configuring Jumbo Frame Settings
http://www.force10networks.com/CSPortal20/TechTips/0008_mtu-settings.aspx

UPDATE 2015-02-02:
I have a multicast router enabled on my VLAN 14 (see the configuration of the Force10 S60), therefore it works as the IGMP querier. However, if you need to run the VXLAN overlay over a network without a multicast router, you should configure an IGMP querier on the particular VLAN, otherwise multicast traffic will be flooded into the whole broadcast domain (VLAN). The IGMP querier can be configured with the following command:
ip igmp snooping querier 
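For completeness, here is a sketch of where the command belongs. I assume it is applied under the VXLAN transport VLAN interface (VLAN 14 in this example) on the L2 switch; double-check the exact context against your FTOS documentation.
conf
interface vlan 14
  ip igmp snooping querier
  exit
end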



Sunday, September 15, 2013

Default credentials and initial setup of VMware vSphere components

vCenter Server Appliance
Username: root
Password: vmware

vShield Manager
Username: admin
Password: default

Initial setup:
  1. Log in to console to use CLI
  2. enable
  3. setup (it will start a setup wizard where you can set the network settings of the vShield Manager appliance)
  4. Log out from console
  5. Log in to web management https://A.B.C.D/ (A.B.C.D is the address of the vShield Manager appliance; use the default credentials)
  6. Continue configuration in web management.

High latency on vSphere datastore backed by NFS

Last week one of my customers experienced high latency on a vSphere datastore backed by an NFS mount. Generally, the usual root cause of high latency is too few disk spindles behind a particular datastore, but that was not the case here.

NFS datastore for vSphere
Although NFS was always perceived as a lower storage tier, VMware and NFS vendors have been working very hard on NFS improvements in recent years. Another plus for NFS nowadays is that 10Gb Ethernet is already a commodity, which helps NFS significantly because it doesn't support multi-pathing (aka MPIO) as FC or iSCSI does. On the other hand, it is obvious that NFS is another abstract storage layer for vSphere, and details like the NFS client implementation, Ethernet/IP queue management, QoS, and so on can impact the whole solution. Therefore when someone tells me NFS for vSphere I'm always cautious. Don't get me wrong, I really like abstractions, layering, unification, and simplification, but they must not have any influence on stability and performance.

I don't want to discuss the advantages and disadvantages of each protocol as it depends on particular environment requirements and what someone wants to achieve. By the way, I have recently prepared one particular design decision protocol comparison for another customer here, so you can check it out and comment on it there.

In this case the customer had a really good reason to use NFS, but the latency issue was a potential show stopper.

I have to say that I also had a bad NFS experience back in 2010 when I was designing and implementing Vblock0 for one customer. Vblock0 used EMC Celerra, therefore NFS or iSCSI were the only options. NFS was the better choice because of the Celerra iSCSI implementation (that's another topic). We were not able to decrease disk response times below 30ms, so in the end NFS (EMC Celerra) was used as Tier 3 storage and the customer bought another block storage (EMC Clariion) for Tier 1. That's history now; I was implementing the then-new vSphere 4.1, SIOC had just been introduced, and there was no broad knowledge yet about SIOC benefits, especially for NFS.

A lot has changed with NFS since then, so that's just history and field experience from one engagement. Let's go back to today's high latency problem on NFS and the troubleshooting steps we did with this customer.

TROUBLESHOOTING

Environment overview
The customer has vSphere 5.0 (Enterprise Plus) Update 2 patched with the latest patches (ESXi build 1254542).
The NFS storage is a NetApp FAS running the latest ONTAP version (NetApp Release 8.2P2 7-Mode).
Compute is based on CISCO UCS and networking on top of UCS is based on Nexus 5500.

Step 1/ Check SIOC or MaxQueueDepth
I told the customer about the known NFS latency issue documented in KB article 2016122 and broadly discussed in Cormac Hogan's blog post here. Based on the community and my own experience, my hypothesis is that the problem is not related only to NetApp storage but is most probably an ESXi NFS client issue. This is just my opinion without any proof.

Active SIOC or /NFS/MaxQueueDepth = 64 is the workaround documented in the KB article mentioned earlier. Therefore I asked them if SIOC was enabled as we discussed during the Plan & Design phase. The answer was yes, it is.

Hmm. Strange.

Step 2/ NetApp firmware
Yes, this customer has a NetApp filer, and the KB article has an updated comment that the latest NetApp firmware solves this issue. The customer has the latest 8.2 firmware which should fix the issue, but it evidently doesn't help.

Hmm. Strange.

Step 3/ Open support case with NetApp and VMware
I suggested opening support cases with both vendors and continuing the troubleshooting in parallel.

I don't know why, but customers in the Czech Republic are ashamed to use the support line. I don't know why, when they are paying a significant amount of money for it. But it is how it is, and even this customer didn't engage VMware or NetApp support and continued with troubleshooting. OK, I understand we can solve everything by ourselves, but why not ask for help? That's more of a social than a technical question, and I would like to know if this administrator behavior is a global habit or something special here in central Europe. Don't be shy and speak out in the comments even about this more social subject.

Step 4/ Go deeper in SIOC troubleshooting

Check if storageRM (Storage Resource Management) is running
/etc/init.d/storageRM status
Enable advanced logging in Software Advanced Settings -> Misc -> Misc.SIOControlLogLevel = 7
By default the value is 0; 7 is the maximum value.
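If you prefer the ESXi shell over the vSphere Client, the same advanced setting can very likely be changed with esxcli. The option path below is my assumption derived from the setting name, so verify it with the list command first.
# Show the current value (option path assumed from the Misc.SIOControlLogLevel setting name)
esxcli system settings advanced list -o /Misc/SIOControlLogLevel
# Raise logging to the maximum level 7 for troubleshooting
esxcli system settings advanced set -o /Misc/SIOControlLogLevel -i 7
# ... and set it back to 0 when troubleshooting is done
esxcli system settings advanced set -o /Misc/SIOControlLogLevel -i 0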

The customer found a strange log message in "/var/log/storagerm.log":
Open /vmfs/volumes/ /.iorm.sf/slotsfile (0x10000042, 0x0) failed: permission denied 
There is no VMware KB for it, but Frank Denneman blogged about it here.

So the customer is experiencing the same issue as Frank did in his lab.

The solution is to change *nix file permissions as Frank was instructed by VMware Engineering (that's the beauty of having direct access to engineering) ...

chmod 755 /vmfs/volumes/DATASTORE/.iorm.sf/slotsfile

The change takes effect immediately and you can verify it in "/var/log/storagerm.log":
...
DATASTORE: read 2406 KB in 249 ops, wrote 865 KB in 244 ops avgReadLatency
1.85, avgWriteLatency 1.42, avgLatency 1.64 iops = 116.59, throughput =
773.65 KBps
...
Advanced logging can be disabled in Software Advanced Settings -> Misc -> Misc.SIOControlLogLevel = 0

After this, latency normalized to between 5-7 ms, which is quite normal.

Incident solved ... waiting for other incidents :-)

Problem management continues ...

Lessons learned from this case
SIOC is an excellent VMware technology helping with datastore-wide performance fairness. In this example it helped us significantly with dynamic queue management, improving NFS response times.

However, even excellent technology can have bugs ...

SIOC can be leveraged only by customers with Enterprise Plus licenses.

Customers with lower licenses have to use a static queue value (/NFS/MaxQueueDepth) of 64 or even less, based on response times. BTW, the default max NFS queue depth value is 4294967295. I understand NFS.MaxQueueDepth as the equivalent of Disk.SchedNumReqOutstanding for block devices. The default value of Disk.SchedNumReqOutstanding is 32, which helps with sharing LUN queues that usually have a queue depth of 256. It is OK for usual situations, but if you have more disk-intensive VMs per LUN then this parameter can be tuned. This is where SIOC helps us with dynamic queue management, even across ESX hosts sharing the same device (LUN, datastore).
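For reference, on vSphere 5.x the static queue value can be checked and set from the ESXi shell as sketched below; KB 2016122 describes whether a host reboot is needed for the change to fully apply on your build.
# Show the current NFS queue depth (default is 4294967295)
esxcli system settings advanced list -o /NFS/MaxQueueDepth
# Apply the static workaround value of 64
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64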

For a deep-dive explanation of Disk.SchedNumReqOutstanding I suggest reading Jason Boche's blog post here.

Static queue management brings significant operational overhead and maybe other issues we don't know about right now. So go with SIOC if you can; if you have an enterprise environment, consider upgrading to Enterprise Plus. If you still have response time issues, troubleshoot SIOC to verify it does what it is supposed to do.

Anyway, it would be nice if VMware could improve the NFS behavior. SIOC is just one of two workarounds we can use to mitigate the risk of high latency on NFS datastores.

Unfortunately, the customer didn't engage the VMware Global Support Organization, therefore nobody in VMware knows about this issue and cannot write a new or update an existing KB article. I'll try to make some social network noise to help highlight the problem.

Friday, September 13, 2013

Troubleshooting Storage Performance in vSphere

A very good blog post series introducing storage performance troubleshooting in VMware vSphere infrastructures.

Part 1 - The Basics
Part 2 - Troubleshooting Storage Performance in vSphere
Part 3 - SSD Performance

Everybody should read these storage basics before deep diving into storage performance in shared infrastructures.

Wednesday, September 11, 2013

NFS or Fibre Channel Storage for VMware vSphere?

The final decision depends on what you want to get from your storage. Check out my newly uploaded presentation on SlideShare: http://www.slideshare.net/davidpasek/design-decision-nfsversusfcstorage-v03 where I'm trying to compare both options against specific requirements from a real customer engagement.

If you have any storage preference, experience or question please feel free to speak up in the comments.

What type of NIC teaming, loadbalancing and physical switch configuration to use for VMware's VXLAN?

As a former CISCO UCS architect I have been observing the VXLAN initiative for almost 2 years, so I was looking forward to doing a real customer project. Finally it is here. I'm working on a vSphere design for vCloud Director (vCD). To be honest, I'm responsible just for the vSphere design and someone else is doing the vCD design, because I'm not a vCD expert and I have just conceptual and high-level vCD knowledge. I'm not planning to change that in the near future because I'm more focused on next-generation infrastructure, and vCD is in my opinion just another piece of software for selling IaaS. I'm not saying it is not important. It is actually very important, because IaaS is not just technology but a business process. However, nobody knows everything and I leave some work for other architects :-)

We all know that vCD sits on top of vSphere providing multi-tenancy and other IaaS constructs, and since vCD 5.1 the network multi-tenancy segmentation is done by the VXLAN network overlay. Therefore I finally have the opportunity to plan, design and implement VXLANs for a real customer.

Right now I'm designing the network part of the vSphere architecture, and I describe the VXLAN-oriented design decision point below.

VMware VXLAN Information sources:
I would like to thank Duncan for his blog post back in October 2012, right before VMworld 2012 in Barcelona, where VXLANs were officially introduced by VMware. Even though it is an unofficial information source it is very informative, and I'm verifying it against official VMware documentation and white papers. Unfortunately I have realized that there is a lack of trustworthy and publicly available technical information to this day, and some information is contradictory. See below what confusion I'm facing; I would be very happy if someone helped me jump out of the circle.

Design decision point:
What type of NIC teaming, loadbalancing and physical switch configuration to use for VMware's VXLAN?

Requirements:
  • R1: Fully supported solution
  • R2: vSphere 5.1 and vCloud Director 5.1
  • R3: VMware vCloud Network & Security (aka vCNS or vShield) with VMware distributed virtual switch
  • R4: Network Virtualization and multi-tenant segmentation with VXLAN network overlay 
  • R5: Leverage standard access datacenter switches like CISCO Nexus 5000, Force10 S4810, etc.
Constraints:
  • C1: LACP 5-tuple hash algorithm is not available on current standard access datacenter physical switches mentioned in requirement R5
  • C2: VMware Virtual Port ID loadbalancing is not supported with VXLAN (Source: S3)
  • C3: VMware LBT loadbalancing is not supported with VXLAN (Source: S3)
  • C4: LACP must be used with the 5-tuple hash algorithm (Source: S3, S2, S1 on page 48). [THIS IS A STRANGE CONSTRAINT, WHY IS IT HASH DEPENDENT?] Updated 2013-09-11: It looks like there is a bug in the VMware documentation and KB article. Thanks @DuncanYB and @fojta for confirmation and internal VMware escalations.
Available Options:
  • Option 1: Virtual Port ID
  • Option 2: Load based Teaming
  • Option 3: LACP
  • Option 4: Explicit fail-over

Option comparison:
  • Option 1: not supported because of C1
  • Option 2: not supported because of C2
  • Option 3: supported
  • Option 4: supported but not optimal because only one NIC is used for network traffic. 
Design decision and justification:
Based on the available information, options 3 and 4 comply with the requirements and constraints. Option 3 is better because network traffic is load balanced across physical NICs. That's not the case for option 4.

Other alternatives not compliant with all requirements:
  • Alt 1: Use physical switches with 5-tuple hash loadbalancing. That means high-end switch models like Nexus 7000, Force10 E Series, etc.
  • Alt 2: Use CISCO Nexus 1000V with VXLAN. They support LACP with any hash algorithm. 5-tuple hash is also recommended but not strictly required.
Conclusion:
I hope some of the information in constraints C2, C3, and C4 is wrong and will be clarified by VMware. I'll tweet this blog post to some VMware experts and hope someone will help me jump out of the decision circle.
If you have any official/unofficial topic related information or you see anything where I'm wrong, please feel free to speak up in the comments.
Updated 2013-09-11: Constraint C4 doesn't exist and the VMware documentation will be updated.
Based on the updated information, LACP and "Explicit fail-over" teaming/load-balancing are supported for VXLANs. LACP is the better way to go and "Explicit fail-over" is the alternative in case LACP is not achievable in your environment.

Tuesday, September 10, 2013

Storage System Performance Analysis with Iometer

An excellent write-up about Iometer usage is here.

Quick troubleshooting of ESX with 10Gb Broadcom NetXtreme II negotiated only at 1Gb

I have just realized that the vmnic(s) in one DELL M620 blade server (let's call it BLADE1) are connected only at 1Gb speed, even though I have 10Gb NIC(s) connected to Force10 IOA blade module(s). They should be connected at 10Gb, and another blade (let's call it BLADE2) with the same config really is connected at 10Gb speed.

So, quick troubleshooting ... we have to find where the difference is.

Let's go step by step ...

  1. NIC ports on the ESX vSwitch in BLADE1 are configured to use auto negotiation, so no problem here.
  2. Ports on the Force10 IOA are also configured for auto negotiation and the configuration is consistent across all ports in the switch modules, so that's not a problem.
  3. ESX builds are the same on both blade servers.
  4. What are the NIC firmware versions? On BLADE1 there is 7.2.14 and on BLADE2 there is 7.6.15.

Bingo!!! Let's upgrade the NIC firmware on BLADE1 and check if this was the root cause of the problem ...
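For the record, the negotiated link speed and the NIC firmware version can be checked directly from the ESXi shell, which makes comparing BLADE1 and BLADE2 quick; vmnic0 below is just a placeholder for whichever uplink you are checking.
# List all NICs with their negotiated link speed and driver
esxcli network nic list
# Show driver and firmware details for a particular NIC (vmnic0 is a placeholder)
esxcli network nic get -n vmnic0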

Monday, September 09, 2013

Using SSL certificates for VMware vSphere Components

Streamlining the certificate replacement and management process in a VMware environment can be challenging at times. For instance, changing certificates for vCenter 5.1 is a hugely laborious process. And in a typical environment where a large number of hosts are running, tracking and managing their certificates is difficult and time consuming. More importantly, security breaches due to lapsed certificates can prove to be very expensive to the organization. vCert Manager from VSS Labs provides fully automated management of SSL certificates in a VMware environment across the entire lifecycle.

VSS Labs has a solution to simplify SSL management. For more info look at http://vsslabs.com/vCert.html

To be honest, I have had no chance to test it because I avoid signed SSL certificates if possible. However, when I have a customer who requires SSL, I will definitely have to evaluate the VSS Labs solution.

Wednesday, September 04, 2013

OpenManage Integration for VMware vCenter 2.0

OpenManage Integration for VMware vCenter 2.0 is the new generation of the DELL vCenter management plugin, targeted as a plugin for the vSphere 5.5 Web Client.



Looking forward to testing it with vSphere 5.5 in my lab.

Monday, September 02, 2013

Configure Force10 S4810 for SNMP

Enabling SNMP on Force10 S4810 switches is straightforward. Below is a configuration sample.

conf

! Enable SNMP for read only access
snmp-server community public ro

! Enable SNMP traps and send them to SNMP receiver 192.168.12.70
snmp-server host 192.168.12.70 version 1
snmp-server enable traps
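To confirm the switch answers SNMP queries, a simple test from any Linux management host with Net-SNMP installed can be used; 192.168.12.1 below is just a placeholder for the switch management IP.
# Query the standard system MIB subtree using the read-only community configured above
snmpwalk -v 2c -c public 192.168.12.1 system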

Configuring Dell EqualLogic management interface

All credits go to Mike Poulson because he published this procedure back in 2011.
[Source: http://www.mikepoulson.com/2011/06/configuring-dell-equallogic-management.html]

I have just rewritten, formatted, and slightly changed the most important steps for the EqualLogic out-of-band interface IP configuration.

The Dell EqualLogic iSCSI SAN supports an out-of-band management network interface. This is for managing the device from a separate network than the iSCSI traffic is on. So this is a quick set of commands that are used to configure the management (in this case eth2) interface on the device.

The web interface is nice and all but you have to have your 10Gig network setup before you can access it. Also the "setup" does not really give you an easy option to configure the management interface.

Steps:
Log in to the console port with username grpadmin and password grpadmin.

After you run setup you will need to know the "member name". You can get your member name by running the command
member show
This will list the name, status, version, and size information for each member configured on the array. Here is an example:

grpname> member show
Name Status Version Disks Capacity FreeSpace Connections
---------- ------- ---------- ----- ---------- ---------- -----------
member01 online V4.3.6 (R1 16 621.53GB 0MB 0
grpname>


The member name for my device is member01.

Once you know the member name you will need to set the IP address for your management interface. This IP address will need to be one that you can access from your management network. The port is an untagged port similar to other out-of-band management ports on devices (network switches).

To configure the IP use steps described below.

  • First set the interface to be management ONLY. Use the member command again.
member select member01 eth select 2 mgmt-only enable
  • Set the IP address and Network Mask
member select member01 eth select 2 ipaddress xxx.xxx.xxx.xxx netmask 255.255.255.0
  • Enable the interface (by default the MGMT (eth2) interface is disabled and will not provide a LINK).
member select member01 eth select 2 up
  • Then you will be asked to confirm that you wish to enable the Management port
This port is for group management only. If enabling, make sure it is connected to a dedicated management network. If disabling, make sure you can access the group through another Ethernet port.
 

Do you really want to enable the management interface? (y/n) [n] y
  • To view current IP and state of an Eth interface use
member select member01 show eths

Once that is complete you can use the management IP address to establish an http or https connection to the Array.

Veeam Backup Components Requirements

Veeam is excellent backup software for virtualized environments and is relatively easy to install and use. However, when you have a bigger environment and are looking for better backup performance, it is really important to know the infrastructure requirements and size your backup infrastructure appropriately.

Here are the hardware requirements for the particular Veeam components.
 
Veeam Console
  Windows Server 2008 R2
  4GB RAM + 0.5GB per concurrent backup job

Veeam Proxy
  Windows Server 2008 R2
  2GB RAM + 0.2GB per concurrent task

Veeam WAN Accelerator
  Windows Server 2008 R2
  8GB RAM
  Disk for cache

Veeam Repository
  Windows Server 2008 R2, Linux or NAS (CIFS) 
  4GB RAM + 2GB per each concurrent ingress backup job
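A quick worked example of how I read these sizing rules: a Veeam console running 8 concurrent backup jobs would need roughly 4 + 8 x 0.5 = 8 GB RAM, a proxy handling 10 concurrent tasks roughly 2 + 10 x 0.2 = 4 GB RAM, and a repository ingesting 4 concurrent backup jobs roughly 4 + 4 x 2 = 12 GB RAM. The job and task counts are only illustrative assumptions; always validate the final sizing against the Veeam documentation for your version.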