Sluggish performance within Windows 7/10 VDI session (Non network related!)

I’ve now come across this with multiple customers. Most of the time it shows up when people are moving to Windows 10: after experiencing poor performance with Windows 7 desktop VMs, they buy all new tin as part of the migration process.

They experience sluggish responses when opening various apps, and with IE, for example, it’s painfully obvious that something is wrong. It will spike the CPU to a high percentage and take quite a while to open, then the CPU will drop and it’ll be OK for browsing until more tabs are opened, at which point similar behaviour occurs.

When this occurs, I always ask if there are any resource pools set, any QoS or similar on the storage and so on, but generally, I know there aren’t. vSphere monitoring will show that everything is OK, and so will anywhere you look within Horizon and vROps. I’ll ask if the physical hardware has power management/saving turned on within the BIOS and I usually get the following answers:

“Oh, I’m sure it’s been turned off, but I didn’t build the servers”

“Yep, no power management, vSphere says so.”

“I don’t know. How do you even do that?”

“I’ve asked and they say it has been.”

Every time. Every. Time. 

It can be difficult to definitively show that this is the source of the problem when everything you have access to says it should all be OK, plus people have many different levels of experience with these things. I came across it about 7 years ago when the company I worked for had a mix of Dell and HP servers. Being a PC gaming enthusiast, I was always trying to eke as much performance out of a PC as I could, so I tried to do the same with servers – no overclocking though! So power saving settings would be the first thing to get turned off!

So if you’re experiencing similar problems, with apps spiking the CPU, being sluggish, then the CPU dropping, check your physical hardware to make sure power management isn’t set to power saving. It’ll save you a whole load of heartache! This applies to Citrix as well.

If you need proof of this, or want to check it, there are various tools, but I always check using SysTrack – we have a tool as part of the suite called Resolve, which allows for in-depth analysis of specific machines (as well as the ability to compare to other machines/groups), and this will show straight away if a machine is being throttled or has memory ballooning. Throttling can also show if a CPU is overheating and the BIOS is throttling it back to avoid shutting down – many a place has thought they need new machines, or new CPUs, but no, they simply need to get those dusty fans cleaned out!

[Screenshot: throttle]

It may show up a little bit too small for some screens, but what you’d see is that the CPU is throttled to 66%. The CPU usage is low, but due to the throttling, the thread count and interrupts per second are high. The CPU should be at 100%, or even higher with some modern CPUs. Unless you’re really trying to save some power, you don’t want it lower than 100%.

What’s also interesting is when people go back to the older servers with Windows 7 on and realise that the poor performance throughout was also due to the power management not being turned off… as a fair few manufacturers ship hardware with this as default…
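As a rough sketch of what that percentage means, the figure is just the current clock speed relative to the rated maximum. The function names and the sample clock speeds here are hypothetical, and this is not how SysTrack calculates it; on Windows the two inputs could come from something like the CurrentClockSpeed and MaxClockSpeed properties of the Win32_Processor WMI class.

```python
def throttle_percent(current_mhz, max_mhz):
    """Return the current clock speed as a percentage of the rated maximum."""
    if max_mhz <= 0:
        raise ValueError("max_mhz must be positive")
    return round(100.0 * current_mhz / max_mhz, 1)


def is_throttled(current_mhz, max_mhz, threshold=100.0):
    """True when the CPU is running below the threshold percentage.

    The 100% default mirrors the rule of thumb in the text; it is an
    assumption, so adjust it to suit your own environment.
    """
    return throttle_percent(current_mhz, max_mhz) < threshold


if __name__ == "__main__":
    # Hypothetical 3000 MHz CPU held back to 1980 MHz by power saving.
    print(throttle_percent(1980, 3000), is_throttled(1980, 3000))
```

Anything at or above 100% (a CPU boosting past its rated speed, for example) is reported as not throttled.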

Troubleshooting non responsive VMs

Sorry for the gap in posting…well…anything… New role has kept me very busy!

Most of us will have our own little tips and tricks on troubleshooting, but recently I had a customer who had a machine hanging, which looked suspiciously like VMware Tools was causing an issue, and they had no idea how to troubleshoot it. I’d suggested various options, including simply getting the logs, checking ESXi services and so on, but it wasn’t anything they’d had to do before, so I really needed something quick and fully featured to suggest to them. There’s a great VMware KB which uses the process of:

Validate the scope – find out the scope of the problem and accurately define what the symptoms are (no point in just having someone screaming “It’s crashing, IT’S CRASHING!”)

Identify the cause – so many possibilities! Storage, services crashing, resource contention, a task on the VM…

Action Plan – Take action to remediate the issue – once the cause has been established, focus on what is causing the issue and define a plan to resolve it.

https://kb.vmware.com/s/article/1007819

DFSR Troubleshooting and considerations

When you have a few DFS servers, everything seems fairly manageable. Add a few more…Yeah… All good… Get an issue, oh joy! Look at all these servers I need to examine in minute detail!

So I’ll start with an MS Blog that deals with config mistakes

Common DFSR Mistakes and Oversights

Might be a few banged heads on the desk (your own!) when you read some of that!

And put together by my own fair hand, some troubleshooting from the DFSR Management tool and some useful DFSR Diag commands

DFSR Management Tool

Verify topology simply checks the servers are contactable, which is useful, but does not verify replication.

DFS Management includes the ability to run a propagation test and generate two types of diagnostic reports—a propagation report and a general health report:

Propagation test    Tests replication progress by creating a test file in a replicated folder.

Propagation report    Generates a report that tracks the replication progress for the test file created during a propagation test.

Health report    Generates a report that shows the health of replication and replication efficiency.

To create a diagnostic report for DFS replication

Click Start, point to Administrative Tools, and then click DFS Management.

In the console tree, under the Replication node, right-click the replication group that you want to create a diagnostic report for, and then click Create Diagnostic Report.

Follow the instructions in the Diagnostic Report Wizard.

Perform all 3 tests and save the resulting xml/html report file.

DFSRdiag

This is the command line tool for DFSR – useful commands are:

dfsrdiag ReplicationState /all – shows the current replication state, with verbose output

dfsrdiag pollad – makes the server check in with Active Directory

List DFS replication groups:

dfsradmin rg list

List replicated folders in a replication group:

dfsradmin rf list /rgname:<REPL_GROUP>

List members of a replication group:

dfsradmin mem list /rgname:<REPL_GROUP>

List the local folders that correspond to replicated folders of a replication group:

dfsradmin membership list /rgname:<REPL_GROUP> /attr:RfName,MemName,LocalPath
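If you run these listing commands a lot, a small wrapper saves retyping the group name. This is a minimal sketch: inventory_commands and run_inventory are hypothetical helper names, and it assumes dfsradmin is on the PATH of the Windows server you run it from.

```python
import subprocess


def inventory_commands(rg_name):
    """Build the dfsradmin listing commands above for one replication group."""
    return [
        ["dfsradmin", "rg", "list"],
        ["dfsradmin", "rf", "list", f"/rgname:{rg_name}"],
        ["dfsradmin", "mem", "list", f"/rgname:{rg_name}"],
        ["dfsradmin", "membership", "list", f"/rgname:{rg_name}",
         "/attr:RfName,MemName,LocalPath"],
    ]


def run_inventory(rg_name):
    """Run each command and return its raw output keyed by the command line.

    Only works where dfsradmin is installed; the output is not parsed.
    """
    results = {}
    for cmd in inventory_commands(rg_name):
        completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results[" ".join(cmd)] = completed.stdout
    return results
```

Calling run_inventory with your replication group name gives you the raw text of all four listings in one go, ready to save alongside the diagnostic reports.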

Show backlog between 2 members of a replication group:

dfsrdiag backlog /rgname:<REPL_GROUP> /rfname:<REPL_FOLDER> /smem:<SRV_A> /rmem:<SRV_B> [/v]

dfsrdiag backlog /rgname:<REPL_GROUP> /rfname:<REPL_FOLDER> /smem:<SRV_B> /rmem:<SRV_A> [/v]
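With more than two members, checking the backlog in both directions for every pair gets tedious, so here is a minimal sketch that generates the commands. The helper names are hypothetical, and the output wording that parse_backlog_count looks for ("Backlog File Count: N" and "No Backlog") is an assumption about what dfsrdiag prints, so check it against your own servers before relying on it.

```python
import re
from itertools import permutations


def backlog_commands(rg_name, rf_name, members, verbose=False):
    """Build one dfsrdiag backlog command per ordered (sender, receiver) pair."""
    commands = []
    for sender, receiver in permutations(members, 2):
        cmd = ["dfsrdiag", "backlog",
               f"/rgname:{rg_name}", f"/rfname:{rf_name}",
               f"/smem:{sender}", f"/rmem:{receiver}"]
        if verbose:
            cmd.append("/v")
        commands.append(cmd)
    return commands


# Assumed wording of the dfsrdiag output; verify on your own system.
_COUNT = re.compile(r"Backlog File Count:\s*(\d+)")


def parse_backlog_count(output):
    """Return the backlog file count, 0 for 'No Backlog', or None if unrecognised."""
    match = _COUNT.search(output)
    if match:
        return int(match.group(1))
    if "No Backlog" in output:
        return 0
    return None
```

Each command list can be handed to subprocess.run on a server with the DFSR tools installed, and the captured output passed to parse_backlog_count.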

The ‘Replicate Now’ command, in the GUI or on the command line, kicks off replication again. It’s mostly meant for when you have a replication schedule and want to replicate outside of that schedule, but we can also use it simply to tell DFSR to start replicating again.

Within the DFSR GUI, choose the replication group, choose the ‘Connections’ tab, right-click the sending server and choose ‘Replicate Now’. Usually you have a specific server that’s authoritative, but you can choose the sending member to be whichever you believe is most up to date, and Microsoft’s black magic algorithm will attempt to resolve any file conflicts.

A huge problem in DFSR is when you have an issue with Conflicted, Deleted and PreExisting files. Thankfully, if you do get a conflict and file loss, the files become Deleted… and you can get them back. The MS blog below covers this, but a few years ago when I had to do it, it took a great deal of work. It’s not just about getting the files back – who’s going to know which one they were working on? Or which is the most up to date? You do end up needing end user involvement. For me personally, that meant making the recovered data available for a certain amount of time, with the users expressly informed that they needed to check it themselves, as there was just no way of doing that for them.

Restoring Conflicted, Deleted and PreExisting files with Windows PowerShell

DFSR Setup and considerations

DFSR is actually relatively easy to set up.

There’s no need for me to re-invent the wheel or explain in tiny detail, as most of it has all been done before.

So, to start

MS blog about how it can work for you.

DFS Replication in Windows Server 2012 R2

How to set it up

DFSR Setup with screenshots

Another MS blog about how, if you have a huge estate, you’d better use the DFSRADMIN command line! (Yeah, you’d better!)

DFS Replication and command line

 

DFSR Monitoring Script (with email!)

Ah, good old DFSR, with its highly complex management algorithm that is, at times, a Law Unto Itself. What do you mean that file is newer? I’m going to overwrite it with THIS one!

DFSR is also full of false truths and true lies. Event IDs that don’t tell you what’s wrong. Logs that make out that everything is broken when it isn’t… Weeks of overwritten data that no one knew was happening… But sadly, if you don’t have hardware replication, you probably use this. Don’t get me wrong, when it works properly, it’s great, but there’s sometimes quite a management overhead, plus a lot of time and experience involved when it needs to get fixed.

This link has an amazing script that was similar to part of my checks in a previous role, whereby I had a few scripts running as part of Daily Checks for the team, that reported on DFSR, Exchange and AD. It emailed pretty pictures and everything (you know how people love pretty pictures!) So as my first post towards Checks and DFSR management – the following link is fantastic.

DFSR Monitoring Script

I will say though, as part of any infrastructure related checks or notifications, it’s about what works for you and your team. I had a great Project Manager once who said that if you get the process right, then everything works. If the process fails and a person followed the process, then the process needs changing. Which stands to reason: you need a process that people can follow and that takes into consideration everything that needs to happen. There’s no point having a barrage of alerts sent to email if the person involved doesn’t read them, understand them and record somewhere that they’ve done all of the above.