As part of an overall SCCM health assessment, a client engaged WME to help troubleshoot image deployments that were taking over four hours. The IT folks said the majority of the time elapsed during the Install Applications step of the OSD task sequence, which was quickly confirmed to be the case. They informed me that the problem had started around the time they implemented the Cloud Management Gateway (CMG) and switched to HTTPS for all SCCM client communication. The HTTPS switch turned out to be a red herring that consumed some troubleshooting time: secure client-server communication was configured correctly and functioning normally, as was the CMG. As a reference point, the client upgraded from CB 1710 to CB 1803 during the engagement.
Now, we know that installing applications via task sequence can be problematic. The blogosphere has plenty of opinions about not using this method, with some suggesting that one should use packages/programs for task sequences and applications for everything else. That approach is not ideal in that it requires double work for any application one wishes to deploy during imaging. Another approach is to deploy apps to collections and allow them to install after a machine is imaged. This is a sound approach, but in this case the client wished to deploy certain apps during OSD, so a root cause analysis was necessary.
Live Microsoft Message Analyzer traces during image deployment did not reveal anything noteworthy, other than that there seemed to be little communication between the laptop being imaged and the distribution point while the Install Application steps were running (there were 17 apps in the task sequence). We did import the server’s certificate to decode frames, a process worthy of its own blog post, which thankfully others have already written. As is often the case, log files started to reveal some clues, but even then it can take a bit of deciphering and a secret decoder ring.
The CAS.log file showed a long list of content sources, only two of which referred to a distribution point; the rest were peers. This is not a bad thing in itself: it illustrates that peer caching is working as advertised. This screenshot of the peer cache dashboard in SCCM shows that a large percentage of content is being delivered by peers. The top application is obfuscated on purpose; the others are all Microsoft content.
The following snippet from the CAS.log file shows that there are 43 available sources for content. The first several are shown, and again, only the last two (not shown) reference a distribution point. All others are peers:
The ContentTransferManager.log file illustrated what was occurring under the covers: the client was attempting to connect to each source in the DP list provided by the management point (shown in the CAS.log file) in sequence but was timing out on each until finally landing on the one good source: the distribution point. This was taking upwards of 15 minutes per application, not counting actual installation time: 17 apps times 15 minutes = 4.25 hours.
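The arithmetic above can be checked with a minimal sketch. The function names are made up for illustration. The second function assumes, as a simplification, that the roughly 15 minutes per application is spread evenly across the 41 unreachable peers listed ahead of the distribution point (43 sources minus the 2 DP entries); the real per-source timeout was not measured this way.

```python
def content_lookup_overhead_hours(app_count: int, minutes_per_app: float) -> float:
    """Total time spent waiting on content sources, in hours (install time excluded)."""
    return app_count * minutes_per_app / 60


def implied_seconds_per_failed_source(minutes_per_app: float, failed_sources: int) -> float:
    """Average delay per unreachable source, ASSUMING the wait is spread evenly.

    This is an illustrative simplification, not a documented client timeout.
    """
    return minutes_per_app * 60 / failed_sources


if __name__ == "__main__":
    # 17 apps at roughly 15 minutes each, as observed in the logs
    print(content_lookup_overhead_hours(17, 15))  # 4.25

    # 43 sources minus the 2 distribution point entries leaves 41 peers
    print(round(implied_seconds_per_failed_source(15, 43 - 2), 1))  # 22.0
```

Under that even-spread assumption, each dead peer costs a little over 20 seconds, which adds up quickly when dozens of them are tried before the distribution point.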
Note the start time in the first entry and the completion time in the last entry below: some 15 minutes elapse between the first connection attempt and the successful completion of content transfer from the DP.
In short, peer caching was the culprit, or rather the fact that peers listed in the SuperPeers table were not available, yet the client insisted on trying to use them for content. Some quick DNS queries to identify the subnets that the peers were located on revealed something interesting: a number of the peers were VPN clients connecting from elsewhere via the Internet! (Excluding VPN clients from peer caching was included in the recommendations to the customer.) This is not good, but what was also interesting was that none of the 41 peers, even the ones on the same local network, appeared to be communicating. Indeed, we were seeing connection errors in other logs.
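The subnet check described above can be sketched as follows. This is only an illustration: the two networks and the sample addresses are hypothetical placeholders, not the client's real address plan, and `resolve` simply wraps the standard library's `socket.gethostbyname` for the DNS lookup step.

```python
import ipaddress
import socket
from collections import Counter

# HYPOTHETICAL networks for illustration; substitute the real address plan.
LOCAL_SITE = ipaddress.ip_network("10.20.0.0/16")
VPN_POOL = ipaddress.ip_network("172.16.50.0/24")


def resolve(hostname: str) -> str:
    """Resolve a peer hostname to an IPv4 address via DNS."""
    return socket.gethostbyname(hostname)


def classify_peer(ip: str) -> str:
    """Label a peer address as 'vpn', 'local', or 'other' by subnet membership."""
    addr = ipaddress.ip_address(ip)
    if addr in VPN_POOL:
        return "vpn"
    if addr in LOCAL_SITE:
        return "local"
    return "other"


def summarize(peer_ips: list[str]) -> dict[str, int]:
    """Count how many peers fall into each category."""
    return dict(Counter(classify_peer(ip) for ip in peer_ips))


if __name__ == "__main__":
    # Made-up addresses standing in for resolved peer hostnames
    sample = ["10.20.4.17", "172.16.50.23", "172.16.50.99", "192.168.1.5"]
    print(summarize(sample))
```

Any peer landing in the VPN pool is a candidate for exclusion from peer caching, in line with the recommendation made to the customer.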
Given that we had to provide a solution in a limited amount of time, the client opted to disable peer caching for the time being to see if things improved. Alas, things did not improve! Why?
Answer: disabling peer caching in the SCCM client settings does not switch it off immediately. Clients must notify the management point to remove them as a content source. Until then, the SuperPeers table in the database is still populated and clients will continue to attempt to use peers for content. A quick SQL query revealed 1,143 entries in the SuperPeers table in the CM database.
As is often the case, I am exceedingly grateful to my esteemed colleagues in the blogosphere, the majority of whom I have never met. In this case, the blog post "The strange case of Peer Cache not getting disabled" details how to create a device collection with an associated script that tells clients to notify the management point to remove them as a content location. That would be the recommended and least risky approach. The same post also shows a quick and dirty approach to purging the entries from the two tables in question: SuperPeerContentMap and SuperPeers.
WARNING: insert whatever “use at your own risk” verbiage works for you here. Oh, and before you do *anything* directly in the database, EVER, *always* perform a backup (yes, a real-time backup right before you start mucking around in there).
I can neither confirm nor deny which approach was taken in this case. I can, however, confirm that the result was success! The slowness issue disappeared and the OSD task sequence buzzed through application installations as expected.
Now we cannot just leave things hanging with peer caching disabled, so check back for part 2 of this series on peer caching for recommendations on optimizing peer caching and when not to use it.