Exchange Server 2016 Database Availability Group

This article is part of a series based on my notes from migrating an Exchange organization from version 2010 to 2016. It is not a complete procedure, but a description of the main points and areas. The examples relate to a specific design, but most of them apply generally. Although the series describes a migration, the information is equally useful for a new installation or for day-to-day administration.

I described the older version of the DAG in the article Exchange 2010 CAS Array and DAG between Sites.

Planning Topology and DAG Installation

Before we start describing what a Database Availability Group is, how it works and how it is configured, let's build on the previous parts and create an example topology. When planning service addresses (Namespaces) in the first part, we created tables and schemas for the planned deployment. If we decide to use a DAG, we should add more networks and addresses. We may also need to plan the databases and where they will be primarily active.

Exchange Database Availability Group topology

General Deployment Process

Prepare the individual servers, most likely VMs, and set the network addresses
Install the Exchange Mailbox role on the servers and perform basic configuration
Create the DAG (Database Availability Group)
Add the individual Mailbox servers to the DAG (as Nodes - Member Servers)
(Optionally) Remove the old databases and create new mailbox databases
Add database copies to the databases on the other servers

DAG - Database Availability Group

Documentation: Database availability groups, Manage database availability groups, Plan for high availability and site resilience

A DAG ensures high availability of mailbox databases by creating copies (Mailbox Database Copies) on other servers (DAG members) and keeping them up to date through continuous replication. One copy (on one server) is active (Mounted) and the others are passive. If the server hosting the active copy has a problem, a copy on another server is activated automatically, which is called a failover. Manual activation of a database copy is called a switchover. Monitoring and switching are handled by the Active Manager component (part of the Exchange Replication service). A DAG is a group of up to 16 Mailbox servers; all of them must run the same Exchange version (so 2016 and 2010 cannot be mixed) and be in the same domain. After a switchover or failover, clients are redirected to the newly active copy almost immediately.

Starting with Exchange 2013 SP1 (CU4) on Windows Server 2012 R2 or later, we can create a DAG without an IP address, in other words, without a cluster administrative access point (CAAP). In that case no IP address is assigned, no network name or DNS record is created, and no Cluster Name Object (CNO) is created in AD DS. This simplifies the cluster solution overall. On the other hand, the cluster can no longer be managed with the Failover Cluster Management tool, only with PowerShell (the Exchange Management Shell, or the Failover Cluster cmdlets run directly on a DAG member).

This has a significant impact on backups, because some tools use the CAAP and cannot back up a DAG without it. It is therefore necessary to check the backup software first. Microsoft SCDPM 2012 R2 UR9 supports backing up Exchange 2016, including a DAG without an IP address; see DPM can now backup Exchange 2016.

Microsoft recommends using a DAG without an administrative access point (for Exchange 2016 or 2019 on at least Windows Server 2012 R2). An access point is needed only if you use a third-party application that connects to the cluster/DAG; it is not needed for normal operation.

How DAG Works

DAG uses continuous replication (Continuous Replication - block mode or file mode) and part of the Windows Failover Clustering technology to ensure high availability and site resilience.

DAG uses the Windows Failover Cluster, which works on the principle of quorum. If the servers are in multiple locations and the connection between them fails, quorum ensures that only one side keeps running: based on the consensus of the voters, only one group of cluster members stays active. If the DAG has an even number of members, a Quorum Witness Resource is required to prevent split brain syndrome. DAG members communicate with the Witness Server and can lock the witness.log file over SMB. The DAG member that holds the lock is called the Locking Node; it has an extra vote, so the members that can communicate with it form the majority.

Depending on the number of cluster nodes, the Quorum Model is used:

Node Majority - odd number
Node and File Share Majority - even number
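
To make the vote counting concrete, here is a minimal sketch in plain PowerShell. The function name Get-DagQuorumSummary is hypothetical (it is not an Exchange or cluster cmdlet), and the model is simplified: it only applies the odd/even rule described above and a simple majority of the votes.

```powershell
# Hypothetical helper: models the odd/even quorum rule for a DAG of 1 to 16 members.
function Get-DagQuorumSummary {
    param(
        [Parameter(Mandatory = $true)]
        [ValidateRange(1, 16)]
        [int]$MemberCount
    )

    # An even number of members gets one extra vote from the witness.
    $witnessVotes = 0
    $model = 'Node Majority'
    if ($MemberCount % 2 -eq 0) {
        $witnessVotes = 1
        $model = 'Node and File Share Majority'
    }

    $totalVotes = $MemberCount + $witnessVotes

    [pscustomobject]@{
        Members        = $MemberCount
        QuorumModel    = $model
        TotalVotes     = $totalVotes
        # A majority is more than half of all votes.
        VotesForQuorum = [int][math]::Floor($totalVotes / 2) + 1
    }
}

# Compare a three-member and a four-member DAG.
3, 4 | ForEach-Object { Get-DagQuorumSummary -MemberCount $_ }
```

On a real DAG member, the quorum model that is actually in use can be read with Get-ClusterQuorum from the Failover Cluster PowerShell module.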

Creating a DAG

We need to prepare a Witness server that will host a shared folder (a second, alternate Witness server can also be defined). With 3 Mailbox servers (an odd number) the Witness is not used, but we still have to specify it when creating the DAG, in case the number of DAG members changes.

Note: In Exchange 2016, everything around the DAG can be managed using the Exchange Management Shell, but many things can also be done in the Exchange Admin Center. In the article, we often only mention one option.

We will create a DAG without an IP address. The DAG always needs a unique name. Creating it is simple.

EAC - Exchange Admin Center
Servers - Database Availability Groups
Click the plus - New
Enter the name, server address, and Witness folder, do not enter any IP address
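
The same step can be done in the Exchange Management Shell. This is a sketch only: the witness server name FS1 and the witness directory path are hypothetical placeholders. Passing [System.Net.IPAddress]::None as the DAG IP address is the documented way to create a DAG without an administrative access point.

```powershell
# Create a DAG without an IP address (no cluster administrative access point).
# FS1 and C:\DAGWitness\MailDAG are placeholders for your own witness server and folder.
New-DatabaseAvailabilityGroup -Name MailDAG `
    -WitnessServer FS1 `
    -WitnessDirectory C:\DAGWitness\MailDAG `
    -DatabaseAvailabilityGroupIpAddresses ([System.Net.IPAddress]::None)
```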

Note: By default, DAG uses encryption and compression for network traffic between subnets. We can change this setting in the DAG configuration. Exchange servers use Kerberos authentication between each other.
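
A short sketch of how these two settings can be inspected and changed in the Exchange Management Shell. The value Enabled is chosen only for illustration; the default for both parameters is InterSubnetOnly, and Disabled and SeedOnly are the other accepted values.

```powershell
# Show the current compression and encryption settings of the DAG.
Get-DatabaseAvailabilityGroup -Identity MailDAG |
    Format-List Name, NetworkCompression, NetworkEncryption

# Example change: compress and encrypt all DAG traffic, not only traffic between subnets.
Set-DatabaseAvailabilityGroup -Identity MailDAG -NetworkCompression Enabled -NetworkEncryption Enabled
```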

Adding a Server to the DAG

We will gradually add the first, and then the other members of the DAG, i.e. the individual Mailbox servers.

EAC - Exchange Admin Center
Servers - Database Availability Groups
Select our DAG and click the computer icon with a gear - Manage DAG membership
In the new window, click the plus - Add
Add the server and confirm Save

Adding a server to the DAG (Database Availability Groups)
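
In the Exchange Management Shell, adding members could look like the following sketch, which uses the DAG and server names from this article. The members are added one at a time.

```powershell
# Add the first member; this also creates the underlying failover cluster.
Add-DatabaseAvailabilityGroupServer -Identity MailDAG -MailboxServer MAIL1

# Add further members only after the previous change has replicated in AD.
Add-DatabaseAvailabilityGroupServer -Identity MailDAG -MailboxServer MAIL2
```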

When adding the first server, the following happens

installation of the Windows Failover Clustering component (if not already installed)
a Failover Cluster for the DAG with the specified name is created
the server is added to the DAG object in AD DS
the cluster database is updated with information about the databases that are mounted on the added server

Because we have a DAG without an IP address, the following steps are skipped

creating a Cluster Name Object (CNO) in the default Computers container of AD DS
registering the DAG/cluster name in DNS
assigning a network name to the cluster

Note: Before adding the second server to the DAG, I have to wait for the AD replication to complete.

When adding the second server to the DAG, the following happens

the server is added to the Failover Cluster
the witness directory and share are created
the server is added to the DAG object in AD DS
the cluster database is updated with information about the databases that are mounted on the added server

Checking the DAG

After adding a server to the DAG, it is necessary to perform a check, because everything may seem fine, but the server may not be added correctly and may not be active. For the main check, you can use the Exchange Management Shell, Failover Cluster PowerShell, or the Failover Cluster command line.

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity MailDAG -Status

Name             Member Servers          Operational Servers
----             --------------          -------------------
MailDAG          {MAIL2, MAIL1, MAIL3}   {MAIL1, MAIL2, MAIL3}

[PS] C:\>Get-Cluster

Name
----
MailDAG

[PS] C:\>Get-ClusterNode

Name            ID    State
----            --    -----
mail1           1     Up
mail2           2     Up
mail3           3     Up

[PS] C:\>cluster node
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
mail1                1 Up
mail3                2 Up
mail2                3 Up

We need to verify that the individual member servers (member / node) are up and operational. This state can also be seen in the Exchange Admin Center if you edit the DAG. In PowerShell, we can display more details.

Get-DatabaseAvailabilityGroup -Identity MailDAG -Status | FL *

We can also run a test of the overall health of the DAG.

Test-ReplicationHealth
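
To check every DAG member in one go and show only the checks that did not pass, a loop like the following can be used. It is a sketch: it assumes that the Result column converts to the text Passed for successful checks, as the cmdlet output suggests.

```powershell
# Run the health test against each DAG member and list only non-passing checks.
$members = (Get-DatabaseAvailabilityGroup -Identity MailDAG).Servers

foreach ($member in $members) {
    Test-ReplicationHealth -Identity $member.Name |
        Where-Object { "$($_.Result)" -ne 'Passed' } |
        Format-Table Server, Check, Result, Error -AutoSize -Wrap
}
```

No output means that all checks passed on all members.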

Note: I could not add one of the servers to the DAG; either the addition returned an error, or the server did not work after a seemingly successful addition. I spent a long time debugging it, eventually created a new virtual machine and reinstalled Exchange, and everything has been fine since.

The log from adding a DAG member is in C:\ExchangeSetupLogs\DagTasks. The Event Viewer is also useful; operational information is logged under Applications and Services Logs - Microsoft - Exchange - HighAvailability - Operational. In my case the problem was in the Failover Cluster itself: the Cluster Service (ClusSvc) was Disabled and could not be started, and the HKEY_LOCAL_MACHINE\Cluster registry key was incomplete. The cluster files are in C:\Windows\Cluster.
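
The places mentioned above can also be checked quickly from PowerShell on the affected server. A minimal sketch using only built-in Windows cmdlets:

```powershell
# Most recent DAG task logs.
Get-ChildItem C:\ExchangeSetupLogs\DagTasks |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 5 Name, LastWriteTime

# Latest entries from the HighAvailability operational log.
Get-WinEvent -LogName 'Microsoft-Exchange-HighAvailability/Operational' -MaxEvents 20 |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap

# State and start mode of the Cluster service.
Get-CimInstance -ClassName Win32_Service -Filter "Name='ClusSvc'" |
    Format-List Name, State, StartMode

# Does the cluster registry key exist at all?
Test-Path HKLM:\Cluster
```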

DAG Network

A DAG network is a collection of one or more subnets used for replication or MAPI traffic. A DAG must have one MAPI network and may have several replication networks. Using a single network (single adapter) for both MAPI and replication is supported, but it is recommended to have at least two networks and to separate MAPI from replication.

The replication network should not have a default gateway configured, and it is recommended to disable Client for Microsoft Networks, File and Printer Sharing for Microsoft Networks, and Register this connection's addresses in DNS on it. All DAG members must have the same number of network adapters, and each adapter must have an IPv4 address configured (if a server has multiple adapters, each must have an address from a different subnet). Even if we disable replication on the MAPI network, the system can still use it for replication when the replication network is unavailable.

Microsoft states that in Exchange 2010 the networks often had to be configured manually, whereas now the system configures them automatically. To list the networks we can use the following cmdlet; when I ran it without a parameter, however, I got an error.

[PS] C:\>Get-DatabaseAvailabilityGroupNetwork

Could not load file or assembly 'Microsoft.Exchange.Data, Version=14.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
 or one of its dependencies. The system cannot find the file specified.

The solution is to call it with the -Server parameter and specify the name of an Exchange 2016 server.

[PS] C:\>Get-DatabaseAvailabilityGroupNetwork -Server mail1

Identity                         ReplicationEnabled Subnets
--------                         ------------------ -------
MailDAG\MapiDagNetwork           True               {{10.0.0.0/24,Up}}
MailDAG\ReplicationDagNetwork01  True               {{192.168.0.0/24,Up}}

The system usually creates (reasonably) the MapiDagNetwork and ReplicationDagNetwork01 networks if we have two adapters. On MapiDagNetwork, it enables both MAPI access and Replication, on ReplicationDagNetwork01 only Replication. If we want to make any manual changes to the networks or create networks manually, we need to enable manual configuration on the DAG (can also be done in EAC).

Set-DatabaseAvailabilityGroup MailDAG -ManualDagNetworkConfiguration $true

We probably want to disable replication on the MAPI network

Set-DatabaseAvailabilityGroupNetwork -Identity MailDAG\MapiDagNetwork -ReplicationEnabled:$false

When we add a server from a different Site, and thus a different subnet for MAPI, it is marked as a multi-subnet DAG. In contrast to Exchange 2010, Exchange 2016 can correctly automatically add subnets to the networks.

[PS] C:\>Get-DatabaseAvailabilityGroupNetwork -Server mail1

Identity                         ReplicationEnabled Subnets
--------                         ------------------ -------
MailDAG\MapiDagNetwork           False              {{10.0.0.0/24,Up}, {10.10.0.0/24,Up}}
MailDAG\ReplicationDagNetwork01  True               {{192.168.0.0/24,Up}}

We can now manage the DAG Network from the EAC as well.

EAC - Exchange Admin Center
Servers - Database Availability Groups
Add network - select our DAG and click the computer icon with the plus - New DAG network
Edit existing - select our DAG and on the right you see the DAG Network with a list and actions
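
Creating an additional DAG network in the Exchange Management Shell could look like this sketch. The network name, description and subnet are hypothetical; manual network configuration has to be enabled on the DAG first.

```powershell
# Manual changes require manual DAG network configuration.
Set-DatabaseAvailabilityGroup MailDAG -ManualDagNetworkConfiguration $true

# Hypothetical second replication network on its own subnet.
New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup MailDAG `
    -Name BackupReplNetwork `
    -Description 'Second replication network' `
    -Subnets 192.168.1.0/24 `
    -ReplicationEnabled:$true
```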

Managing Mailbox Database Copies

Documentation: Manage mailbox database copies.

After we have created the DAG and added the member servers, we can create synchronized copies of the existing databases (Mailbox DB) and work with them.

Creating a Database Copy within the DAG

When we create a database copy, a passive copy of the DB is created on the selected server and continuous replication is started. Database copies are identified in the format Database\Server, for example DB1\MAIL1. A DB copy uses the same paths for the data files and logs as the original, so the same drives and paths must exist on all servers that host a copy. Circular Logging must be disabled on the database before a copy can be added.

EAC - Exchange Admin Center
Servers - Databases
Select the DB, click the three dots - More - Add database copy
Select the server where you want to place the copy
Optionally, we can adjust the copy priority - Activation preference number
Another option is delayed log replay (a lagged copy) - Replay lag time

Of course, we can perform the copy addition in the Exchange Management Shell.

Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MAIL1 -ActivationPreference 2
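
A slightly longer sketch that first checks the prerequisites described above and then adds a lagged copy. The target server MAIL3, the preference 3 and the one-day lag are example values only.

```powershell
# Check circular logging and the paths that must also exist on the target server.
Get-MailboxDatabase DB1 | Format-List Name, CircularLoggingEnabled, EdbFilePath, LogFolderPath

# Disable circular logging if needed. On a database without copies the change
# takes effect only after the database is dismounted and mounted again.
Set-MailboxDatabase DB1 -CircularLoggingEnabled $false

# Add a copy on MAIL3 that replays logs with a delay of one day (format dd.hh:mm:ss).
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MAIL3 -ActivationPreference 3 -ReplayLagTime 1.00:00:00
```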

Checking the Status of Database Copies

Directly in the list of databases in the EAC, we see the list and status of the individual copies.

EAC - Exchange Admin Center
Servers - Databases
Select the DB and on the right you can see the information, including the copies
For a copy, we can use View Detail

Similarly, we can list them in PowerShell. Here you can see that the initial synchronization is in progress after being added to the MAIL3 server.

[PS] C:\>Get-MailboxDatabaseCopyStatus DB1

Name                Status          CopyQueue  ReplayQueue LastInspectedLogTime  ContentIndex
                                    Length     Length                            State
----                ------          ---------  ----------- --------------------  ------------
DB1\MAIL1           Mounted         0          0                                 Healthy
DB1\MAIL2           Healthy         0          0           15.11.2018 18:54:13   Healthy
DB1\MAIL3           Resynchronizing 28083      0                                 Suspended
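
While the initial seeding runs, the copy can be polled until it becomes Healthy. The following is a sketch with an arbitrary limit of 60 checks, one minute apart; the copy name is the one from the output above.

```powershell
# All copies hosted on one server, with the most useful columns.
Get-MailboxDatabaseCopyStatus -Server MAIL3 |
    Sort-Object Name |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

# Poll a single copy until it reports Healthy (at most 60 checks).
$copy = 'DB1\MAIL3'
for ($i = 0; $i -lt 60; $i++) {
    $s = Get-MailboxDatabaseCopyStatus -Identity $copy
    '{0:T}  {1}  copy queue {2}' -f (Get-Date), $s.Status, $s.CopyQueueLength
    if ("$($s.Status)" -eq 'Healthy') { break }
    Start-Sleep -Seconds 60
}
```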

Managing Database Copies, Switchovers

We can perform various operations with the database copies. The most common is probably a manual switchover to a passive copy, i.e. activating a copy (Activate Database Copy), which Microsoft calls Database switchovers (Switchovers and failovers). The active DB copy is dismounted on one server and the passive copy is mounted as the new active on another. Other options are removal (Remove) of the DB copy, suspension (Suspend) and resumption (Resume), and update (Seed, Update).

EAC - Exchange Admin Center
Servers - Databases
Select the DB, on the right you can see the copies and the available commands

Activating a DB copy in the Exchange Management Shell offers various switches, and sometimes they are needed.

Move-ActiveMailboxDatabase DB1 -ActivateOnServer MAIL2 -SkipMoveSuppressionChecks
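
The other operations on a database copy have their own cmdlets. The lines below are independent examples, not a sequence to run as a whole; the copy DB1\MAIL3 and the comment text are placeholders.

```powershell
# Suspend replication for a passive copy.
Suspend-MailboxDatabaseCopy -Identity DB1\MAIL3 -SuspendComment 'Maintenance on MAIL3' -Confirm:$false

# Resume replication for a suspended copy.
Resume-MailboxDatabaseCopy -Identity DB1\MAIL3

# Reseed a suspended copy from scratch; existing files on the target are deleted.
# Replication resumes automatically when the seeding finishes.
Update-MailboxDatabaseCopy -Identity DB1\MAIL3 -DeleteExistingFiles

# Remove a passive copy (the database and log files stay on disk and must be deleted manually).
Remove-MailboxDatabaseCopy -Identity DB1\MAIL3 -Confirm:$false
```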

Distribution of Active DB Copies on Servers, DAG Failback

When we have multiple Mailbox servers and a DAG, we have solved High Availability and Fault Tolerance, because if a server error occurs, the data on the other is activated. But it would be a shame to use the servers in an Active-passive style, where one performs all the work and the others only synchronize the data. Therefore, we can use Active-active (Load Balancing), where all servers are active and serve clients. We achieve this by creating multiple mailbox DBs and setting them active on different servers (and using the Activation preference number).

After a planned (switchover) or unplanned (failover) activation of database copies, the distribution of active DB copies across the servers is unbalanced compared to the initial state. As with Exchange 2010, we can use the RedistributeActiveDatabases.ps1 script, which activates the DBs on the servers according to the Activation Preference number. From Exchange Server 2016 CU2 onwards this is no longer necessary, because the DAG rebalances the active copies automatically at the interval given by its PreferenceMoveFrequency property (default 1 hour); we can display it with Get-DatabaseAvailabilityGroup | FT Name,PreferenceMoveFrequency.
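
A sketch of both approaches in the Exchange Management Shell. The two-hour interval is only an example value.

```powershell
# Where is each database active now, and what are the activation preferences?
Get-MailboxDatabase | Sort-Object Name | Format-Table Name, Server, ActivationPreference -AutoSize

# Manual rebalancing with the bundled script.
cd $ExScripts
.\RedistributeActiveDatabases.ps1 -DagName MailDAG -BalanceDbsByActivationPreference -Confirm:$false

# Automatic rebalancing (2016 CU2 and later): change the interval to two hours ...
Set-DatabaseAvailabilityGroup -Identity MailDAG -PreferenceMoveFrequency 02:00:00

# ... or switch it off completely.
Set-DatabaseAvailabilityGroup -Identity MailDAG -PreferenceMoveFrequency ([System.Threading.Timeout]::InfiniteTimeSpan)
```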

Maintenance of a Server in the DAG

Before starting any maintenance on the server or installing updates, it is recommended to put the server into Maintenance Mode. This moves the active databases and roles to the other servers. For a detailed description, see Performing maintenance on DAG members. We can use the scripts that are part of the Exchange server installation.

cd $ExScripts
.\StartDagServerMaintenance.ps1 -ServerName <ServerName>
.\StopDagServerMaintenance.ps1 -ServerName <ServerName>
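
After running the start script, it is worth verifying that the server really is out of service. A sketch, assuming MAIL1 is the server in maintenance; as far as I know, the script should leave no Mounted copies on it, set the activation policy to Blocked and pause the cluster node.

```powershell
# No copy on the server should be Mounted any more.
Get-MailboxDatabaseCopyStatus -Server MAIL1 | Format-Table Name, Status, ActivationSuspended -AutoSize

# Automatic activation should be blocked for this server.
Get-MailboxServer MAIL1 | Format-List Name, DatabaseCopyAutoActivationPolicy

# The cluster node should be paused.
Get-ClusterNode MAIL1
```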

Server Switchover

Using the EAC, we can easily perform a Server Switchover, which activates the DB copies on another server. This is recommended to be done before restarting any Exchange server that is a member of the DAG.

EAC - Exchange Admin Center
Servers - Servers
Select the server from which you want to move the active DB, and on the right click Server Switchover
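
The shell equivalent of a server switchover is a single cmdlet call. A sketch with MAIL1 as the server to be emptied; Active Manager picks the target copies itself.

```powershell
# Move all active database copies away from MAIL1.
Move-ActiveMailboxDatabase -Server MAIL1 -Confirm:$false

# Verify that nothing is mounted on MAIL1 any more (no output expected).
Get-MailboxDatabaseCopyStatus -Server MAIL1 | Where-Object { "$($_.Status)" -eq 'Mounted' }
```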

Datacenter Activation Coordination (DAC) mode

DAC mode is disabled by default, but Microsoft recommends enabling it. It is meant for DAGs that span multiple datacenters and prevents Split Brain, i.e. two copies of the same DB being active at the same time. When a server starts, DAC checks whether it is allowed to mount its databases, in case a copy has meanwhile been activated in the other datacenter. For example, a power outage in the primary datacenter takes down all servers there, so the backup datacenter and its DB copies are activated. When power is restored in the primary datacenter, the servers start up but have no communication with the backup datacenter. Without DAC they would mount their DB copies as well. More information: Datacenter Activation Coordination mode.

Set-DatabaseAvailabilityGroup -Identity MailDAG -DatacenterActivationMode DagOnly

If we use the DAC mode, we can use the commands to switch datacenters.

Stop-DatabaseAvailabilityGroup
Restore-DatabaseAvailabilityGroup
Start-DatabaseAvailabilityGroup
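
As a rough sketch of how these three cmdlets fit together: the example assumes that DAC mode is enabled, that the primary site has failed completely, and that the AD site names PrimarySite and BackupSite are hypothetical placeholders. Check Microsoft's datacenter switchover documentation before using anything like this in production.

```powershell
# 1. Mark the DAG members in the failed site as stopped. -ConfigurationOnly updates
#    only Active Directory, which is what we need when the servers are unreachable.
Stop-DatabaseAvailabilityGroup -Identity MailDAG -ActiveDirectorySite PrimarySite -ConfigurationOnly

# 2. On each surviving DAG member in the backup site, stop the Cluster service.
Stop-Service -Name ClusSvc

# 3. Activate the DAG members in the backup site.
Restore-DatabaseAvailabilityGroup -Identity MailDAG -ActiveDirectorySite BackupSite

# 4. Later, when the primary site is back and connectivity is restored,
#    bring its members back into the DAG.
Start-DatabaseAvailabilityGroup -Identity MailDAG -ActiveDirectorySite PrimarySite
```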

Client Access Server (CAS) Array

Previously, a DAG and CAS Array were used for high availability. There was also a question of whether we had only one location/datacenter/Site or multiple. In the event of a site failure, the CAS Array did not work for automatic switchover. Exchange 2016 no longer uses the CAS Array at all.

The CAS Array was used for Outlook MAPI/RPC client connections to the Client Access Server. In Exchange 2016, direct MAPI/RPC is no longer used; clients connect with either MAPI/HTTP (MAPI over HTTP) or Outlook Anywhere (MAPI/RPC over HTTP).

On the databases, we can list the RpcClientAccessServer information, which is either the server address or the CAS Array address if we used it. Even after removing all Exchange 2010 servers, this attribute remains set, but it doesn't matter because it is not used.

Get-MailboxDatabase | Select Name,RpcClientAccessServer

Site Resilience

Site resilience means a DAG that spans multiple sites / datacenters. Documentation: Site resilience. The topic is related to the description of Namespaces in the first article of the series.

Compared to Exchange 2010, Exchange 2016 brings a change: automatic failover to servers in another location (datacenter/Site) now works. Previously, the namespace had to be changed manually. Client connections now use HTTPS (even from Outlook), and a name can resolve to multiple IP addresses, which the client tries in succession. Direct MAPI connections are no longer allowed. In Exchange 2016, the client can connect to any Mailbox server, and the request is proxied to the server that hosts the active database copy for the given mailbox. We can therefore use DNS round robin, i.e. create one DNS record that contains the IP addresses of all servers.

Troubleshooting and Errors

Repairing DAG Networks, Removing Non-existent

Networks and subnets are added automatically, but they are not removed when we remove servers that were in a different location. It's good to occasionally run a test and make repairs if necessary. Modifications can be made using the Exchange Management Shell or in the EAC, where it's easier to edit the subnet.

[PS] C:\>Test-ReplicationHealth

Server          Check                      Result     Error
------          -----                      ------     -----
MAIL1           ClusterService             Passed
MAIL1           ReplayService              Passed
MAIL1           ActiveManager              Passed
MAIL1           TasksRpcListener           Passed
MAIL1           TcpListener                Passed
MAIL1           ServerLocatorService       Passed
MAIL1           DagMembersUp               Passed
MAIL1           MonitoringService          Passed
MAIL1           ClusterNetwork             *FAILED*   Subnet '10.0.0.0/24' on network 'MapiDagNetwork' is not Up. 
  Current state is 'Misconfigured'.
                                                      Subnet '10.0.0.0/24' on network 'MapiDagNetwork' is not Up.
  Current state is 'Misconfigured'.
                                                      Subnet '10.10.0.0/24' on network 'MapiDagNetwork' is not Up.
  Current state is 'Misconfigured'.
                                                      Subnet '10.10.0.0/24' on network 'MapiDagNetwork' is not Up.
  Current state is 'Misconfigured'.
MAIL1           QuorumGroup                Passed
MAIL1           FileShareQuorum            Passed
MAIL1           DatabaseRedundancy         Passed
MAIL1           DatabaseAvailability       Passed
MAIL1           DBCopySuspended            Passed
MAIL1           DBCopyFailed               Passed
MAIL1           DBInitializing             Passed
MAIL1           DBDisconnected             Passed
MAIL1           DBLogCopyKeepingUp         Passed
MAIL1           DBLogReplayKeepingUp       Passed

The test shows an error on the MAPI network (it even says it's not up, but at the moment it was functioning normally). So let's list the DAG Network.

[PS] C:\>Get-DatabaseAvailabilityGroupNetwork

Identity                         ReplicationEnabled Subnets
--------                         ------------------ -------
MailDAG\MapiDagNetwork           False              {{10.0.0.0/24,Misconfigured}, {10.10.0.0/24,Misconfigured}}
MailDAG\ReplicationDagNetwork01  True               {{192.168.0.0/24,Up}}
MailDAG\ReplicationDagNetwork02  False              {{fe80::/64,Misconfigured}}

For the MapiDagNetwork and ReplicationDagNetwork02 networks, we see an error, the status is Misconfigured. The MAPI network reports an error because the only server in the second location (network 10.10.0.0) was removed. So we need to edit the network and remove the unused subnet.

EAC - Exchange Admin Center
Servers - Database Availability Groups
Select our DAG, on the right you see the DAG Network where you can find the desired network
Click View details
Here you can see the assigned subnets, you can edit, remove and add them

Note: Removing a subnet or network takes a while to take effect. The subnet/network does not disappear until a few minutes later.
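
The same subnet cleanup can be sketched in the Exchange Management Shell. The -Subnets parameter replaces the whole list, so we pass only the subnet we want to keep; the example assumes that 10.0.0.0/24 is the remaining MAPI subnet and that manual network configuration may still need to be enabled.

```powershell
# Manual changes to DAG networks require manual network configuration.
Set-DatabaseAvailabilityGroup MailDAG -ManualDagNetworkConfiguration $true

# Keep only the subnet that is still in use; the removed site's subnet is dropped.
Set-DatabaseAvailabilityGroupNetwork -Identity MailDAG\MapiDagNetwork -Subnets 10.0.0.0/24

# Check the result (the change may take a few minutes to show up).
Get-DatabaseAvailabilityGroupNetwork -Identity MailDAG\MapiDagNetwork | Format-List Identity, Subnets
```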

The ReplicationDagNetwork02 network was somehow created automatically for an unused IPv6 address. We need to remove the entire network. When we try, we get an error saying that we first need to remove its subnets.

[PS] C:\>Remove-DatabaseAvailabilityGroupNetwork MailDAG\ReplicationDagNetwork02

DatabaseAvailabilityGroupNetwork 'ReplicationDagNetwork02' can't be removed because it contains active subnets. To remove
 the network, all active subnets must first be assigned to other networks. If you want to disable the use of the network,
 consider using the Set-DatabaseAvailabilityGroupNetwork cmdlet with the -IgnoreNetwork or -ReplicationEnabled parameters.
  + CategoryInfo          : InvalidArgument: (:) [Remove-DatabaseAvailabilityGroupNetwork], DagNetworkManagementException
  + FullyQualifiedErrorId : [Server=MAIL1,RequestId=46104fdf-ef9f-4c67-b991-7dba758fe4f9,TimeStamp=19.02.2019 18:16:28]
 [FailureCategory=Cmdlet-DagNetworkManagementException] 7B8813CC,Microsoft.Exchange.Management.SystemConfigurationTasks.
RemoveDatabaseAvailabilityGroupNetwork
  + PSComputerName        : mail1.firma.local

[PS] C:\>Set-DatabaseAvailabilityGroupNetwork MailDAG\ReplicationDagNetwork02 -Subnets $null

We remove all subnets, but when we immediately list the network information, we still see the subnets. We need to wait a few minutes and try again.

[PS] C:\>Get-DatabaseAvailabilityGroupNetwork MailDAG\ReplicationDagNetwork02

Identity                        ReplicationEnabled Subnets
--------                        ------------------ -------
MailDAG\ReplicationDagNetwork02 False              {}

Then the network removal works:

[PS] C:\>Remove-DatabaseAvailabilityGroupNetwork MailDAG\ReplicationDagNetwork02

Confirm
Are you sure you want to perform this action?
Removing database availability group network "MailDAG\ReplicationDagNetwork02".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Content Index in Failed State - DB Copy

It happens quite often that the status of a passive DB copy (in EMS or EAC) shows Content index state: Failed. While it is in this state, we cannot switch this copy to active. In my experience this occurs, for example, every time one of the servers is restarted. Repair instructions are described, for example, in Repairing a Failed Content Index in Exchange Server 2016. It turned out, however, that in most cases it is enough to wait a while (maybe 15 minutes) and the error disappears by itself.
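
If waiting does not help, the content index of a passive copy can be reseeded from a healthy copy. A sketch, using DB2\MAIL1 as an example copy name; whether the copy has to be suspended first may depend on the situation, so treat this as a starting point.

```powershell
# List all copies whose content index is not Healthy.
Get-MailboxDatabaseCopyStatus * |
    Where-Object { "$($_.ContentIndexState)" -ne 'Healthy' } |
    Sort-Object Name |
    Format-Table Name, Status, ContentIndexState -AutoSize

# Reseed only the content index catalog of one copy, not the database itself.
Update-MailboxDatabaseCopy -Identity DB2\MAIL1 -CatalogOnly
```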

Issue with Activating a Database Copy

I wanted to do a Database switchover, i.e. activate a DB copy on another server. The operation went through without any error being displayed. But when I looked at the database shortly after, it was active on the original server again. On another attempt, an error was displayed that there were too many logs to inspect and replay.

An Active Manager operation failed. Error: The database action failed. Error: An error occurred while trying to validate
 the specified database copy for possible activation. Error:

   mail1:
   Database copy 'DB2' on server 'mail1' has 124 logs to inspect and replay, which is higher than maximum allowed replay
 queue length of 10. If you need to activate this database copy, you can use the Move-ActiveMailboxDatabase cmdlet with
 the -SkipLagChecks and -MountDialOverride parameters to forcibly activate the database copy. If the database does not
 automatically mount after running Move-ActiveMailboxDatabase successfully, use the Mount-Database cmdlet to mount the
 database.

We can also display the current details on the status of the DB copies:

[PS] C:\>Get-MailboxDatabaseCopyStatus * | sort name | Select name, status, contentindexstate, ReplayQueueLength

Name         Status  ContentIndexState ReplayQueueLength
----         ------  ----------------- -----------------
DB2\MAIL1    Healthy            Failed                46
DB2\MAIL2    Mounted           Healthy                 0

Here we see the log queue (ReplayQueueLength) and also the Content Index state as Failed. When we wait for a while, the queue will empty and the state will return to Healthy. Then we can move (activate) the DB copy again, but it will end up the same way. After several attempts, the copy activation will no longer succeed and we'll get the error:

An Active Manager operation failed. Error: The database action failed. Error: Move for database 'DB2' was suppressed
 because too many moves have happened recently. 4 moves

If we still want to perform the move (activation) at this point, we need to use the Exchange Management Shell and add the -SkipMoveSuppressionChecks parameter:

Move-ActiveMailboxDatabase DB2 -ActivateOnServer MAIL1 -SkipMoveSuppressionChecks

But the copy still doesn't activate on the new server. When browsing the application event log on the target server, I found the cause:

EventID 121 ExchangeStoreDB
At '23.02.2019 18:45:40' the Exchange store database 'DB2' copy on this server appears to have failed due to insufficient
 memory. For more details about the failure, consult the Event log on the server for other "ExchangeStoreDb" events.
 Recovery was not attempted.

A user (administrator) was logged in on the server and was using about 1.5 GB of RAM. After logging the user out, everything started working again.
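
To spot this kind of problem faster next time, the event, the free memory and the logged-in sessions can be checked from PowerShell on the target server. A sketch; the provider name ExchangeStoreDB is taken from the event shown above, and the counter path assumes an English-language Windows installation.

```powershell
# Recent "insufficient memory" events from the Exchange store.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'ExchangeStoreDB'; Id = 121 } `
    -MaxEvents 5 -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Message

# Currently available memory.
Get-Counter '\Memory\Available MBytes'

# The five processes with the largest working set.
Get-Process |
    Sort-Object WorkingSet64 -Descending |
    Select-Object -First 5 Name, Id, @{ n = 'WorkingSetMB'; e = { [math]::Round($_.WorkingSet64 / 1MB) } }

# Interactive sessions on the server.
quser
```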