SQL Server 2012: Indirect Checkpoint (Target Recovery Time), get the real scoop

SQL Server 2012 brings a new feature called Indirect Checkpoint. You can read more about it here: http://msdn.microsoft.com/en-us/library/ms189573.aspx.

With Indirect Checkpoint, you get smaller and too many I/Os

Checkpoint normally writes large blocks of data to disk optimally in single write operation, up to 256KB, depending on the number of contiguous dirty pages in cache that needs to be flushed to disk. Once you turn on the indirect checkpoint feature by setting target recovery time to a non-zero value, the checkpoint writes will turn into single page writes, 8KB writes.

This will lead to too many I/Os. We all know few large I/Os are more efficient than many small I/Os. This may slow down overall write throughput and performance. If you are hosting your server on the cloud based virtual servers, this means, it will increase the cost as you pay by the number of I/O operations you incur.

Too many outstanding I/Os

Normally check point throttles the number of outstanding I/Os by setting threshold based on write latency in order not to overload the storage subsystem. But once you enable indirect checkpoint, SQL Server can queue up hundreds of thousands of I/Os. This will not only overload the storage subsystem, if users want to read data from the database, their I/O requests will be queued up way behind  the indirect checkpoint writes and they may experience very slow response time and timeouts.

You might also see such errors in the errorlog as large number of I/Os in the queue may take a long time to complete:

A time-out occurred while waiting for buffer latch — type 4, bp 0000000283508040, page 1:1080, stat 0x60d, database id: 16, allocation unit Id: 0/281474976710656, task 0x00000002321BC558 : 0, waittime 300 seconds, flags 0x1a, owning task 0x00000002321BC558. Not continuing to wait.

High CPU usage

When you enable indirect checkpoint, you might encounter high CPU usage. One of the CPUs that is performing the write might spin at 100%. This can be monitored by measuring DP_LIST spins as shown below, more than 4 billion spins were noticed by running the below test just once. Also the writers will use much more CPU.

select * from sys.dm_os_spinlock_stats where name = ‘DP_LIST’

You will notice the updates are much slower when indirect checkpoint is enabled.

In extreme cases when you try to restart the SQL Server service, it will take a long time to stop. Meanwhile you can neither connect nor start the service as the service is already in the process of stopping, flushing the dirty pages before shutdown.

You can use the below scripts to reproduce the scenario

use master

if  exists (select * from sys.databases where name = ‘sqlworkshops’)

drop database sqlworkshops

go

use master

create database sqlworkshops on primary (name = ‘sqlworkshops_1′, filename = ‘P:sqlworkshops_1.mdf’, size = 5GB)

log on (name = ‘sqlworkshops_log_1′, filename = ‘Q:sqlworkshops_log_1.ldf’, size = 64MB)

go

use master

alter database sqlworkshops set recovery simple

go

use sqlworkshops

if  exists (select * from sys.objects where name = ‘tab71′)

drop table tab71

go

use sqlworkshops

create table tab71 (c1 int primary key clustered, c2 int, c3 char(2000))

insert into tab71 with (tablock) select top 500000 row_number() over (order by c1.object_id, c2.o) as c1, 1 as c2, replicate(‘a’, 2000) as c3

from sys.all_columns c1, sys.all_columns c2

update statistics tab71 with fullscan

checkpoint

go

use sqlworkshops

if  exists (select * from sys.objects where name = ‘tab72′)

drop table tab72

go

use sqlworkshops

create table tab72 (c1 int primary key clustered, c2 int, c3 char(2000))

insert into tab72 with (tablock) select top 500000 row_number() over (order by c1.object_id) as c1, 1 as c2, replicate(‘a’, 2000) as c3

from sys.all_columns c1, sys.all_columns c2

update statistics tab72 with fullscan

checkpoint

go

 

–Session 1:

use sqlworkshops

update tab71 set c2 = 2

go

use sqlworkshops

update tab71 set c2 = 3

go

 

–Session 2 (optional):

use sqlworkshops

update tab72 set c2 = 2

go

use sqlworkshops

update tab72 set c2 = 3

go

 

–To monitor spins:

select * from sys.dm_os_wait_stats

where wait_type = ‘DIRTY_PAGE_POLL’

go