About R Meyyappan

Ramesh Meyyappan (www.sqlworkshops.com) is a SQL Server specialist with expertise in performance tuning. Ramesh worked at Microsoft Corporation from 1994 to 2004. Nearly half of that time he worked in Redmond on the development teams, specifically as a Program Manager in the SQL Server Development Team responsible for optimizing the SQL Server product for SAP. Ramesh now offers onsite and offsite consulting and workshops independently as well as by partnering with various Microsoft subsidiaries. LinkedIn Profile: http://de.linkedin.com/in/rmeyyappan.

Data Compression in SQL Server – Pros and Cons

SQL Server supports row and page compression on tables, indexes and partitions. This can lead to reduced I/O and better performance. However, it can also result in additional CPU usage that, in some cases, outweighs the benefits of data compression. The SQL Server query optimizer does not cost the overhead of expanding the compressed data, which can lead to plan regressions, as shown in the video.
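
For example, compression can be estimated and enabled per object; a minimal sketch, where the table and index names are only placeholders:

-- estimate the space savings before compressing
exec sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'SalesOrderDetail',
    @index_id = null,
    @partition_number = null,
    @data_compression = 'PAGE'
go
-- enable page compression on the table (row compression works the same way)
alter table dbo.SalesOrderDetail rebuild with (data_compression = page)
go
-- compress a single index
alter index IX_SalesOrderDetail_ProductID on dbo.SalesOrderDetail rebuild with (data_compression = row)
go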

The video below demonstrates this with a hands-on example. SQLTest Tool simulates the hands-on labs for you, no registration necessary. You can practice the online hands-on example while watching the video.

Check out http://www.sqlvideo.com for additional hands-on videos.

Subscribe to our newsletter: https://newsletter.sqlworkshops.com

SQL Server Command Timeout – Application Timeout – Attention

When you use ODBC or SqlClient to access data from SQL Server, by default the query will be cancelled if there is no response from the server within a certain period of time (30 seconds by default). ODBC or SqlClient starts a timer after sending the query to SQL Server, and if there are no results from the server within 30 seconds, the query is cancelled and a timeout error message is sent to the application. If a network packet (containing query results) is received within the 30 seconds, and additional network packets are expected (additional rows that couldn’t fit in the first network packet), then the timer is restarted; if the next network packet is not received within this 30-second period, the query is cancelled.

The common reasons for SQL Command timeout are non-optimal schema, inefficient queries that execute for a long period of time, missing indexes and lock waits. It is important to tune the queries rather than increase the SQL Command timeout setting.

In the video an example is demonstrated where decreasing the network packet size eliminates the timeout. This demonstration is meant to show how network packets are involved in resetting the timer and to build a better understanding of the timeout; changing the network packet size is not a solution to avoid timeouts. Tuning non-optimal schema, rewriting inefficient queries, adding indexes and avoiding lock waits are the proper solutions.

To identify the command that leads to the SQL Command timeout error, you can use Extended Events and monitor the event ‘sqlserver.attention’. The extended event script used in the video is available at www.sqlvideo.com/xevents.
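
A minimal sketch of such a session is shown below (the session name, actions and target are assumptions; the actual script used in the video is the one published at www.sqlvideo.com/xevents):

create event session MonitorAttention on server
add event sqlserver.attention
    (action (sqlserver.client_app_name, sqlserver.session_id, sqlserver.sql_text))
add target package0.event_file (set filename = N'MonitorAttention.xel')
go
alter event session MonitorAttention on server state = start
go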

In SQL Server Management Studio, the SQL Command timeout can be changed using ‘Options’ in the connection dialog. By default, this is set to ‘0’, which means no timeout. When a query or stored procedure is executing, clicking the ‘Cancel’ button (the red square) sends the same attention to SQL Server as a SQL Command timeout, and this will in turn generate the ‘sqlserver.attention’ event in Extended Events.

In the video below you can see an example of the SQL Command timeout. SQLTest Tool simulates the hands-on labs for you, no registration necessary. You can practice the online hands-on example while watching the video.

Check out http://www.sqlvideo.com for additional hands-on videos.

Subscribe to our newsletter: https://newsletter.sqlworkshops.com

Slow performance due to SQL Server 2014 Buffer Pool Extension (BPE) and serial queries (MAXDOP 1)

The SQL Server 2014 Buffer Pool Extension feature can extend the buffer pool space onto a Solid State Drive (SSD), where data and index pages can be cached. There is no possibility of data loss in the event of SSD failure as the cached data is clean (unmodified pages).
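
For reference, Buffer Pool Extension is enabled at the server level; a minimal sketch, where the file path and size are assumptions:

-- enable Buffer Pool Extension on an SSD volume
alter server configuration
set buffer pool extension on (filename = 'S:\SSDCACHE\bufferpoolextension.bpe', size = 16 GB)
go
-- check the current Buffer Pool Extension configuration
select * from sys.dm_os_buffer_pool_extension_configuration
go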

Buffer Pool Extension can provide better random I/O performance by using the SSD to decrease the latency of such I/O operations.
Buffer Pool Extension is not suitable for large sequential reads with a low degree of parallelism. When the required data pages are in BPE, a single thread performs only 1 I/O operation at a time, and that I/O operation is limited to 1 page (8KB), leading to too many small I/Os and decreased throughput. The wait type used while waiting for BPE I/Os to complete is ‘EC’.
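
One way to observe this behavior is to check how much of the cache sits in the extension and how much wait time accumulates on the ‘EC’ wait; a rough sketch using the buffer and wait DMVs:

-- cached pages per database, split between memory and the extension
select database_id, is_in_bpool_extension, count(*) as cached_pages
from sys.dm_os_buffer_descriptors
group by database_id, is_in_bpool_extension
go
-- accumulated waits on Buffer Pool Extension I/O
select * from sys.dm_os_wait_stats where wait_type = 'EC'
go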

To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

If a table’s size is 16GB and, let’s say, about half of that table is in BPE, a serial query (MAXDOP = 1) will access this 8GB one page at a time. The reason is that there is no purpose in copying the 8GB that is in the BPE back to the actual buffer pool, as it would have to replace useful pages that are already there. To read pages from the BPE without affecting the actual buffer pool, each thread uses a 1-page memory buffer to read 1 page from the BPE, processes that page and then issues the next read request to the BPE, and so on. If you observe the read queue length of the SSD volume while such a serial query is executing, you will notice that the read queue length never exceeds 1 and the average bytes per read is always 8KB.

Such queries will be many times faster without Buffer Pool Extension; BPE can slow them down. In our example, with 8GB total memory for SQL Server and 16GB BPE, a range scan covering half of a 16GB table without BPE takes about 15 seconds, whereas with BPE it takes about a minute with MAXDOP = 1. A full scan on such a table without BPE takes about 30 seconds, whereas with BPE it takes more than 5 to 10 minutes in some instances with MAXDOP = 1. The reason is that SQL Server’s access to spinning disks achieves better throughput: the I/Os are not limited to 8KB, they can be up to 512KB (provided the data pages are not fragmented), and outstanding I/Os are not limited to 1.

The data in the Buffer Pool Extension related to a table might not be contiguous and hence it might not be possible to issue larger I/Os.

The best way to mitigate this issue is to use parallel query execution with a large number of worker threads when performing large table or range scans while BPE is enabled. This effectively increases the number of outstanding I/Os from 1 to the number of parallel threads executing the query and keeps the SSD busy with many outstanding I/Os, even though the I/O size does not change and remains at 8KB.
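
As a simple illustration, a query-level MAXDOP hint can force such a scan to run in parallel (the table name here is hypothetical):

-- let several threads issue BPE reads concurrently instead of one thread, one page at a time
select count_big(*) from dbo.LargeTable option (maxdop 8)
go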

NUMA remote (foreign) memory access overhead on Windows, SQL Server and In-Memory OLTP

In NUMA (Non-Uniform Memory Access), processors in the same physical location are grouped into a node which has its own local node memory. A NUMA based system has more than one such node, and these nodes use a shared interconnect mechanism to transfer data between them. In such a system, a processor accessing its local node memory is much faster than the same processor accessing memory from a remote node.

To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

Reference: https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Most operating systems optimize memory allocation on a NUMA based system, such that when a thread executing on a processor allocates memory, the operating system will try to allocate the memory from the processor’s local node unless there is a memory availability issue, in which case it will allocate memory from a remote node.

Using a simple Windows application, we can demonstrate that a thread running on a processor accessing remote memory is 15%+ more expensive than accessing its local memory depending on the processor model.


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

In SQL Server, when memory is allocated for caching data, it is allocated from the local node whenever possible. When a query scans a table serially (MAXDOP = 1), the memory allocated for the data cache always comes from the local node when possible. This might lead to the table residing entirely in one memory node. When a thread executing a query that retrieves data from this table happens to be on a different node, the data access becomes expensive.

Remote memory is referred to as “foreign memory” in SQL Server. Too much foreign node memory allocation by itself does not indicate an issue, as the threads that are accessing this memory can be from any node. SQL Server does some amount of NUMA optimization; for example, when a query executes in parallel, it keeps all the threads on a single node when possible.
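
Foreign memory allocation per node can be checked with the memory node DMV; a quick sketch:

-- per node: pages allocated and how much of that is foreign (remote) memory
select memory_node_id, pages_kb, foreign_committed_kb
from sys.dm_os_memory_nodes
go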

As explained above, when a table is cached on a single NUMA node as part of a large table or range scan and is later accessed from another node as part of a large table or range scan, for example with serial execution (MAXDOP = 1), the performance penalty can be 15%+ depending on the processor model. To mitigate this issue it is recommended to scan the table in parallel (MAXDOP greater than the number of processors in a NUMA node) so the data is not isolated to a single NUMA node.
On the positive side, one can isolate a table to a single NUMA node using Resource Governor and always access that table using processors in that node, avoiding this 15%+ penalty. On a 2 node NUMA system this can lead to a 7.5%+ overall improvement.
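
A minimal sketch of the Resource Governor approach (the pool and workload group names are made up, and a classifier function would still be needed to route the relevant sessions to this group):

create resource pool NodeZeroPool with (affinity numanode = (0))
go
create workload group NodeZeroGroup using NodeZeroPool
go
-- a classifier function (not shown) would route the relevant logins to NodeZeroGroup
alter resource governor reconfigure
go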


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

This applies to SQL Server 2014 In-Memory OLTP technology as well, with a slight variation. With regular, non-memory-optimized tables, data that is not in the cache is loaded into the cache on demand (whenever someone accesses the data). Hence, it is important to distribute the data across all NUMA nodes during query execution for predictable performance.

With memory optimized tables, data is loaded by SQL Server at startup, and during this time SQL Server distributes the data across all NUMA nodes.

With memory optimized tables, the problem of data distribution across NUMA nodes occurs only during the initial data load into the table; as explained above, it disappears after a SQL Server restart. Since queries execute serially (always MAXDOP = 1) in In-Memory OLTP, there is a possibility that all the data inserted by a single thread will reside on a single NUMA node.

Since In-Memory OLTP technology uses native code, there are fewer processor instructions, which magnifies the performance impact of foreign memory access. With In-Memory OLTP technology, this penalty can be 30%+ depending on the processor model.

Usually foreign memory access should not be an issue with In-Memory OLTP usage, as one should not perform large table or range scans.
Again, as with normal tables and large scans, it is recommended to load the table with many threads, which sometimes means splitting large inserts into small parts and executing them in separate batches.
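
As a rough illustration (the table, columns and key ranges are made up), a large insert can be split by key range and run from separate connections so the inserts execute on different threads and the rows are not all allocated from one node:

-- Session 1: load the first half of the key range
insert into dbo.OrdersInMem (OrderId, Amount)
select OrderId, Amount from dbo.OrdersStaging where OrderId between 1 and 1000000

-- Session 2 (separate connection, run at the same time): load the second half
insert into dbo.OrdersInMem (OrderId, Amount)
select OrderId, Amount from dbo.OrdersStaging where OrderId between 1000001 and 2000000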

SQL Server 2012: Indirect Checkpoint (Target Recovery Time), get the real scoop

SQL Server 2012 brings a new feature called Indirect Checkpoint. You can read more about it here: http://msdn.microsoft.com/en-us/library/ms189573.aspx.

With Indirect Checkpoint, you get too many small I/Os

Checkpoint normally writes large blocks of data to disk optimally in a single write operation of up to 256KB, depending on the number of contiguous dirty pages in cache that need to be flushed to disk. Once you turn on the indirect checkpoint feature by setting target recovery time to a non-zero value, the checkpoint writes turn into single page writes, 8KB writes.
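
For reference, indirect checkpoint is controlled per database through the target recovery time setting; for example:

-- enable indirect checkpoint (non-zero target recovery time)
alter database sqlworkshops set target_recovery_time = 60 seconds
go
-- revert to the classic automatic checkpoint behavior
alter database sqlworkshops set target_recovery_time = 0 seconds
go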

This will lead to too many I/Os. We all know a few large I/Os are more efficient than many small I/Os. This may slow down overall write throughput and performance. If you are hosting your server on cloud based virtual servers, it will also increase the cost, as you pay by the number of I/O operations you incur.

Too many outstanding I/Os

Normally checkpoint throttles the number of outstanding I/Os by setting a threshold based on write latency in order not to overload the storage subsystem. But once you enable indirect checkpoint, SQL Server can queue up hundreds of thousands of I/Os. This will not only overload the storage subsystem; if users want to read data from the database, their I/O requests will be queued up way behind the indirect checkpoint writes, and they may experience very slow response times and timeouts.
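
One way to observe the queued writes while the reproduction scripts below run is the pending I/O DMV; a rough sketch:

-- I/Os SQL Server has issued that have not yet completed, grouped by type
select io_type, count(*) as pending_ios
from sys.dm_io_pending_io_requests
group by io_type
go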

You might also see errors like the following in the errorlog, as the large number of I/Os in the queue may take a long time to complete:

A time-out occurred while waiting for buffer latch -- type 4, bp 0000000283508040, page 1:1080, stat 0x60d, database id: 16, allocation unit Id: 0/281474976710656, task 0x00000002321BC558 : 0, waittime 300 seconds, flags 0x1a, owning task 0x00000002321BC558. Not continuing to wait.

High CPU usage

When you enable indirect checkpoint, you might encounter high CPU usage. One of the CPUs performing the writes might spin at 100%. This can be monitored by measuring DP_LIST spins as shown below; more than 4 billion spins were noticed by running the below test just once. The writers will also use much more CPU.

select * from sys.dm_os_spinlock_stats where name = 'DP_LIST'

You will notice the updates are much slower when indirect checkpoint is enabled.

In extreme cases, when you try to restart the SQL Server service, it will take a long time to stop. Meanwhile you can neither connect nor start the service, as the service is already in the process of stopping, flushing the dirty pages before shutdown.

You can use the below scripts to reproduce the scenario:

use master
if exists (select * from sys.databases where name = 'sqlworkshops')
    drop database sqlworkshops
go

use master
create database sqlworkshops on primary (name = 'sqlworkshops_1', filename = 'P:\sqlworkshops_1.mdf', size = 5GB)
    log on (name = 'sqlworkshops_log_1', filename = 'Q:\sqlworkshops_log_1.ldf', size = 64MB)
go

use master
alter database sqlworkshops set recovery simple
go

use sqlworkshops
if exists (select * from sys.objects where name = 'tab71')
    drop table tab71
go

use sqlworkshops
create table tab71 (c1 int primary key clustered, c2 int, c3 char(2000))
insert into tab71 with (tablock) select top 500000 row_number() over (order by c1.object_id, c2.object_id) as c1, 1 as c2, replicate('a', 2000) as c3
    from sys.all_columns c1, sys.all_columns c2
update statistics tab71 with fullscan
checkpoint
go

use sqlworkshops
if exists (select * from sys.objects where name = 'tab72')
    drop table tab72
go

use sqlworkshops
create table tab72 (c1 int primary key clustered, c2 int, c3 char(2000))
insert into tab72 with (tablock) select top 500000 row_number() over (order by c1.object_id) as c1, 1 as c2, replicate('a', 2000) as c3
    from sys.all_columns c1, sys.all_columns c2
update statistics tab72 with fullscan
checkpoint
go

--Session 1:
use sqlworkshops
update tab71 set c2 = 2
go

use sqlworkshops
update tab71 set c2 = 3
go

--Session 2 (optional):
use sqlworkshops
update tab72 set c2 = 2
go

use sqlworkshops
update tab72 set c2 = 3
go

--To monitor DIRTY_PAGE_POLL waits:
select * from sys.dm_os_wait_stats
where wait_type = 'DIRTY_PAGE_POLL'
go