Slow performance due to SQL Server 2014 Buffer Pool Extension (BPE) and serial queries (MAXDOP 1)

The SQL Server 2014 Buffer Pool Extension feature can extend the buffer pool space on to Solid State Drive (SSD), where data and index pages can be cached. There is no possibility of data loss in the event of SSD failure as the cached data is clean (unmodified pages).

Buffer Pool Extension can provide better random I/O performance by decreasing the latency for such I/O operations using SSD.
Buffer Pool Extension is not suitable for large sequential reads with lesser degree of parallelism. When the required data pages are in BPE, single threads are performing only 1 I/O operation at a given time and that I/O operation is limited to 1 page (8KB), leading to too many small I/Os and decreased throughput. The wait type used while waiting for BPE I/Os to complete is ‘EC’.

To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

If a table’s size is 16GB and, let’s say, about half of that table is in BPE, a serial query (MAXDOP = 1) will access this 8GB, 1 page at a time. The reason being that there is no purpose in copying the 8GB that is in the BPE back to the actual buffer pool as it will have to replace useful pages that are already there. To read pages from the BPE without affecting the actual buffer pool, every thread is using 1 page of memory buffer to read 1 page from the BPE, processes that page and then issues the next read request to the BPE and so on. If you observe the read queue length of the SSD volume, you will notice, while such a serial query is executing, the read queue length will never exceed 1 and average bytes per read will always be 8KB.

Such queries will be many times faster without Buffer Pool Extension; BPE can slow down such queries. In our example, with 8GB total memory for SQL Server and 16GB BPE, a range scan covering half of a 16GB table without BPE takes about 15 seconds, whereas with BPE it takes about a minute with MAXDOP = 1. A full scan on such a table without BPE takes about 30 seconds, whereas with BPE it takes more than 5 to 10 minutes in some instances with MAXDOP = 1. The reason being that SQL Server’s access to spinning disks leads to better throughput as the I/Os are not limited to 8KB, they can be up to 512KB (provided the data pages are not fragmented) and outstanding I/Os are not limited to 1.

The data in the Buffer Pool Extension related to a table might not be contiguous and hence it might not be possible to issue larger I/Os.

The best way to mitigate this issue is to use parallel query execution with a large number of worker threads when performing large table or range scans when BPE is enabled. This effectively increases the number of outstanding I/Os from 1 to the number of parallel threads executing the query and keeps the SSD busy with many outstanding I/Os, even though the I/O size does not change and remains at 8KB.

NUMA remote (foreign) memory access overhead on Windows, SQL Server and In-Memory OLTP

In NUMA (Non-Uniform Memory Access), processors in the same physical location are grouped in a node which has its own local node memory. In a NUMA based system, there will be more than one such node and these nodes will use a shared interconnect mechanism to transfer data between them. In such a case, a processor accessing its local node memory will be much faster than the same processor accessing memory from remote node.

To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

Reference: https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Most operating systems optimize memory allocation on a NUMA based system, such that when a thread executing on a processor allocates memory, the operating system will try to allocate the memory from the processor’s local node unless there is a memory availability issue, in which case it will allocate memory from a remote node.

Using a simple Windows application, we can demonstrate that a thread running on a processor accessing remote memory is 15%+ more expensive than accessing its local memory depending on the processor model.


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

In SQL Server, when memory is allocated for caching data, it is allocated from the local node whenever it is possible. In cases where a query scans a table serially (MAXDOP = 1) and where memory is allocated for data cache, memory is always allocated from the local node when possible. This might lead the table to reside entirely in one memory node. When a thread executing a query, retrieving data from this table, happens to be on a different node, the data access becomes expensive.

Remote memory is referred to as “foreign memory” in SQL Server. Too much foreign node memory allocation by itself does not indicate an issues as the threads that are accessing this memory can be from any node. SQL Server does some amount of optimization regarding NUMA, like when a query executes in parallel, it keeps all the threads on a single node when possible.

Like explained above, when a table is cached on a single NUMA node part of a large table or range scan and later accessed from another node part of the large table or range scan, like using serial execution (MAXDOP = 1), the performance penalty can be 15%+ memory depending on the processor model. To mitigate this issue it is recommended to scan the table in parallel (MAXDOP greater than the number of processors in a NUMA node) so the data is not isolated to a single NUMA node.
On the positive side, one can isolate a table to a single NUMA node using Resource Governor and access that table always using processors in that node and avoid this 15%+ penalty. On a 2 node NUMA system this can lead to 7.5%+ overall improvement.


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

This applies to SQL Server 2014 In-Memory OLTP technology as well, with a slight variation. With regular tables, non-memory optimized tables, where data is not in the cache; it is loaded into the cache on demand (whenever someone accesses the data). Hence, it is important to distribute the data across all NUMA nodes during query execution for predictable performance.

With memory optimized tables, data is loaded by SQL Server at startup, during this time SQL Server distributes data across all NUMA nodes.

The problem with memory optimized tables and data distribution across NUMA nodes occurs only during initial data load into the table. This problem disappears after SQL Server restart as explained above. Since queries execute serially (always MAXDOP = 1) in In-Memory OLTP, there is a possibility that all the data inserted by a single thread will reside on a single NUMA node.

Since In-Memory OLTP technology uses native code, there are less processor instructions; this magnifies the performance impact on foreign memory access. With In-Memory OLTP technology, this penalty can be even 30%+ depending on the processor model.

Usually foreign memory access should not be an issue with In-Memory OLTP usage as one should not perform large table or range scans.
Again, like with normal table and large scans, it is recommended to load the table with many threads, which sometimes means splitting large inserts into small parts and executing then in separate batches.