NUMA remote (foreign) memory access overhead on Windows, SQL Server and In-Memory OLTP

In NUMA (Non-Uniform Memory Access), processors in the same physical location are grouped in a node which has its own local node memory. In a NUMA based system, there will be more than one such node and these nodes will use a shared interconnect mechanism to transfer data between them. In such a case, a processor accessing its local node memory will be much faster than the same processor accessing memory from remote node.

To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

Reference: https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Most operating systems optimize memory allocation on a NUMA based system, such that when a thread executing on a processor allocates memory, the operating system will try to allocate the memory from the processor’s local node unless there is a memory availability issue, in which case it will allocate memory from a remote node.

Using a simple Windows application, we can demonstrate that a thread running on a processor accessing remote memory is 15%+ more expensive than accessing its local memory depending on the processor model.


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

In SQL Server, when memory is allocated for caching data, it is allocated from the local node whenever it is possible. In cases where a query scans a table serially (MAXDOP = 1) and where memory is allocated for data cache, memory is always allocated from the local node when possible. This might lead the table to reside entirely in one memory node. When a thread executing a query, retrieving data from this table, happens to be on a different node, the data access becomes expensive.

Remote memory is referred to as “foreign memory” in SQL Server. Too much foreign node memory allocation by itself does not indicate an issues as the threads that are accessing this memory can be from any node. SQL Server does some amount of optimization regarding NUMA, like when a query executes in parallel, it keeps all the threads on a single node when possible.

Like explained above, when a table is cached on a single NUMA node part of a large table or range scan and later accessed from another node part of the large table or range scan, like using serial execution (MAXDOP = 1), the performance penalty can be 15%+ memory depending on the processor model. To mitigate this issue it is recommended to scan the table in parallel (MAXDOP greater than the number of processors in a NUMA node) so the data is not isolated to a single NUMA node.
On the positive side, one can isolate a table to a single NUMA node using Resource Governor and access that table always using processors in that node and avoid this 15%+ penalty. On a 2 node NUMA system this can lead to 7.5%+ overall improvement.


To receive the example scripts used in the video, subscribe to our mailing list: https://newsletter.sqlworkshops.com

This applies to SQL Server 2014 In-Memory OLTP technology as well, with a slight variation. With regular tables, non-memory optimized tables, where data is not in the cache; it is loaded into the cache on demand (whenever someone accesses the data). Hence, it is important to distribute the data across all NUMA nodes during query execution for predictable performance.

With memory optimized tables, data is loaded by SQL Server at startup, during this time SQL Server distributes data across all NUMA nodes.

The problem with memory optimized tables and data distribution across NUMA nodes occurs only during initial data load into the table. This problem disappears after SQL Server restart as explained above. Since queries execute serially (always MAXDOP = 1) in In-Memory OLTP, there is a possibility that all the data inserted by a single thread will reside on a single NUMA node.

Since In-Memory OLTP technology uses native code, there are less processor instructions; this magnifies the performance impact on foreign memory access. With In-Memory OLTP technology, this penalty can be even 30%+ depending on the processor model.

Usually foreign memory access should not be an issue with In-Memory OLTP usage as one should not perform large table or range scans.
Again, like with normal table and large scans, it is recommended to load the table with many threads, which sometimes means splitting large inserts into small parts and executing then in separate batches.