Order of duplicate keys in non-clustered indexes [SQL Server]

qncylg1j  于 5个月前  发布在  SQL Server
关注(0)|答案(2)|浏览(65)

I have a SQL server table with ~10-100s of millions of rows which acts as a fact table in a star schema. An example definition would look like this:

CREATE TABLE fact (
    [ImportID] INT NOT NULL,

    [Dim1ID] INT NOT NULL,
    [Dim2ID] INT NOT NULL,
    [Dim3ID] INT NOT NULL,
    [Dim4ID] INT NOT NULL,
    [dData] FLOAT(53)

    CONSTRAINT PK_Data_Dims PRIMARY KEY CLUSTERED ([Dim4ID],[Dim3ID],[Dim2ID],[Dim1ID]),
    INDEX IX_Data_Imports NONCLUSTERED ([ImportID])
)

I need this table to be performant when:

  • Merging new records (enforce the PK constraint by overwriting existing data or inserting new data)
  • Performing the following query:
SELECT [Dim1ID], [Dim2ID], [Dim3ID], [Dim4ID], [dData]
FROM fact
WHERE [ImportID] = X
ORDER BY [Dim4ID],[Dim3ID],[Dim2ID],[Dim1ID]

I have included the indexes I would expect to want in the example definition but ORDER BY performance would seem to depend on how duplicate keys in the nonclustered index are ordered.

I have also considered using a heap table with non-clustered indices for both. (In practice, there are 13 INT dimension IDs so the key size with clustered PK_Data_Dims will be rather large.) I think this is likely to be slower, however, as the second dimension in the index is correlated with the time the records are inserted. (The first does not have many distinct values so should not fragment the imported records too greatly.) Therefore, I would expect there to be some spatial locality of the records in an import when they are inserted into the clustered index.

I may be greatly overthinking this but I see a few more situations like this coming up for me so I would appreciate some input on best practices or some additional factors that I might want to consider.

vdgimpew

vdgimpew1#

I have a SQL server table with ~10-100s of millions of rows which acts as a fact table in a star schema

best practices or some additional factors that I might want to consider.

General best practice here is to use a clustered columnstore index, and optionally a non-clustered primary key index, and additional non-clustred indexes if really necessary.

An additional dimension of table geometry can be introduced for very large tables by partitioning on one of the keys of the primary key index.

Clutered columnstores are just so much smaller, and can be scanned column-wise in batch mode. So really are the default for data warehouse fact tables.

nwnhqdif

nwnhqdif2#

A non-clustered index automatically includes the clustering key, even if you don't specify it. It is added in the same order as the clustering key.

You can however change the order explicitly, by adding those columns to your non-clustered index's key.

You can see this in action in this fiddle .

So in your particular case:

  • merging by primary key will always be OK, as the primary key always uniquely identifies a row.
  • The query will also be OK, as the non-clustered index will probably be used (to satisfy the WHERE ) and the ORDER BY requirement is in the same order as the clustering key. Again, if the ORDER BY changes then you can either reorder the primary/clustering key (with obviously no loss in uniqueness) or you can reorder the columns in the NCI by adding them explicitly in the key.

The size of the clustering key is a bit of a concern: it shouldn't be very wide normally. This is mainly because it can increase the depth of the B+tree, but also because the clustering key is used in every NC index also. This is less of a concern in your case, as you appear to need those columns in the NCI anyway.

相关问题