SQL Server GROUP BY works but I don't think it should

6yt4nkrj  于 4个月前  发布在  其他
关注(0)|答案(2)|浏览(63)

From https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16 it says "However, each table or view column in any nonaggregate expression in the list must be included in the GROUP BY list". Exactly as you'd expect (there seem to be a couple of exceptions, probably MSSQL specific, regarding string concatenation but they don't seem to be relevant here).

At work I intended to write this:

SELECT 
    max(STATS_DATE(object_id, stats_id)), 
    object_name(object_id)
FROM sys.stats  
GROUP BY object_name(object_id)

Which works as expected. What I actually wrote by accident was this

SELECT 
    max(STATS_DATE(object_id, stats_id)), 
    object_name(object_id)
FROM sys.stats  
GROUP BY object_id -- should have been object_name(object_id)

The SELECT contains object_name(object_id) but the GROUP BY contains object_id , which is definitely not in the SELECT. ISTM this should have been rejected, but it runs fine.

I don't get it. Is there some support for functional dependencies that I'm not aware of? Is there even a functional dependency between object_id and object_name(object_id) (I don't think so because object_name(object_id) might not necessarily be unique, although my understanding of FDs is pretty poor).

What's going on?

k3bvogb1

k3bvogb11#

With

GROUP BY object_id

you get one result row per object_id, say IDs 1, 2, 3 and 4.

With

SELECT object_name(object_id)

you get their object names, e.g. 'Object 1', 'Object 2', 'Object 3', 'Object 4'.

As each result row refers to one object ID, you get the name for that object ID in your result row. Nothing wrong with this.

It would be different with

SELECT object_id
...
GROUP BY object_name(object_id)

Here you'd group by the object name of the object IDs. Two or more IDs could have the same name. It is not possible hence to select the object ID, because it is not known and not necessarily unique per object name.

But this is not the case in your query.

tnkciper

tnkciper2#

Is there some support for functional dependencies that I'm not aware of?

Sort of, but no.

What you are allowed to do is use the GROUP BY value in any way you like in the SELECT , with any functions or calculations, as long as you are not using any columns or values that have not been grouped.

So the below query works, because object_id has been grouped, and you are allowed to pass a grouping value to a function.

SELECT
    max(STATS_DATE(object_id, stats_id)), 
    object_name(object_id)
    -- whatever
FROM sys.stats  
GROUP BY object_id

You can add two GROUP BY values together, or pass them to a function, or cast them etc.

What you cannot do is do the calculation, then group that, and then use the original value from before the calculation. The original value is not present in the GROUP BY so cannot be used, only the actual value that was used.

SELECT 
    max(STATS_DATE(object_id, stats_id)), 
    object_id
FROM sys.stats  
GROUP BY object_name(object_id)

Two further notes:

  • Don't do what you intended to do and group by object_name , because

  • grouping by a nvarchar is less efficient.

  • the compiler does not understand any kind of functional dependency, and therefore may hash or sort where it doesn't need to.

  • Ideally don't use this type of aggregation at all, as it often forces a full grouping. Instead use APPLY to calculate a value per object.

  • Furthermore, you shouldn't use those convenience functions if you can avoid it, as they have a number of issues, primarily around isolation levels and locking. Instead use the catalog views, such as sys.tables and dynamic management views such as sys.dm_db_stats_properties .

Instead, use the following query:

SELECT 
    MaxStatsDate, 
    t.name
FROM sys.tables t
CROSS APPLY (
    SELECT MAX(p.last_updated) AS MaxDate
    FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) p
) s;

相关问题