SQL Server How can I insert long string into UTF8-collated varchar(N) column?

lokaqttq · asked 6 months ago

I have data in a VARCHAR(MAX) source column that I need to insert into another table as VARCHAR(N) .
The collation I need to use is UTF-8 (specifically Czech_100_CI_AS_SC_UTF8 ).
The problem is that even if I cut the data down to the maximum permissible length (using the LEFT function, SUBSTRING , or whatever), the insert fails for some data, with
Msg 2628, Level 16, State 1, Line 25
String or binary data would be truncated in table 'tablename', column 'columnname'. Truncated value: 'sometext'.

My question is, how to properly shorten the text, so it will fit, but still keep as much of it as possible?
The actual usable length of the string depends on the characters used: for Latin-only text the full length is available; a few accented characters shorten the usable length a little; fully non-Latin text (e.g. Japanese) may cut the usable length to a half or a third.

Steps to reproduce:

DROP TABLE IF EXISTS [#tmpdl]
GO
CREATE TABLE [#tmpdl] ([Txt] VARCHAR(10) COLLATE Czech_100_CI_AS_SC_UTF8 NULL)
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT('123456789abcd', 10)) --this works
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT('123456789ábcd', 10)) --this fails
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT(N'一二三四', 10)) --this also fails
GO
syqv5f0l 1#

even if I cut down the data up to maximum permissible length (using LEFT function, or SUBSTRING , or whatever), the insert fails for some data

With LEFT or SUBSTRING you are cutting the string down to the specified number of characters, whereas the 10 in varchar(10) specifies the number of bytes. After truncating with LEFT or SUBSTRING you may therefore still be left with a string of more than 10 bytes - which, for the purpose of the insert, is the same as if you had never truncated it at all.
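A quick way to see the mismatch is to compare LEN (which counts characters) with DATALENGTH (which counts bytes) on the truncated value; this uses the accented sample string from the question:

```sql
-- LEN counts characters, DATALENGTH counts bytes; under a UTF8 collation
-- the accented 'á' occupies 2 bytes, so 10 characters become 11 bytes.
SELECT LEN(CONVERT(varchar(20), LEFT(N'123456789ábcd', 10) COLLATE Czech_100_CI_AS_SC_UTF8))        AS CharCount,  -- 10
       DATALENGTH(CONVERT(varchar(20), LEFT(N'123456789ábcd', 10) COLLATE Czech_100_CI_AS_SC_UTF8)) AS ByteCount;  -- 11
```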

A simple way to truncate the string so that it fits is to use CAST or CONVERT : the conversion cuts the string at the byte limit of the target type without splitting a multi-byte character, and the result keeps the specified collation:

INSERT INTO [#tmpdl] ([Txt]) VALUES (cast(N'123456789ábcd' COLLATE Czech_100_CI_AS_SC_UTF8 as varchar(10)));
INSERT INTO [#tmpdl] ([Txt]) VALUES (cast(N'一二三四' COLLATE Czech_100_CI_AS_SC_UTF8 as varchar(10)));

Do note that this requires SQL Server 2019 or later, which is when UTF-8 collations were introduced.
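Applied to the original VARCHAR(MAX)-to-VARCHAR(N) copy, the same pattern might look like this sketch (the table and column names are placeholders, not from the question):

```sql
-- Hypothetical table/column names; the CAST truncates to the target's
-- byte limit without splitting a multi-byte UTF-8 character.
INSERT INTO dbo.TargetTable (ColumnName)
SELECT CAST(SourceText COLLATE Czech_100_CI_AS_SC_UTF8 AS varchar(10))
FROM dbo.SourceTable;
```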

pxq42qpu 2#

The 10 in varchar(10) under a UTF8 collation denotes the number of bytes that can be stored, not the number of characters. In the value '123456789abcd' all characters are ASCII and take 1 byte each. In '123456789ábcd' , however, the character á takes 2 bytes; this makes the DATALENGTH of the string you try to INSERT 11 bytes, not 10. It's a similar story for your last string, '一二三四' , where each character takes 3 bytes:

SELECT V.YourString,
       DATALENGTH(V.YourString),
       DATALENGTH(LEFT(V.YourString,10))
FROM (VALUES('123456789abcd' COLLATE Czech_100_CI_AS_SC_UTF8),
            (CONVERT(varchar(20),N'123456789ábcd' COLLATE Czech_100_CI_AS_SC_UTF8)),
            (CONVERT(varchar(20),N'一二三四' COLLATE Czech_100_CI_AS_SC_UTF8)))V(YourString);
YourString     DataLength  10CharsDataLength
-------------  ----------  -----------------
123456789abcd          13                 10
123456789ábcd          14                 11
一二三四                12                 12

You need to ensure that the column you define is long enough for all the bytes, not just the number of characters:

DROP TABLE IF EXISTS #tmpdl
GO
CREATE TABLE [#tmpdl] ([Txt] VARCHAR(12) COLLATE Czech_100_CI_AS_SC_UTF8 NULL)
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT('123456789abcd', 10)) --works (10 bytes)
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT('123456789ábcd', 10)) --now works (11 bytes)
GO
INSERT INTO [#tmpdl] ([Txt]) VALUES (LEFT(N'一二三四', 10)) --now works (12 bytes)
GO
SELECT *
FROM  #tmpdl;
GO
DROP TABLE #tmpdl
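If instead you want the column to always hold a fixed number of characters regardless of script, you have to size it for the worst case: a UTF-8 character can take up to 4 bytes, so a column meant to hold any 10 characters needs varchar(40). A minimal sketch (table name is illustrative):

```sql
-- Worst-case sizing sketch: 4 bytes per UTF-8 character, 10 characters.
CREATE TABLE [#tmpdl_wide] ([Txt] VARCHAR(40) COLLATE Czech_100_CI_AS_SC_UTF8 NULL);
INSERT INTO [#tmpdl_wide] ([Txt]) VALUES (LEFT(N'一二三四五六七八九十', 10)); -- 30 bytes, fits
DROP TABLE [#tmpdl_wide];
```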
