Read 9 million XML files and transfer the data into SQL Server

ipakzgxi · asked 6 months ago · tagged SQL Server
Follow (0) | Answers (2) | Views (54)

I am using Visual Studio 2022. I have 9 million XML files and need to transfer the data into SQL Server.

I am reading in the XML through XmlDocument and XmlNodeList. It works fine, but the reading step takes a very long time - around 4 hours - while the SqlBulkCopy step itself takes only about 5 minutes.

I am reading the XML files one by one. How can I read multiple XML files in parallel, one per thread? I think that with threading I could read all the XML files much faster.

My code:

DateTime startTime = DateTime.Now;
_xmlfiles = Directory.GetFiles(source, "*.xml", SearchOption.AllDirectories).ToList();                
_xmlfiles.Remove(Path.Combine(Path.GetFullPath(source), "abc.xml"));
_xmlfiles.Remove(Path.Combine(Path.GetFullPath(source), "xyz.xml"));
            
WriteLog(_logPath, string.Format("{0} | {1} XML files found in the given path.", DateTime.Now, _xmlfiles.Count));
DataTable _table = /* code for get table structure */;

foreach (string _xmlfile in _xmlfiles)
{
    _substance = Path.GetFileNameWithoutExtension(_xmlfile); // file name (without extension) identifies the substance
    LoadXml(_xmlfile, _table);                    
}
           
WriteLog(_logPath, string.Format("{0} | Reading the files is complete; found {1} phrases in total.", DateTime.Now, _table.Rows.Count));

DataSet ds = new DataSet();
ds.Tables.Add(_table);

_sqlWriter = new SQLWriter();

WriteLog(_logPath, string.Format("{0} | Writing phrases into the database started...", DateTime.Now));

_sqlWriter.AutoSqlBulkCopy(ds);

The LoadXml method:

XmlDocument _casFile = new XmlDocument();
_casFile.Load(xmlfile);

XmlNodeList _instanceList = _casFile.SelectNodes("Specification/Substance/Property/Instance");

foreach (XmlNode _instance in _instanceList)
{
    string _day = _instance.Attributes["DD"].Value;
    string _month = _instance.Attributes["MM"].Value;
    string _year = _instance.Attributes["YYYY"].Value;
    string _active =_instance.Attributes["Active"].Value;
    _uniqueID = _instance.Attributes["UniqueID"].Value;                    
    _baseCode = _instance.Attributes["LitSource"].Value;                    

    XmlNodeList _phraseList = _instance.SelectNodes("Fields/Phrases");

    foreach (XmlNode _phrase in _phraseList)
    {
        string _Characteristic = _phrase.Attributes["ID"].Value;
        XmlNodeList _childList = _phrase.ChildNodes;

        foreach(XmlNode _child in _childList)
        {
            DataRow row = table.NewRow();
            row["Substance"] = _substance;
            row["UniqueID"] = _uniqueID;
            row["Characteristic"] = _Characteristic;
            row["PhraseID"] = _child.Attributes["Key"].Value;

            if (_child.Attributes["Code"] != null)
            {
                row["Code"] = _child.Attributes["Code"].Value;
            }
            else
            {
                row["Code"] = string.Empty;
            }

            row["BaseCode"] = _baseCode;                            
            row["Description"] = _child.InnerText;
            row["Active"] = _active;
            row["DD"] = _day;
            row["MM"] = _month;
            row["YYYY"] = _year;

            table.Rows.Add(row);
        }                        
    }
}
luaexgnf #1

You could perform all of the XML parsing and table insert operations on the SQL Server side by defining a stored procedure that encapsulates all of that logic. The C# application could then read the XML file and just pass the entire XML string to that stored procedure.

I am not sure what the performance difference would be, but I suspect the set-based operations on the SQL Server side would be an improvement. It would also reduce the number of objects created on the C# side and the number of round trips between the application and the SQL Server instance.

Something like:

CREATE PROCEDURE LoadCasTableFromXMl @CasXml XML
AS
    INSERT CasTable(Substance, UniqueID, Characteristic, PhraseID, Code, 
                    BaseCode, Description, Active, DD, MM, YYYY)
    SELECT
        'Unknown _substance' AS Substance,
        I.Instance.value('(@UniqueID)[1]', 'NVARCHAR(MAX)') AS UniqueID,
        P.Phrase.value('(@ID)[1]', 'NVARCHAR(MAX)') AS Characteristic,
        C.Child.value('(@Key)[1]', 'NVARCHAR(MAX)') AS PhraseID,
        ISNULL(C.Child.value('(@Code)[1]', 'NVARCHAR(MAX)'), '') AS Code,
        I.Instance.value('(@LitSource)[1]', 'NVARCHAR(MAX)') AS BaseCode,
        C.Child.value('(text())[1]', 'NVARCHAR(MAX)') AS Description,
        I.Instance.value('(@Active)[1]', 'NVARCHAR(MAX)') AS Active,
        I.Instance.value('(@DD)[1]', 'NVARCHAR(MAX)') AS DD,
        I.Instance.value('(@MM)[1]', 'NVARCHAR(MAX)') AS MM,
        I.Instance.value('(@YYYY)[1]', 'NVARCHAR(MAX)') AS YYYY
    FROM @CasXml.nodes('/Specification/Substance/Property/Instance') I(Instance)
    CROSS APPLY I.Instance.nodes('./Fields/Phrases') P(Phrase)
    CROSS APPLY P.Phrase.nodes('./*') C(Child)

Or perhaps:

CREATE PROCEDURE LoadCasTableFromXMl @CasXmlText NVARCHAR(MAX)
AS
    DECLARE @CasXml XML = @CasXmlText
    ...

Lacking details, I've just assumed NVARCHAR(MAX) for all columns. Hopefully your table has properly typed columns, and the above code should be modified to use the correct types. (I also hope that your target table does not store dates as separate YYYY, MM, and DD columns; that is bad practice, and you should use a single column properly typed as DATE.)

Sample results:
| Substance | UniqueID | Characteristic | PhraseID | Code | BaseCode | Description | Active | DD | MM | YYYY |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| Unknown _substance | 111 | 1 | Key1 | Code1 | AAA | Description1 | True | 31 | 12 | 2023 |
| Unknown _substance | 111 | 1 | Key2 | | AAA | Description2 | True | 31 | 12 | 2023 |
| Unknown _substance | 222 | 1 | Key3 | Code3 | BBB | Description3 | False | 01 | 01 | 2024 |

See this db<>fiddle for a demo.
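On the C# side, the call could then look something like the sketch below. This is an assumption-laden illustration: `connectionString` and `xmlfile` are placeholders, and the XML is passed as a typed `SqlDbType.Xml` parameter so SQL Server receives it as XML rather than plain text.

```csharp
using System.Data;
using System.IO;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient on older projects

// Hypothetical caller: read one XML file and hand it to the stored procedure.
string xmlText = File.ReadAllText(xmlfile);

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("LoadCasTableFromXMl", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.Parameters.Add("@CasXml", SqlDbType.Xml).Value = xmlText;

    connection.Open();
    command.ExecuteNonQuery();
}
```

With 9 million files you would keep the connection open and reuse the command across files rather than reconnecting per file.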

cmssoen2 #2

You could use LINQ with the .AsParallel() method, provided you are sure that the order in which rows are inserted into the database does not matter.
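A minimal sketch of that idea, under one important assumption: DataTable is not thread-safe for writes, so each parallel task must fill its own table, with the results merged afterwards. `ParseFile` here is a hypothetical variant of the asker's LoadXml that returns a new DataTable instead of mutating a shared one.

```csharp
// Parse files in parallel; each file produces its own DataTable.
List<DataTable> partials = _xmlfiles
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(file => ParseFile(file)) // hypothetical: fills and returns a fresh table
    .ToList();

// Merge the per-file tables into one table with the same schema,
// then hand the merged table to SqlBulkCopy as before.
DataTable merged = partials[0].Clone(); // schema only, no rows
foreach (DataTable part in partials)
    merged.Merge(part);
```

Note that PLINQ makes no ordering guarantee, which is exactly why this only applies when the insert order does not matter.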
