Java: how do I store the words extracted from a PDF file in MySQL for indexing?

Posted by o7jaxewo on 2021-06-18 in MySQL

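I wrote a simple Java application that locates a PDF file, extracts all of its text, and stores the words in a HashSet.
I also created a MySQL database containing a table with the columns ID, Location PATH, and Word. Location PATH should hold the path of the PDF that the words were read and extracted from, for example "D:/PDF/my.pdf".
Word should hold every word extracted from that particular PDF file, that is, the contents of the HashSet.
My question is: how do I store the HashSet in my database table so that, when the program runs, each word is kept together with the path of the file it came from?
The code is below: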

public class Main {

    public static void main(String[] args) throws Exception {

        HashSet<String> uniqueWords = new HashSet<>();
        try (PDDocument document = PDDocument.load(new File("D:/PDF/my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                String lines[] = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    String[] words = line.split(" ");

                    for (String word : words) {
                        uniqueWords.add(word);

                    }

                }
               System.out.println(uniqueWords);

            }
        } catch (IOException e){
            System.err.println("Exception while trying to read pdf document - " + e);
        }

        MysqlAccess connection=new MysqlAccess();
        connection.readDataBase();

    }

}

The SQL connection code:

public class MysqlAccess {
    private Connection connect = null;
    private Statement statement = null;
    private PreparedStatement preparedStatement = null;
    private ResultSet resultSet = null;

    public void readDataBase() throws Exception {
        try {
            // This will load the MySQL driver, each DB has its own driver
            Class.forName("com.mysql.jdbc.Driver");
            // Setup the connection with the DB
            connect = DriverManager
                    .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                            + "user=root&password=root");

            // Statements allow to issue SQL queries to the database
            statement = connect.createStatement();
            System.out.print("Connected");
            // Result set get the result of the SQL query

            preparedStatement = connect
                    .prepareStatement("insert into  fulltext_ltat.index_detail values (default, ?, ?)");

            preparedStatement.setString(1, "D:\\Full Text Indexing\\testIndex");
            preparedStatement.setString(2, "test");
            preparedStatement.executeUpdate();
            resultSet = statement
                    .executeQuery("select * from fulltext_ltat.index_detail");

            writeResultSet(resultSet);
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially before the first data set
        while (resultSet.next()) {
            // It is possible to get the columns via name
            // also possible to get the columns via the column number
            // which starts at 1
            // e.g. resultSet.getString(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);

        }
    }

    private void close() {
        try {
            if (resultSet != null) {
                resultSet.close();
            }

            if (statement != null) {
                statement.close();
            }

            if (connect != null) {
                connect.close();
            }
        } catch (Exception e) {

        }
    }

}

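Any suggestions would be much appreciated.
Edit: for anyone who does not follow what I mean:
Suppose I have a file called "my.pdf". When I extract the words in that file and store the unique words in a HashSet, I want them to be stored in the table I created in MySQL as well. For example, if the PDF is located at "D:/Folder/my.pdf", the table should look like this: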

ID Location                Word 
1 "D:/FOLDER/my.pdf"       family 
2 "D:/FOLDER/my.pdf"       chicken....... and it goes on and on

Is this feasible?

Answer 1, by axr492tv

Here are some suggestions that may be useful to you:

package com.test;

import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Set;

public class PopulateDatabase {

    // Read all the words in the file
    private static Set<String> getWordsFromFile(File pdfFile) throws IOException  {
        Set<String> uniqueWords = new HashSet<String>();
        // PDDocument document = PDDocument.load(pdfFile);
        // ...
        //   for (String word : words) {
        //      uniqueWords.add(word);
        //      ...
        return uniqueWords;
    }

    private static void updateDB(File pdfFile, Set<String> uniqueWords) throws ClassNotFoundException, SQLException {
        // Load the MySQL driver (class name as used in the question)
        Class.forName("com.mysql.jdbc.Driver");

        // try-with-resources closes the statement and the connection even if an insert fails
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?user=root&password=root");
             PreparedStatement preparedStatement =
                 conn.prepareStatement("insert into  fulltext_ltat.index_detail values (default, ?, ?)")) {

            // Insert one row per word, reusing the same prepared statement
            for (String word : uniqueWords) {
                preparedStatement.setString(1, pdfFile.getAbsolutePath());
                preparedStatement.setString(2, word);
                preparedStatement.executeUpdate();
            }
        }
    }

    public static void main(String[] args) {
        // Read filepath for .pdf from cmd-line
        if (args.length == 0) {
            System.out.println("USAGE: PopulateDatabase <myPdfFiles>");
            return;
        }

        // Convert to Java "File" object.
        File pdfFile = new File(args[0]);
        if (!pdfFile.exists() ) {
            System.out.println("ERROR: " + args[0] + " does not exist!");
            return;
        }

        try {   
            // Parse file
            Set<String> words = getWordsFromFile(pdfFile);

            // Update database
            updateDB(pdfFile, words);

            // Done
            System.out.println("Done: #/words: " + words.size() + ", pdfFile: " + pdfFile.getName());
        } catch (Exception e) {
            // Report the failure instead of silently ignoring it
            System.err.println("ERROR: " + e);
        }
    }

}
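
The word-splitting step that getWordsFromFile leaves out could look roughly like the sketch below. It is only an illustration: the class name WordTokenizer and the method uniqueWords are made up here, and lower-casing words and dropping punctuation are assumptions about what should count as a "word". Unlike the split(" ") in the question, it does not produce empty strings when a line contains consecutive spaces.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the original answer.
public class WordTokenizer {

    // Returns the distinct words of the text, lower-cased, in order of first appearance.
    public static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        // Split on any run of characters that is neither a letter nor a digit
        // (covers spaces, line breaks and punctuation in one pass).
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            // A leading separator yields one empty token; skip it.
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [family, chicken]
        System.out.println(uniqueWords("Family,  chicken.\nfamily Chicken"));
    }
}
```

The string returned by PDFTextStripper.getText(document) in the question could be passed straight to uniqueWords, and the resulting set handed to updateDB.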

You can modify the code slightly to accept multiple files or wildcards. You can also add another method to query the rows that were added.
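
One possible way to handle several files is to take a directory instead of a single file and pick out the PDFs in it. The sketch below uses only java.nio.file; the class name PdfFileFinder and its methods are hypothetical, and matching on the ".pdf" / ".PDF" extension is an assumption about how the files are named.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
public class PdfFileFinder {

    // Glob applied to the file name only; both spellings are listed because
    // glob case sensitivity depends on the platform.
    private static final PathMatcher PDF_GLOB =
            FileSystems.getDefault().getPathMatcher("glob:*.{pdf,PDF}");

    // True if the last element of the path looks like a PDF file name.
    public static boolean isPdf(Path path) {
        Path name = path.getFileName();
        return name != null && PDF_GLOB.matches(name);
    }

    // Lists the regular files directly inside dir whose names match the glob.
    public static List<Path> findPdfs(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && isPdf(entry)) {
                    result.add(entry);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("USAGE: PdfFileFinder <directory>");
            return;
        }
        for (Path pdf : findPdfs(Paths.get(args[0]))) {
            System.out.println(pdf.toAbsolutePath());
        }
    }
}
```

Each returned path could then be converted with toFile() and passed through getWordsFromFile and updateDB in a loop.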
To save database space, you can assign an id to each file path (rather than storing the whole file path string for every word).
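
A minimal sketch of that idea, assuming two hypothetical tables that are not in the question: pdf_file(id AUTO_INCREMENT, path) and pdf_word(file_id, word). The path is inserted once, the generated id is read back through JDBC, and every word row stores only that id. The class and method names are made up for illustration, and the caller is expected to supply an open Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Set;

// Hypothetical sketch; table and column names are assumptions.
public class NormalizedIndexSketch {

    public static final String INSERT_FILE_SQL =
            "insert into pdf_file (path) values (?)";
    public static final String INSERT_WORD_SQL =
            "insert into pdf_word (file_id, word) values (?, ?)";

    // Stores the path once, then one (file_id, word) row per word.
    // Returns the id generated for the file row.
    public static long storeWords(Connection conn, String path, Set<String> words)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // file row and word rows succeed or fail together
        try {
            long fileId;
            try (PreparedStatement ps =
                     conn.prepareStatement(INSERT_FILE_SQL, Statement.RETURN_GENERATED_KEYS)) {
                ps.setString(1, path);
                ps.executeUpdate();
                try (ResultSet keys = ps.getGeneratedKeys()) {
                    if (!keys.next()) {
                        throw new SQLException("No generated id returned for " + path);
                    }
                    fileId = keys.getLong(1);
                }
            }
            try (PreparedStatement ps = conn.prepareStatement(INSERT_WORD_SQL)) {
                for (String word : words) {
                    ps.setLong(1, fileId);
                    ps.setString(2, word);
                    ps.addBatch(); // queue the row instead of sending it immediately
                }
                ps.executeBatch();
            }
            conn.commit();
            return fileId;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Batching the word inserts is a separate design choice from the id column; it is included here only because a PDF can easily yield thousands of rows.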
Hope this helps.
