Connecting a Hive database with PySpark on Windows, using Spyder and/or Jupyter Notebook

ntjbwcob · posted 2021-05-27 in Spark

I am trying to connect a Hive database with PySpark, using Spyder and/or Jupyter Notebook on my Windows machine, so that I can stream real-time data and create DataFrames on which I can run analytics, build visualizations, and create ML models.
What I have (values changed for organizational privacy): database host address: "localhost", port: "10001".
When researching on Google (GitHub, Stack Overflow, Towards Data Science, Medium), I found that all the connection examples use VMs, which I already know about, but I did not find any concrete way to connect PySpark locally. The code I have tried:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re

from os.path import join, abspath

from pyspark.sql import Row, SparkSession, HiveContext, SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import desc

# Setting up the warehouse location

warehouse_location = abspath('Should I put LocalHost (Host Address) here?')

# Where should I put the host and port ID so that I am connected to the Hive Database, this is all I found the best

spark = SparkSession \
    .builder \
    .appName("myAppName") \
    .config("what should I put here", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# Once the above setup is done, I can query the data directly as below,
# but the above setup is not happening

df = spark.sql("select * from database.tableName")
df.show()  # note: .show() prints rows and returns None, so keep the DataFrame separate

Any help would be greatly appreciated.
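For reference, a typical way to wire a remote Hive metastore into a `SparkSession` looks like the sketch below. This is not from the original post: `thrift://localhost:9083` is an assumed metastore URI (a standalone Hive metastore usually listens on 9083), and `database.tableName` is a placeholder. Note that port 10001 is commonly HiveServer2, which PySpark's Hive support does not connect to directly; if only HiveServer2 is exposed, reading over JDBC with the Hive JDBC driver is the usual alternative.

```python
# Minimal sketch, assuming a Hive metastore thrift service is reachable.
# The metastore URI below is an assumption -- adjust host/port to your setup.
from os.path import abspath
from pyspark.sql import SparkSession

# Local directory Spark uses for managed tables
warehouse_location = abspath('spark-warehouse')

spark = (
    SparkSession.builder
    .appName("myAppName")
    .config("spark.sql.warehouse.dir", warehouse_location)
    .config("hive.metastore.uris", "thrift://localhost:9083")  # assumed URI
    .enableHiveSupport()
    .getOrCreate()
)

# With Hive support enabled, tables registered in the metastore can be
# queried directly; the table name here is a placeholder.
df = spark.sql("select * from database.tableName")
df.show()
```

The key point is that `hive.metastore.uris` (not the warehouse path) is where the host and port of the remote Hive service go; `spark.sql.warehouse.dir` only controls where Spark writes its own managed tables.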
