Hive
Apache Hive consists of two major components: HiveServer2 and the Hive Metastore. HiveServer2 is a SQL engine that runs on the Hadoop ecosystem. The Hive Metastore is a catalog that stores database and table information.
Hive Metastore
The Metastore needs a relational database backend. I used an external PostgreSQL server (outside of the Kubernetes cluster). The official Hive Docker image doesn't contain the PostgreSQL driver, so to overcome this we need to build a custom Hive image.
Docker Image
First, download the required driver. If you want to use a different relational database, download the related driver instead. You can find the PostgreSQL and MySQL drivers in my repository.
curl -fLo files/postgres.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.4/postgresql-42.7.4.jar
Create a Dockerfile like the following.
Tip
The krb5-config and krb5-user packages are optional. If you want to run Kerberos client commands like kinit and klist inside the container, you should install these packages. We also create the hive user's home folder to avoid some errors.
FROM apache/hive:4.0.1
COPY files/postgres.jar /opt/hive/lib/postgres.jar
# Mysql: COPY files/mysql.jar /opt/hive/lib/mysql.jar
USER root
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get -qq -y install krb5-config krb5-user
RUN mkdir -p /home/hive
RUN chown -R hive:hive /home/hive
USER hive
Then build and push the image.
Tip
You should use your own custom docker registry or docker hub account.
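A minimal build-and-push sequence might look like this, assuming the private registry kube5:30123 referenced in the manifests below (substitute your own registry or Docker Hub account):

```shell
# Build the custom image containing the PostgreSQL driver
docker build -t kube5:30123/custom/hive:4.0.1 .

# Push it to the registry so cluster nodes can pull it
docker push kube5:30123/custom/hive:4.0.1
```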
Kerberos Settings
Hive needs the following principal for Kerberos authentication:
You should create a keytab file for the principal and deploy it as a Secret. You can use the ktutil tool to create the keytab.
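As a sketch, the keytab could be created interactively with ktutil and then deployed as the Secret referenced in the manifest below (the principal name and encryption type here are assumptions; adjust them for your KDC and realm):

```shell
# Inside the interactive ktutil session, run something like:
#   addent -password -p hive/metastore.company.bigdata.svc.cluster.local@HOMELDAP.ORG -k 1 -e aes256-cts-hmac-sha1-96
#   wkt hive.keytab
#   quit
ktutil

# Deploy the resulting keytab as a Secret (name matches the Pod manifest)
kubectl create secret generic keytab-hive -n bigdata \
  --from-file=hive.keytab=./hive.keytab
```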
HDFS Connection
Create required HDFS directories:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /warehouse
hdfs dfs -mkdir /warehouse/external
hdfs dfs -mkdir /warehouse/managed
hdfs dfs -mkdir -p /opt/hive/scratch_dir # you can change this path in hive-site.xml
hdfs dfs -chmod 777 /tmp
hdfs dfs -setfacl -R -m user:hive:rwx /warehouse
hdfs dfs -setfacl -R -m user:hive:rwx /opt/hive/scratch_dir
Add the following configuration to Hadoop's core-site.xml:
...
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.users</name>
  <value>*</value>
</property>
...
Deployment
Tip
As the Hive Metastore is a stateless service, you can deploy it as a Deployment.
Configs
Tip
The Hive server requires the YARN Kerberos principal to be set. Otherwise it returns a "Can't get Master Kerberos principal for use as renewer" error.
https://stackoverflow.com/a/69068935/3648566
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
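A minimal yarn-site.xml carrying that principal might look like the following sketch (the principal value is an assumption; use your ResourceManager's actual principal):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.principal</name>
    <value>yarn/_HOST@HOMELDAP.ORG</value>
  </property>
</configuration>
```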
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.company.bigdata.svc.cluster.local:9083</value>
  </property>
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/warehouse/managed</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.external.dir</name>
    <value>/warehouse/external</value>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.try.direct.sql.ddl</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.try.direct.sql</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.dml.events</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.notification.add.thrift.object</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.server.filter.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.authorization.storage.checks</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
  <!-- Kerberos -->
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/etc/security/keytabs/hive.keytab</value>
  </property>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/_HOST@HOMELDAP.ORG</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/_HOST@HOMELDAP.ORG</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/hiveserver.keytab</value>
  </property>
</configuration>
Deploying configurations:
kubectl create configmap hive-site-config -n bigdata --from-file=hive-site.xml=./configs/hive-site.xml --from-file=yarn-site.xml=./configs/yarn-site.xml
Manifest file
apiVersion: v1
kind: Pod
metadata:
  name: hive-metastore
  namespace: bigdata
  labels:
    name: hive-metastore
    app: hive-metastore
    dns: hdfs-subdomain
spec:
  hostname: metastore
  subdomain: company
  hostAliases:
    - ip: "192.168.1.52"
      hostnames:
        - "kdc.homeldap.org"
  containers:
    - name: hive-metastore
      image: kube5:30123/custom/hive:4.0.1
      imagePullPolicy: Always
      env:
        - name: SERVICE_OPTS
          value: "-Xmx1G -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://192.168.122.18:5432/hive -Djavax.jdo.option.ConnectionUserName=root -Djavax.jdo.option.ConnectionPassword=142536"
        - name: SERVICE_NAME
          value: metastore
        - name: DB_DRIVER
          value: postgres
        - name: VERBOSE
          value: 'true'
        - name: HIVE_CUSTOM_CONF_DIR
          value: /hive-configs
      resources:
        limits:
          memory: "1G"
          cpu: "1000m"
      ports:
        - containerPort: 9083
      volumeMounts:
        - name: hive-site-config
          mountPath: /hive-configs/hive-site.xml
          subPath: hive-site.xml
        - name: hive-site-config
          mountPath: /hive-configs/yarn-site.xml
          subPath: yarn-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/core-site.xml
          subPath: core-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/hdfs-site.xml
          subPath: hdfs-site.xml
        - name: keytab-hive
          mountPath: /etc/security/keytabs/
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
  volumes:
    - name: hive-site-config
      configMap:
        name: hive-site-config
    - name: hadoop-config
      configMap:
        name: hadoop-config
    - name: keytab-hive
      secret:
        secretName: keytab-hive
    - name: krb5conf
      configMap:
        name: krb5conf
---
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: bigdata
spec:
  type: ClusterIP
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
Ensure that you use a headless Service named the same as the subdomain value:
...
  labels:
    name: hive-metastore
    app: hive-metastore
    dns: hdfs-subdomain
spec:
  hostname: metastore
  subdomain: company
...
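The headless Service itself is not shown above; a sketch of it, named company to match the subdomain and selecting pods via the dns label, might look like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: company            # must equal the pods' subdomain value
  namespace: bigdata
spec:
  clusterIP: None          # headless: gives pods DNS records under the subdomain
  selector:
    dns: hdfs-subdomain
```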
Set the container image to our custom image that includes the related SQL driver. Setting imagePullPolicy to Always can be useful when testing the custom image. Database connection parameters are passed via the SERVICE_OPTS variable.
- name: SERVICE_OPTS
  value: "-Xmx1G -Djavax.jdo.option.ConnectionDriverName=org.postgresql.Driver -Djavax.jdo.option.ConnectionURL=jdbc:postgresql://<database_host>:<database_port>/<database_name> -Djavax.jdo.option.ConnectionUserName=<database_user_name> -Djavax.jdo.option.ConnectionPassword=<database_password>"
- name: DB_DRIVER
  value: postgres
Tip
We can set the maximum heap size using the -Xmx1G parameter.
While the SERVICE_NAME variable determines which Hive service will run, the HIVE_CUSTOM_CONF_DIR variable determines the location of the configuration files.
We mount the Hive configurations, krb5.conf, and the Hive keytab file. For HDFS support we should also mount the HDFS configuration files.
...
      volumeMounts:
        - name: hive-site-config
          mountPath: /hive-configs/hive-site.xml
          subPath: hive-site.xml
        - name: hive-site-config
          mountPath: /hive-configs/yarn-site.xml
          subPath: yarn-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/core-site.xml
          subPath: core-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/hdfs-site.xml
          subPath: hdfs-site.xml
        - name: keytab-hive
          mountPath: /etc/security/keytabs/
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
  volumes:
    - name: hive-site-config
      configMap:
        name: hive-site-config
    - name: hadoop-config
      configMap:
        name: hadoop-config
    - name: keytab-hive
      secret:
        secretName: keytab-hive
    - name: krb5conf
      configMap:
        name: krb5conf
...
Hive Server
I deployed a minimal, non-distributed Hive server. It is only used to test and control the Metastore.
Because we enabled Kerberos authentication for the Metastore, we need the following Hive server keytab and Kerberos settings:
Deploy the hive server keytab as a Secret, then create the hive server with the following manifest:
apiVersion: v1
kind: Pod
metadata:
  name: hive-server
  namespace: bigdata
  labels:
    app: hive-server
    dns: hdfs-subdomain
spec:
  hostname: hiveserver
  subdomain: company
  containers:
    - name: hive-server
      image: kube5:30123/custom/hive:4.0.1
      imagePullPolicy: Always
      env:
        - name: SERVICE_NAME
          value: hiveserver2
        - name: HIVE_CUSTOM_CONF_DIR
          value: /hive-configs
        - name: SERVICE_OPTS
          value: "-Dhive.metastore.uris=thrift://metastore.company.bigdata.svc.cluster.local:9083"
      resources:
        limits:
          memory: "2G"
          cpu: "1000m"
      ports:
        - containerPort: 10000
        - containerPort: 10002
      volumeMounts:
        - name: hive-site-config
          mountPath: /hive-configs/hive-site.xml
          subPath: hive-site.xml
        - name: hive-site-config
          mountPath: /hive-configs/yarn-site.xml
          subPath: yarn-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/core-site.xml
          subPath: core-site.xml
        - name: hadoop-config
          mountPath: /hive-configs/hdfs-site.xml
          subPath: hdfs-site.xml
        - name: keytab-hive-server
          mountPath: /etc/security/keytabs/
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
  volumes:
    - name: hive-site-config
      configMap:
        name: hive-site-config
    - name: hadoop-config
      configMap:
        name: hadoop-config
    - name: keytab-hive-server
      secret:
        secretName: keytab-hive-server
    - name: krb5conf
      configMap:
        name: krb5conf
---
apiVersion: v1
kind: Service
metadata:
  name: hive-server-svc
  namespace: bigdata
spec:
  type: ClusterIP
  selector:
    app: hive-server
  ports:
    - port: 10000
      targetPort: 10000
---
apiVersion: v1
kind: Service
metadata:
  name: hive-server-np
  namespace: bigdata
spec:
  type: NodePort
  selector:
    app: hive-server
  ports:
    - port: 10000
      name: server
      targetPort: 10000
      nodePort: 31000
    - port: 10002
      name: webui
      targetPort: 10002
      nodePort: 31002
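The keytab-hive-server Secret referenced in the manifest can be created from the hiveserver.keytab file, for example:

```shell
# Deploy the HiveServer2 keytab as the Secret the Pod mounts
kubectl create secret generic keytab-hive-server -n bigdata \
  --from-file=hiveserver.keytab=./hiveserver.keytab
```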
Set the following parameters to run the hive server:
- name: SERVICE_NAME
  value: hiveserver2
- name: HIVE_CUSTOM_CONF_DIR
  value: /hive-configs
- name: SERVICE_OPTS
  value: "-Dhive.metastore.uris=thrift://metastore.company.bigdata.svc.cluster.local:9083"
Deploy hive server:
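Assuming the manifest above is saved as hive-server.yaml (the filename is illustrative):

```shell
# Create the hive server Pod and its Services
kubectl apply -f hive-server.yaml
```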
To test our Metastore, we should first access the hive server CLI:
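One way is to open a shell inside the hive server pod (pod name and namespace taken from the manifest above):

```shell
kubectl exec -it hive-server -n bigdata -- bash
```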
Then we should obtain a Kerberos ticket and connect to the hive server using beeline:
kinit:
kinit -kt /etc/security/keytabs/hiveserver.keytab hive/hiveserver.company.bigdata.svc.cluster.local@HOMELDAP.ORG
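With a valid ticket, a beeline connection might then look like the following; the JDBC URL is assembled from the Pod's DNS name and the principal used in the kinit command above:

```shell
beeline -u "jdbc:hive2://hiveserver.company.bigdata.svc.cluster.local:10000/default;principal=hive/hiveserver.company.bigdata.svc.cluster.local@HOMELDAP.ORG"
```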