Big Data on Kubernetes

About This Document

This documentation is a collection of notes I compiled while learning to deploy big data services on Kubernetes.

Big Data Services

There are numerous big data tools available, but I selected one for each type of workload. For example, for a SQL engine, I chose Trino, and for a table catalog, I selected Apache Hive. My goal was to focus on widely used services. The targeted services covered in this documentation are listed below:

Kubernetes (RKE2) / Custom Docker Registry
HDFS (Storage)
Apache Hive 4 (Metastore/Catalog)
Apache Ranger (Trino, HDFS, Spark and Hive Policy Management/Audit)
Trino (SQL Engine)
Hue (Web-Based SQL Editor)
Spark Notebook - Jupyter Hub / Enterprise Gateway (Notebook)
Iceberg (Table Format)
Kerberos and LDAP Integration

The following services are used in the documentation but are not covered in detail:

Kerberos/LDAP Server: Refer to this guide and this documentation for setup instructions.
PostgreSQL Database Server: Backend database for services like Ranger, Hive, and Hue. It is assumed that the database server is hosted outside the Kubernetes cluster.
Solr: Used only for Ranger audit logs. GitHub Repository
Longhorn Storage: Installation Guide

Environment

Initially, I experimented with tools like KIND and Minikube. However, I found these tools more suitable for quick tests or single-service development rather than a full-stack big data environment. A full-stack setup requires extensive configuration for networking, storage, and container deployment. As a result, I opted to use virtual machines, which also provide a more realistic, production-like experience.

The virtual machines and host system used for this setup are as follows:

Host: Desktop PC with 12 CPUs, 48GB RAM, and 2x480GB SATA SSDs running Arch Linux.
Virtual Machines:
1 x Kubernetes Master (Rocky Linux 9.5): 4 CPUs, 6GB RAM, 60GB Disk
4 x Kubernetes Workers (Rocky Linux 9.5): 4 CPUs, 6GB RAM, 60GB Disk each
1 x LDAP/KDC Server (Ubuntu Server 24.04): 2 CPUs, 2GB RAM, 20GB Disk
1 x PostgreSQL Server (Rocky Linux 9.5): 2 CPUs, 2GB RAM, 20GB Disk

For virtualization, I used the libvirt and QEMU stack.

Although Kubernetes installation is not the primary focus of this documentation, I included a section on it for completeness. Rancher tools (e.g., RKE2, Rancher UI) simplify the process and provide a robust Kubernetes environment.

Limitations

I am still in the process of learning the services covered in this documentation. Therefore, the content may not be production-ready or fully optimized.

In most cases, I used NodePort services to expose applications outside the Kubernetes cluster. However, for production environments, you should consider using Ingress controllers, LoadBalancers, or high-availability tools such as HAProxy, Keepalived, MetalLB, or kube-vip.

For persistent storage, I primarily used hostPath to keep the Kubernetes environment minimal. In production, you should consider using a CSI driver for persistent storage, such as Longhorn, rook.io, or vSphere.