Building a Kubernetes Security Hardening Lab: A Practical Guide to Defense-in-Depth
How I built a production-grade Kubernetes security lab from scratch, implementing RBAC, Pod Security Standards, Network Policies, Falco runtime detection, and Trivy vulnerability scanning — including every mistake I made along the way.
Security in Kubernetes is not a single setting you toggle on. It’s a stack of overlapping controls — each one assuming the others might fail. This post walks through how I built a full Kubernetes security hardening lab from scratch, implementing six layers of defense and documenting every decision along the way.
The goal wasn’t just to make things work. It was to understand why each control exists, what attack it mitigates, and what happens when you take it away.
Why Kind Over Minikube
Before writing a single line of YAML, the first decision was the local Kubernetes environment. Minikube is the default answer, but I chose kind (Kubernetes in Docker) for three reasons:
Multi-node support. Minikube defaults to a single node, which makes Network Policy testing unrealistic. kind lets you spin up a proper control-plane + worker topology, giving you real node boundaries to test against.
Docker-native. kind nodes are Docker containers. The cluster is fully version-controlled as a YAML file, reproducible on any machine with Docker, and can run in CI. That matters for a security lab where you want to tear everything down and rebuild from scratch regularly.
kubeadm control. kind uses kubeadm under the hood, which means you can inject API server configuration directly — critical for enabling audit logging and custom admission plugins.
Phase 1: Cluster Setup and Audit Logging
The cluster setup turned out to be the hardest part of the entire project, and not for reasons I expected.
The goal was a 3-node kind cluster (1 control plane, 2 workers) with API server audit logging enabled. Audit logging captures every request to the Kubernetes API — who accessed what secret, who deleted which pod, what the response was. It’s your forensic trail.
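To make the idea concrete, here is a minimal audit policy sketch. This is an assumption for illustration, not the lab’s exact file: it records metadata for every Secret access, drops high-volume watch noise, and logs everything else at the Metadata level.

```yaml
# Hypothetical minimal audit policy (not the exact one from the lab).
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who touched Secrets: user, verb, object, response code.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Drop high-volume watch traffic from controllers and kubelets.
  - level: None
    verbs: ["watch"]
  # Everything else at Metadata level.
  - level: Metadata
```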
The Two-Layer Mount Problem
Getting audit logging working on Apple Silicon with Docker Desktop required solving a non-obvious problem: getting a file from your Mac into a running API server pod requires two separate mount operations.
The architecture looks like this:
```
Mac filesystem
  ↓ (extraMounts — kind level)
kind node container
  ↓ (extraVolumes — kubeadm level)
API server pod
```
Most documentation only mentions one of these layers. If you configure extraMounts without extraVolumes, the file exists on the node but the API server pod can’t see it. If you configure extraVolumes without extraMounts, the API server looks for a file that was never put there in the first place. Both cause the API server to crash on startup with a generic timeout error that gives no indication of the real cause.
The working configuration:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
nodes:
  - role: control-plane
    image: kindest/node:v1.29.2
    extraMounts:
      - hostPath: /path/to/audit-policy.yaml
        containerPath: /etc/kubernetes/audit-policy.yaml
        readOnly: true
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            audit-policy-file: /etc/kubernetes/audit-policy.yaml
            audit-log-path: /var/log/kubernetes/audit.log
          extraVolumes:
            - name: audit-policy
              hostPath: /etc/kubernetes/audit-policy.yaml
              mountPath: /etc/kubernetes/audit-policy.yaml
              readOnly: true
              pathType: File
```
We ultimately hit a hard compatibility wall between Kubernetes 1.33, Apple Silicon, and Docker Desktop’s virtualized kernel that made audit logging unreliable. We documented it, pinned to K8s 1.29.2, and moved on. Knowing why something doesn’t work and articulating it clearly is more valuable than hiding the gap.
Phase 2: RBAC
With a running cluster, the next layer is identity and access control. RBAC in Kubernetes is the difference between “anyone with a kubeconfig can do anything” and “every identity has exactly the permissions it needs.”
Three Roles, One Principle
I implemented three roles following least-privilege:
auditor — A ClusterRole with read-only access across all namespaces. Auditors need cluster-wide visibility to do their job, but they cannot modify anything.
developer — A Role scoped to the dev namespace only. Can create and manage pods, deployments, services, and configmaps. Secrets are explicitly excluded. Developers retrieve secrets through a secrets manager, not directly from the Kubernetes API.
namespace-admin — A Role with full access, but scoped to the staging namespace only. Cannot touch production, cluster-level resources, or other namespaces.
The secrets exclusion from the developer role deserves emphasis. A compromised developer kubeconfig should not give an attacker access to database credentials, API keys, or TLS certificates. By excluding secrets from the role entirely, you force secrets to flow through a proper secrets manager and limit the blast radius of any credential compromise.
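A Role along these lines implements that exclusion. The exact verb and resource lists here are assumptions based on the description above; the important part is what is absent: no "secrets" anywhere in the rules.

```yaml
# Sketch of the developer Role described above (resource list assumed).
# Secrets are deliberately omitted from every rule.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: dev
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```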
Certificate-Based Authentication
Users were authenticated via X.509 certificates signed by the cluster CA. This avoids the need for an external identity provider in a lab environment while demonstrating the full certificate workflow:
```shell
openssl genrsa -out developer1.key 2048
openssl req -new -key developer1.key \
  -out developer1.csr \
  -subj "/CN=developer1/O=dev-team"
openssl x509 -req -in developer1.csr \
  -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out developer1.crt -days 365
```
The CN field becomes the Kubernetes username. The O field becomes the group. In production you’d replace this with OIDC (Okta, Google Workspace) for centralized identity management.
Verifying It Works
The most important part of RBAC isn’t the YAML — it’s proving the controls actually enforce what you think they do:
```shell
kubectl auth can-i list pods -n dev --as developer1          # yes
kubectl auth can-i get secrets -n dev --as developer1        # no
kubectl auth can-i list pods -n production --as developer1   # no
kubectl auth can-i get nodes --as developer1                 # no
```
Every no is a threat mitigated.
Phase 3: Pod Security Standards
Pod Security Standards (PSS) replaced PodSecurityPolicy in Kubernetes 1.25. They’re enforced at the namespace level via labels, with three profiles: privileged (no restrictions), baseline (blocks obvious attacks), and restricted (full hardening).
I applied a progressive hardening model across namespaces:
- dev — warn-only on restricted (developers see warnings, aren’t blocked)
- staging — enforce baseline, warn on restricted
- production — enforce restricted, no exceptions
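Since PSS is applied via namespace labels, the production tier of that model reduces to a few lines of metadata. The label keys are the standard pod-security.kubernetes.io ones; pinning the version is my assumption, added to avoid behavior changes on cluster upgrades:

```yaml
# Pod Security Standards are enforced purely through namespace labels.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.29   # version pin is an assumption
    pod-security.kubernetes.io/warn: restricted
```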
The restricted profile requires a specific security context on every pod:
```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```
Each field maps to a specific attack:
| Field | Attack Mitigated |
|---|---|
| runAsNonRoot | Exploits that require root to succeed |
| readOnlyRootFilesystem | Attacker writing persistence files or modifying binaries |
| capabilities.drop: ALL | Abuse of Linux capabilities (CAP_NET_ADMIN, CAP_SYS_PTRACE, etc.) |
| allowPrivilegeEscalation: false | setuid binary exploitation |
| seccompProfile: RuntimeDefault | Restricts available syscalls, reduces kernel attack surface |
Getting nginx to run under these constraints required some work — the standard nginx:alpine image tries to bind to port 80 (requires NET_BIND_SERVICE), write to /var/cache/nginx at startup, and modify its config file. The solution was switching to port 8080, providing an nginx config via ConfigMap mounted with subPath, and using emptyDir volumes for the paths nginx needs to write to.
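That fix can be sketched as emptyDir mounts layered over the directories nginx writes to. The paths are the usual ones for nginx:alpine and the ConfigMap name is an assumption, so treat this as a template rather than the lab’s exact manifest:

```yaml
# Sketch: satisfy readOnlyRootFilesystem by giving nginx writable emptyDirs.
containers:
  - name: nginx
    image: nginx:alpine
    ports:
      - containerPort: 8080     # unprivileged port; no NET_BIND_SERVICE needed
    volumeMounts:
      - name: cache
        mountPath: /var/cache/nginx
      - name: run
        mountPath: /var/run
      - name: nginx-conf
        mountPath: /etc/nginx/nginx.conf
        subPath: nginx.conf     # config supplied via ConfigMap, mounted file-by-file
volumes:
  - name: cache
    emptyDir: {}
  - name: run
    emptyDir: {}
  - name: nginx-conf
    configMap:
      name: nginx-conf          # assumed ConfigMap name
```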
Phase 4: Network Policies
Network Policies are the zero-trust layer for pod-to-pod communication. Without them, every pod in your cluster can reach every other pod — a compromised frontend can talk directly to your database.
CNI Matters
This is a common trap: kind’s default CNI (kindnet) does not enforce NetworkPolicy. You can apply all the policies you want and they’ll be accepted by the API server, but nothing will actually be blocked. You need a CNI that enforces policy at the kernel level.
I replaced kindnet with Calico, which actually enforces NetworkPolicy in the kernel (via iptables in its default data plane, with an optional eBPF mode). This required disabling the default CNI at cluster creation time:
```yaml
networking:
  disableDefaultCNI: true
```
Default Deny Everything
The network security model starts with denying all traffic:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Then selectively re-allow only what’s needed. For a 3-tier app (frontend → API → database):
- frontend can reach API on port 8080
- API can reach database on port 8080
- All pods can reach DNS (port 53)
- Everything else is blocked
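The first of those allow rules might look like the policy below. The pod labels (app: frontend, app: api) are assumptions; the pattern is what matters — select the destination tier, then admit ingress only from the source tier on one port:

```yaml
# Hypothetical allow rule: frontend pods may reach the API tier on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api                # assumed label on API pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # assumed label on frontend pods
      ports:
        - protocol: TCP
          port: 8080
```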
Proving It Works
The verification output is the deliverable:
```
✓ frontend -> api: ALLOWED
✓ api -> database: ALLOWED
✓ frontend -> database: BLOCKED ← lateral movement prevented
✓ database -> api: BLOCKED
✓ database -> frontend: BLOCKED
```
That last column — BLOCKED — is the point. An attacker who compromises the frontend pod cannot reach the database directly, even though they’re in the same namespace, on the same cluster, potentially on the same node.
Phase 5: Falco Runtime Security
Prevention isn’t enough. An attacker with RCE can always run commands — you can’t prevent that. What you can do is ensure every attack is detected.
Falco watches system calls from inside your containers and fires alerts when it sees behavior that matches a rule. It runs as a DaemonSet, one pod per node, using eBPF probes attached to the kernel.
Writing Custom Rules
The default Falco rules are a good starting point, but writing custom rules demonstrates that you understand what you’re detecting and why:
```yaml
- rule: Shell Spawned in Container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh, dash, fish)
    and not proc.pname in (containerd, dockerd, runc)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
    image=%container.image.repository proc=%proc.name)
  priority: WARNING
  tags: [container, shell, mitre_execution]
```
I wrote four custom rules covering shell execution, sensitive file reads, package manager execution (attackers installing tools), and service account token reads.
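The sensitive-file-read rule could follow the same shape. This sketch is my reconstruction, not the lab’s exact rule; it leans on Falco’s built-in open_read and container macros from the default ruleset, and the file list is an assumption:

```yaml
# Hypothetical sensitive-file rule, relying on Falco's default open_read
# and container macros. File list is illustrative.
- rule: Sensitive File Read in Container
  condition: >
    open_read and container
    and fd.name in (/etc/shadow, /etc/passwd, /etc/sudoers)
  output: >
    Sensitive file read in container
    (user=%user.name file=%fd.name container=%container.name)
  priority: ERROR
  tags: [container, filesystem, mitre_credential_access]
```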
Attack Simulations
Running the simulations and watching alerts fire in real time is the most satisfying part of the project:
```shell
# Scenario 1 — shell in container
kubectl exec -n production $FRONTEND -- sh -c "echo attacker"
# → Falco: Warning Shell spawned in container

# Scenario 2 — credential harvesting
kubectl exec -n production $FRONTEND -- cat /etc/passwd
# → Falco: Error Sensitive file read in container (file=/etc/passwd)

# Scenario 3 — tool installation
kubectl exec -n production $FRONTEND -- apk --help
# → Falco: Error Package manager run in container (proc=apk)
```
Phase 6: Trivy Vulnerability Scanning
Falco catches attacks at runtime. Trivy catches them before they happen — at the point where a container image is built and pushed.
Trivy scans container images against the CVE database and reports vulnerabilities by severity. The CI integration is the critical piece: configuring exit-code: 1 on HIGH and CRITICAL findings means a vulnerable image fails the pipeline and never reaches the cluster.
```yaml
- name: Run Trivy on image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: nginx:alpine
    format: sarif
    output: trivy-results.sarif
    severity: HIGH,CRITICAL
    exit-code: '1'
```
Results upload to the GitHub Security tab as SARIF, giving you a browsable vulnerability report tied to each commit.
One gotcha on Apple Silicon: Trivy’s secret scanning causes timeouts when analyzing images. Disabling it with --scanners vuln and pre-pulling images with docker pull before scanning resolved the issue.
The Setup Script
One thing I invested time in early was a setup.sh that rebuilds the entire cluster from scratch in a single command. This paid off many times over — every time we hit a configuration issue that required a cluster rebuild, it was a matter of minutes rather than an hour of manual steps.
The script runs preflight checks (verifying kind, kubectl, helm, envsubst, and Docker are all available), creates the cluster, installs Calico, and applies all six phases in order. It also handles idempotency — running it against an existing cluster tears it down first.
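The preflight pattern itself is worth showing. This is a minimal sketch in the spirit of that script, not its actual contents; the tool list comes from the description above:

```shell
#!/usr/bin/env sh
# Sketch of the preflight pattern: verify each required CLI exists
# before touching the cluster. Tool list from the post; adapt as needed.
require() {
  command -v "$1" >/dev/null 2>&1
}

missing=""
for tool in kind kubectl helm envsubst docker; do
  require "$tool" || missing="$missing $tool"
done

if [ -n "$missing" ]; then
  echo "missing tools:$missing" >&2
else
  echo "preflight OK"
fi
```

Failing fast here turns "the cluster half-built and broke" into a one-line error before anything is created.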
This kind of operational thinking — “how do I make this reproducible?” — is worth demonstrating in a portfolio project.
What I Learned
Layered security is the point. No single control is sufficient. The service account token attack is mitigated by three independent controls — automountServiceAccountToken: false, minimal RBAC permissions on the default SA, and a Falco rule that alerts if a token is read. All three must fail simultaneously for the attack to succeed.
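The first of those three controls is a single field on the pod spec (it also exists on ServiceAccount objects); the pod and image names below are placeholders:

```yaml
# Disable service account token automount for pods that never call the API.
apiVersion: v1
kind: Pod
metadata:
  name: frontend              # placeholder name
  namespace: production
spec:
  automountServiceAccountToken: false
  containers:
    - name: app
      image: nginx:alpine     # placeholder image
```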
Document the gaps honestly. The audit logging incompatibility with Apple Silicon, the Falco metadata enrichment limitation, the missing image signing — these are all documented in the security baseline as known gaps with accepted risk and remediation paths. Senior engineers respect honesty about what’s not solved far more than they respect pretending everything is perfect.
Verification output is the deliverable. For every control, the most important thing is proving it works. The kubectl auth can-i output for RBAC, the network policy test results, the Falco alert captures — these are the artifacts that demonstrate the controls actually enforce what you claim.
The environment matters. A significant amount of time in this project was spent working around Docker Desktop + Apple Silicon + kind compatibility issues. In a production environment on real Linux nodes, several things that required workarounds here (audit logging, Falco metadata, Trivy image scanning) would just work. Knowing the difference between “this is broken in my environment” and “this is fundamentally broken” is a valuable skill in itself.
The Full Stack
| Layer | Control | What It Prevents |
|---|---|---|
| Image | Trivy | Vulnerable images reaching the cluster |
| Cluster | RBAC | Unauthorized API access |
| Cluster | Audit logging | Undetected API activity |
| Namespace | Pod Security Standards | Privileged containers, root processes, host access |
| Namespace | ResourceQuota + LimitRange | Resource exhaustion DoS |
| Network | NetworkPolicy + Calico | Lateral movement between pods |
| Runtime | Falco | Post-exploitation activity detection |
Each layer assumes the layer above it might fail. That’s what defense-in-depth means in practice.
Repository
The full project is available on GitHub, including all YAML manifests, the setup script, the attack simulation script, and the full documentation set (threat model, RBAC matrix, attack scenarios, security baseline).
If you’re building something similar and hit the Apple Silicon + kind audit logging issue, the two-layer mount pattern described in Phase 1 is the fix. Save yourself the hours.
