Homelab server
Introduction
My notes for the process of building and setting up a homelab server for self-hosting various applications. It started as an effort to capture some of the manual steps. Over time, however, I replaced many of these with an Ansible playbook, which can be found on my GitHub.
1. Hardware
The /r/homelab subreddit has great hardware guides, both for off-the-shelf options and for building a custom server. Some notable off-the-shelf options are combinations of a NAS and a small form-factor PC, like Intel NUCs or pre-built mini-tower HP, Dell, or Lenovo servers. For a moment I was seriously considering the NAS Killer 2.0 from reddit user JDM_WAAAT, however the form factor and relatively high power consumption were a deal-breaker.
I ended up basing my build around the iconic Fractal Design Node 304 mini-ITX case. The Intel Pentium G4560 had a pretty good power specification. Having an Nvidia GPU with CUDA support was also a requirement: from time to time I do projects or courses involving deep learning, and in the past I used to spend quite a bit on renting GPU instances. I managed to get a really small refurbished GTX 650 from Zotac on Black Friday for about $50.
The build process was relatively straightforward. The motherboard PSU cable turned out to occupy most of the free space inside the Node 304. Currently I have only one SSD and two mirrored HDDs, so I removed one of the HDD bays and everything fits OK. However, I will likely be ordering custom-length PSU cables to fit all six 3.5″ HDDs in the future; several companies offer those, e.g. CableMod. I also had to replace the included 140mm fan with a Noctua one, as Fractal's was quite noisy. Now the server sits right under my TV set in the living room and no one can hear that it is on.
2. Setting up the host (“setup-server” role)
After using Proxmox for a month with Linux containers, I decided to switch to Ubuntu Server 18.04 LTS with a mix of LXD and Docker.
Ubuntu partitions, assuming a 256GB SSD:
- 182G — /
- 200MB — /boot
- 32G — ext4, ZFS cache (not mounted)
- 8G — ext4, ZFS log (not mounted)
- 16G — Linux swap
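Attaching those SSD partitions to an existing pool is a one-time operation. A minimal sketch, assuming the pool is called tank and the cache/log partitions are sda4 and sda5 (both names are assumptions):

# add the 32G partition as L2ARC cache and the 8G partition as SLOG
sudo zpool add tank cache /dev/sda4
sudo zpool add tank log /dev/sda5
# verify the new layout
zpool status tank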
Implemented ansible tasks:
- connect to an existing ZFS pool and set up cache/log
- configure the ZFS event daemon (see the zed.rc sketch after this list)
- install and configure email (SMTP only)
- install and configure smartmontools
- install and configure Docker
- install and configure LXD with public IPs
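Configuring ZED boils down to a few variables in /etc/zfs/zed.rc so that pool events trigger an email; a sketch (the values below are assumptions):

# /etc/zfs/zed.rc — notify by email on pool events
ZED_EMAIL_ADDR="root@localhost"
ZED_NOTIFY_INTERVAL_SECS=3600
ZED_NOTIFY_VERBOSE=1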
References:
- Proxmox installation
- ZED setup
- Telegraf SMART plugin
- S.M.A.R.T. / smartmontools
- Docker on Ubuntu
- CUDA on Ubuntu
TODO
- unattended upgrades
- telegraf smartctl
- schedule bi-monthly scrubs
- docker remote api certs
3. Setting up applications (“setup-docker” role)
List of tasks/applications:
- Plex
- Filebrowser
- Transmission with OpenVPN
- Grafana
- Watchtower
- Traefik
- Portainer
All containers are launched using Ansible directly and connected to a bridge network to be exposed via Traefik. This stack of applications is not expected to be exposed on the external network; Traefik is used mostly to simplify navigation by using “app.local” host names.
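As an illustration, the docker-equivalent of what Ansible does for one of these apps might look like the sketch below (the network name “web” and the Traefik v1 label syntax are assumptions):

# shared bridge network that traefik watches
docker network create web
# example: filebrowser exposed at files.app.local
docker run -d --name filebrowser --network web \
  -v /mnt/files:/srv \
  -l traefik.enable=true \
  -l traefik.frontend.rule=Host:files.app.local \
  -l traefik.port=80 \
  filebrowser/filebrowser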
7. Setting up web server - OLD NOTES
Running a Node.js web server in an Alpine LXC. The setup steps are the following. Update apk and install openssh and sudo. Add a sudo user (visudo) for SSH access. Install Node with apk add nodejs nodejs-npm and pm2 with npm install -g pm2. Create a non-sudo www user. Mount your web app folder from the Proxmox host zpool and chmod it to 777 to allow read/write for everyone. Generate pm2 startup scripts for the www user. Launch all Node.js apps and save them with pm2 to persist; pm2 will also be responsible for restarting apps.
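The pm2 part of this might look like the following sketch (the app entry point and name are assumptions; the commands themselves are pm2's documented ones):

# generate a boot startup script that resurrects the www user's apps
sudo pm2 startup -u www --hp /home/www
# as the www user: launch the app and persist the process list
pm2 start /var/www/e2pe/app.js --name e2pe
pm2 save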
For the SSL setup, install certbot. The Node.js app needs to be refactored to serve a static folder, e.g. public, to allow certbot to create challenges. After this the certificate can be generated by:
sudo certbot certonly --webroot -w /var/www/e2pe/public -d e2.pe --config-dir /var/www/certificates
It will create the certificates in the host folder, so they can be shared with the reverse proxy.
Then create a root cronjob for daily renewal attempts: run sudo crontab -e and add certbot renew --config-dir /var/www/certificates.
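The resulting crontab entry might look like this (the time of day is my assumption):

# attempt certificate renewal daily at 04:30
30 4 * * * certbot renew --config-dir /var/www/certificates --quiet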
This tutorial has some other details on HSTS and DH options. Create one stronger DH parameter file shared between the nginx reverse proxy and every subdomain or app.
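Generating those shared DH parameters is a single command; a sketch, assuming a 4096-bit size and that the file lives next to the certificates:

sudo openssl dhparam -out /var/www/certificates/dhparam.pem 4096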
One such LXC can take care of multiple apps; just assign different ports so that the nginx reverse proxy can route requests correspondingly.
8. Python / Deep Learning - OLD NOTES
This guide has a lot of good information, but needs a few adjustments.
First, it needs to be installed on the host. Use the official guide and get the proper kernel headers first. I ended up installing with apt-get install -t stretch-backports nvidia-cuda-toolkit from deb http://httpredir.debian.org/debian stretch-backports main contrib non-free, as per this topic. Make sure to update stretch to the correct Debian/Proxmox version. Verify the installation with cat /proc/driver/nvidia/version and nvcc -V.
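Put together, the host-side sequence might look like this sketch (the pve-headers package name is an assumption for a Proxmox host; adjust stretch to your release):

# kernel headers so the driver module can be built
apt-get install pve-headers-$(uname -r)
# enable backports and install the toolkit + driver
echo "deb http://httpredir.debian.org/debian stretch-backports main contrib non-free" > /etc/apt/sources.list.d/backports.list
apt-get update
apt-get install -t stretch-backports nvidia-cuda-toolkit
# verify
cat /proc/driver/nvidia/version
nvcc -V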
Enable start at boot as per the Nvidia guide: create a file in /etc/init.d and add it to the startup scripts.
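A minimal sketch of such a script, loosely based on the device-node creation from Nvidia's guide (the script name and the single-GPU setup are my assumptions):

#!/bin/sh
# hypothetical /etc/init.d/nvidia-devices: load the driver and create
# the device nodes at boot so they can be shared with containers
/sbin/modprobe nvidia
# major number 195 is the Nvidia character device; one GPU assumed
mknod -m 666 /dev/nvidia0 c 195 0
mknod -m 666 /dev/nvidiactl c 195 255

Then register it with update-rc.d nvidia-devices defaults.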
Share the GPU with the LXC as per the guide linked at the beginning. Then install nvidia-cuda-toolkit inside the container, same as on the host. Verify that nvidia-smi and nvcc -V give the same versions as on the host. Download and install the cuDNN deb package; look at the archived section in case you need to match the CUDA version exactly.
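The sharing itself usually amounts to a few lines in the container's LXC config; a sketch following the standard approach (the config path with <id> is a placeholder):

# /etc/pve/lxc/<id>.conf: allow Nvidia char devices and bind the nodes
lxc.cgroup.devices.allow: c 195:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file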
I will use TensorFlow as the computing backend. TensorFlow recommends (or requires?) a GPU “compute capability” of at least 3.5. Unfortunately my refurbished GTX 650 has 3.0, but some posts on Stack Overflow indicate that it still works with 3.0. I didn't bother with compiling TensorFlow and just used pyenv to install Miniconda. Miniconda in turn provides a pre-compiled TF with GPU support via conda install -c anaconda tensorflow-gpu, which worked out of the box.
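The sequence was roughly the following sketch (the exact Miniconda version available in pyenv is an assumption):

# install miniconda through pyenv, then the pre-built GPU TensorFlow
pyenv install miniconda3-4.3.30
pyenv global miniconda3-4.3.30
conda install -c anaconda tensorflow-gpu
# quick sanity check that TF sees the GPU
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"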
Keras MNIST CNN example output:
2018-12-24 20:51:10.840424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 650 major: 3 minor: 0 memoryClockRate(GHz): 1.0585
pciBusID: 0000:01:00.0
totalMemory: 978.12MiB freeMemory: 954.25MiB
2018-12-24 20:51:10.840468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-24 20:51:11.156234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-24 20:51:11.156289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-24 20:51:11.156298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-24 20:51:11.156526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 683 MB memory) -> physical GPU (device: 0, name: GeForce GTX 650, pci bus id: 0000:01:00.0, compute capability: 3.0)
60000/60000 [==============================] - 34s 565us/step - loss: 0.2767 - acc: 0.9152 - val_loss: 0.0588 - val_acc: 0.9813
Epoch 2/12
60000/60000 [==============================] - 31s 519us/step - loss: 0.0902 - acc: 0.9737 - val_loss: 0.0472 - val_acc: 0.9842
Test loss: 0.024805454400579036
Test accuracy: 0.9917
According to Grafana it took about 6 minutes of GPU time, with the temperature rising up to 53 degC. The average epoch time is 31s, compared to an average of 142s on my mid-2015 13″ MacBook Pro with ~330% CPU load and the associated noise and heat, so not bad for a $50 card.
9. Backup - OLD NOTES
Use the homelab machine as a Borg backup server for all computers inside the network, as well as for itself. Then I use rclone to sync the deduplicated Borg repository to B2. How to daemonize rclone.
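A minimal sketch of this flow (the repository path, archive naming, and bucket name are assumptions):

# on a client: back up the home directory to the homelab borg server
borg create --stats borg@homelab:/tank/backups/laptop::'{hostname}-{now}' ~
# on the server: mirror the deduplicated repositories to Backblaze B2
rclone sync /tank/backups b2:homelab-backups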