BSD for Scientific Computing
Abstract
Introduction
This article is part of my promised ‘BSD in Production’ series. It’s a demo of the BSD stack and how I use it for Machine Learning at work. The audience is UNIX folks, working data scientists, and DevOps engineers. I’ll unpack a little of the history of BSD, as well as trends in software used for scientific programming. I’ll explore scientific software in detail and review benchmarking comparisons that have been done between BSD and Linux (as well as virtual machines and container technology). I’ll finish by exploring custom benchmarks run on BSD and Linux with data science workflows in R/Python as I make my case for using BSD for scientific computing.
What is BSD?
BSD (Berkeley Software Distribution) is a free variant of Research UNIX that was initially developed in the 1980s (3BSD) by computer scientists from around the country, coordinated by researchers at the University of California, Berkeley. It is an operating system with a storied history that predates GNU/Linux by about a decade, and it has developed a small (but mighty) cult following among systems administrators and software engineers. Its development is now, and has always been, championed by a community of researchers and users rather than a single company or business consortium. The reason for this is its licensing: BSD is covered by one of the most permissive open source licenses in the industry. You can use it, modify it, and even redistribute derivatives (without the source code) for anything, and for free. The BSD kernel and operating system exist in several variants, including NetBSD, FreeBSD, OpenBSD, DragonFly BSD, HardenedBSD, et al. These are not separate distributions, as with GNU/Linux. All of the BSDs out there share a lineage with the last release of BSD, 4.4-Lite2 [1], by UC Berkeley, and the BSD developer community typically shares source code freely. They even organize their conferences (e.g., BSDCan, EuroBSDCon) and user groups together. The variants differ in how kernels and drivers (as well as default applications) have changed since the 4.4-Lite2 release. Some really focus on security (e.g., HardenedBSD and OpenBSD), some on performance (DragonFly BSD), and others on portability (NetBSD). In this article, when I say BSD, I’m referring to FreeBSD. It’s the most widely used and active of the BSD variants. But know that there are others out there doing important work. For a full history of BSD, please see Marshal et al., 2015 and the FreeBSD Handbook.
BSD usage [1] appears consistent with your average GNU/Linux distribution (e.g., Fedora, Debian). Notably, other BSD variants are not tracked by Stack Overflow. BSD has never been a wildly popular choice of operating system, but it’s popular enough to get support from seasoned systems administrators and developers when you need it. And they are a vocal group in the UNIX world.
Why should I use it?
In a world dominated by enterprise Linux, why should you take a chance on BSD? The reasons for using BSD are varied, but after a cursory survey of the literature ([1,2,3]), the factors driving the use of BSD in industry are its:
- simplicity
- configurability
- stability
- performance
- compiler support and tooling
- permissive licensing
A base install of FreeBSD typically includes a ZFS filesystem, kernel tunables for running virtual machines, clang/llvm and make, a few UNIX shells, and a few network services (like SSH). The system is designed so that you can build on the lightweight base install using a vast collection of software ports [1], or even tinker with the kernel configuration or base software (“world” [1]) using clang/llvm/make. The ports system is smart enough to bootstrap compilers to build very complex software from source, out of the box. The ports collection itself is maintained by about 100 developers (with commit traffic growing exponentially since 1999 [1]). You can either install their official binary builds using the pkg command or build from source from the ports tree directly [1]. This is a departure from the experience of some desktop GNU/Linux distributions like Canonical’s Ubuntu, where advertisements are served up in the shell and the terminal is not available unless you go digging for it. Software engineers and tinkerers are treated as first-class citizens in BSD, and it is assumed that you know what you are doing. In return, you get a mature UNIX environment that is tuned for resource-intensive server workloads out of the box (and for free).
Performance
We will unpack performance related to Machine Learning work as executed from containers and virtual machines relative to baseline bare-metal scoring in some detail later in this paper. The “BSD is performant” argument I usually make is that the BSD developer community makes design decisions that emphasize the simplicity and configurability of all software distributed through ports. When you install software, it comes pre-configured with sane defaults. Often these defaults make for high-performance deployments of web servers and databases; sometimes they do not, and you need to tune the software build process used in ports, the filesystems you are deploying on, or the configuration of the service post-install. In practice, most performance benchmarking places BSD operating system variants slightly below the performance of Linux on bare-metal servers [1]. This includes performance-optimized BSD (DragonFly BSD). This comes from 2021 work by Michael Larabel [1], a core Phoronix contributor and long-time Linux software developer. The reasons for the performance differences between Linux and BSD are not discussed at length in Larabel’s work. I suspect that for performance-optimized Linux distributions (e.g., Clear Linux), the performance difference is largely related to the compiler optimizations put in place when building software. Intel’s work on Clear Linux probably sets a high-water mark for speed on bare-metal servers, as well as VMs [1]. It would be difficult to optimize the software build process, or the various kernel optimizations made for running Clear Linux as a VM guest, more thoroughly than what Intel has done. In this paper, we will try to replicate Larabel’s 2021 work, but expand the benchmarking from bare metal through virtual machines (bhyve) and containers.
Containerization
FreeBSD has had containerization technology implemented in its kernel since 1999 (FreeBSD 4.0; Poul-Henning Kamp [1]). It’s been around so long that it pre-dates the word “containers” (it used to be referred to as OS-level virtualization; [1, 2]). Since their inception, containers in FreeBSD have colloquially been called ‘jails’, because security was an early focus of container development. Folks believed that partitioning complex software into units that were isolated from the kernel and host operating system offered security benefits (in the same way that a VM offers isolation between host and guest) [1]. An early focus of container development was trying to improve resource sharing and efficiency between the host operating system’s kernel and the guest containers (i.e., to beat traditional VMs on performance), while maintaining some degree of isolation and a good security posture. More recently, experimental OCI-compatible container support has been added to FreeBSD through Buildah and Podman (FreeBSD 13.0; Doug Rabson [1, 2]). So, FreeBSD is incrementally improving its containerization offerings, and we should expect that OCI container technology and orchestration will play a bigger role in FreeBSD usage in the future. Jails and OCI containers both handily outperform Kernel Virtual Machine technology [1]. It’s been reported in the literature that Docker containers (on Linux) tend to marginally outperform FreeBSD jails in terms of CPU event handling (eps) and memory I/O, and strongly outperform jails for disk reads [1].
I suspect that Ryding & Johansson were using the default ZFS configuration created by the FreeBSD installer, which essentially creates a single pool (called zroot) with no mirroring or software RAID functionality. And based on their read speeds, it looks like these tests were run on (platter) HDDs. Most users (including developers) new to FreeBSD will probably use a similar configuration in their deployments, not having exposure to ZFS’s software RAID or mirroring functionality. Interestingly, the performance differences they highlight really only emerge when running large numbers of containers in either technology. It’s possible that routine benchmarking may not have captured these performance differences. However, I have not been able to reproduce Ryding & Johansson’s results using third-party benchmarking tools. In my experience, FreeBSD containers are performant and more stable than OCI containers deployed on similar hardware. As Klara Systems researchers have pointed out in their work benchmarking FreeBSD in production, you should always test with workloads that accurately model your real workloads [1]. To this end, we will unpack performance differences between OCI containers and jails (accounting for the number of active containers and threading within containers) for data-intensive workflows in this paper. We will use one version of a compute-intensive workflow that makes successive high-intensity reads from disk (SQLite) and another that makes high-intensity reads from a dedicated PostgreSQL database over Cat 5 Ethernet.
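The on-disk half of that workload can be sketched in a few lines of Python. This is a minimal illustration of the idea (timing successive full-table scans against a SQLite file), not the actual benchmark harness; the table and file names here are hypothetical stand-ins.

```python
import sqlite3
import time


def time_sqlite_reads(db_path: str, table: str, n_reads: int = 10) -> float:
    """Time n_reads successive full-table scans against a SQLite file."""
    start = time.perf_counter()
    with sqlite3.connect(db_path) as conn:
        for _ in range(n_reads):
            # fetchall() forces the rows to actually be read off disk
            conn.execute(f"SELECT * FROM {table}").fetchall()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Build a small throwaway database so the sketch is self-contained
    with sqlite3.connect("/tmp/bench.sqlite") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS t (x REAL, y REAL)")
        conn.executemany(
            "INSERT INTO t VALUES (?, ?)", [(i, i * 2.0) for i in range(1000)]
        )
    print(f"elapsed: {time_sqlite_reads('/tmp/bench.sqlite', 't'):.4f}s")
```

Run inside a jail and inside a Podman container against the same dataset, wall-clock deltas from a loop like this surface exactly the disk-read differences Ryding & Johansson reported.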
Breaking out of Walled Gardens
As scientists, why should we care about any of this? Our stack usually sits at least one layer of abstraction above operating systems. We concern ourselves with high-level languages like Julia, S, BUGS, and Python, and with massively distributed computing in the cloud. We sometimes care about graphics cards or CPUs/RAM. But that’s it. So, why should we entertain BSD? The answer is escaping vendor lock-in and walled gardens. The software stacks we use in the machine learning and data science world are complex and varied. The datasets we use, particularly for things like Large Language Models, are even more so. As a profession, we are seeing greater and greater consolidation of computing resources, software, and engineering prowess into a few large companies. You know who they are. And we have come to rely on them for everything. This is to the detriment of our profession, but also to humanity. We should be leveling the playing field in machine learning so that anyone in the world can leverage the technology and make use of it, from the algorithms used for inference down to the low-level software that these systems are built on, and for free.
Large technology companies have always been walled gardens. Their business model is to optimize technology to their own ends in order to increase profitability and returns for their shareholders. If we continue to contribute our research and work on AI under the umbrella of FANG companies, we will end up chained to expensive and proprietary platforms that mostly serve the financial interests of those companies’ investors. This is a trap, and anathema to the open spirit of science. But it is one we can break out of by giving our work away. I promise you that you’ll grow as an engineer by thinking about the operating systems and languages that your applications run on. All the more so if your work enables others to develop and redistribute their own derivative models and software as they see fit (the 2-clause BSD license).
Popular Frameworks for Scientific Computing
The analysis portion of this paper is structured like a white paper. I’ll make an inventory of the predominant technologies in the sciences that I’ve used with teams over the last decade, emphasizing what I see used now (and why). I’ll then discuss best practices for how I implement them in FreeBSD. I’m going to use long-run historical data from Stack Exchange to discuss these things. As with operating system usage statistics and programming language popularity as represented by TIOBE, there are caveats. Some of these technologies are, like BSD, backed by teams that are small but mighty. But I’ll use the data to prioritize how languages and frameworks for inference are implemented in FreeBSD.
We will start with programming languages. Python has come to dominate the IT world, not just in data science, but as a universal language for all things backend. This is partly to my dismay, as my preference for some services (particularly for geospatial modeling work) would actually be Rust / C / C++ (for the memory efficiency and speed). But most organizations have not hired or invested in development teams that excel at systems-level programming languages. It’s faster/cheaper to write everything in an interpreted language like Python and throw more cloud-computing resources at the problem than to take the slow, methodical road of debugging compiled programs. I see Scala used for data engineering work in some shops, though its use is on the wane, in my opinion. It was very popular for data work with Apache Spark. Spark’s usage has been pretty consistent since about 2018 (it’s about as popular on SO as questions related to ’tensorflow’), though in my opinion it has probably reached market saturation and the world has moved on to Python/Kubernetes. Apache Spark is available out of the box in FreeBSD ports, though its commit history over the last two years has been anemic [1]. Scala (current) is also available in ports and is dutifully maintained [1, 2]. Most software engineering teams that I work with are split into frontend (JavaScript/TypeScript) and backend (Python/SQL/R) roles. I encounter full-stack engineers, but in my experience this is usually a frontend developer who knows a little Python. For machine learning and scientific workflows, I almost universally see machine learning engineers and data engineers use Python/R/SQL. And occasionally Stan (for shops doing Bayesian inference), which is also dutifully maintained in ports [1].
First, a word on R. R is a popular language for traditional statistical modeling and data preparation work. If you go to graduate school in the US and learn statistics, everyone uses R for their work. It’s nice because it is an old language that is well-supported. There are principal investigators in research groups doing impressive work in parametric statistics, climate modeling, hydrology, and remote sensing who do all of their work in R. They train their graduate students (and post-docs) to do everything in R, so that it’s easy to share code with collaborators. And when these folks get their degrees and leave, they bring R with them into industry. This can be jarring on entry to data science teams, because most software engineers have not used R. They think it’s single-purposed (i.e., it can only be used for data work; nobody writes non-statistical software in R), does not scale well to production workflows, and is difficult to deploy outside personal workstations.
I tend to agree with some of these criticisms, but the market penetration by R in the sciences is undeniable [1]. R is probably as widely used as Python for scientific workflows. It appears to have supplanted MATLAB / Octave, and it dwarfs the questions asked on SO related to other common tooling in scientific computing. For what it’s worth, R’s interpreter is written in C (with many of its numerical routines in Fortran), and it has direct hooks (via the Rcpp package) that let developers implement performance-critical code in C++ while using R data structures and abstractions. So it can be surprisingly fast. I’ll touch on Julia, Octave, PyTorch, CUDA, and OpenCL (and ROCm), but the bulk of my demo will focus on Python/R workflows on FreeBSD. GPU programming on BSD is worthy of its own paper (or several), so I won’t dwell on native CUDA programming here.
Lastly, a word on data [1]. SQL still dominates the data world. In most shops that I’ve worked in, databases are driven by PostgreSQL. I encounter MySQL, but mostly as a backend for web development. My work in ML deals with a lot of geospatial modeling, and PostgreSQL excels at dealing with high-dimensional geospatial data (through the PostGIS extensions). PostgreSQL (with PostGIS, connection pooling, etc.) is very easy to deploy and manage on FreeBSD; I’ll touch on it briefly. I’ll touch on the so-called NoSQL alternatives like HBase and Redis as well. I don’t see a lot of data work done on key-value stores, but I do see some; for example, Redis used as a backend for job queues in Kubernetes. The promise of NoSQL databases is that they can scale out (i.e., with clusters) in ways that traditional SQL servers cannot; traditional databases are much easier to scale up. For very large datasets, being able to scale out is a real asset. That said, NoSQL will be a conversation for another post, though Redis [1] and MongoDB [1] exist in ports and are very actively maintained.
Environment Management
In industry, Machine Learning work in R / Python is typically implemented using best practices from the Software Development Life Cycle (SDLC); work is cordoned off into isolated environments. I usually see development, staging, and production used. Often there is some flexibility in how developers manage their local development environments, but the end goal should always be to lift-and-shift their work so that it can be easily deployed in staging/production. Although isolated from one another, staging and production environments are typically very similar (if not identically implemented). Virtual machines filled this niche for a while (recall HashiCorp’s Vagrantfiles). OCI containers are the vehicle for doing this in almost every environment I’ve worked in recently, a change largely owing to containers’ composability and reproducibility. They are easier to create and share than virtual machines ever were. And increasingly, the framework for managing and orchestrating many containers in production is Kubernetes.
Though jails pre-date Docker containers (and container runtimes, like Podman), enterprise Linux has experienced a Cambrian explosion of technologies related to containerization that BSD has not seen. We are getting there. As mentioned, Podman exists in FreeBSD ports. It uses Red Hat’s container runtime to make jails into OCI-compatible containers with Docker-like functionality. And although the technology is promising, it is still in its early stages of development. Maintaining Podman containers for FreeBSD is closely tied to the version of FreeBSD you are running (as well as the CPU architecture; only arm64 and x86-64 jails are supported). This isn’t a consideration for Linux containers. In Linux, you are really only interested in the CPU architecture a maintainer used to build a container (e.g., x86-64, aarch64, armv7). Even the distribution used to build the container is moot in Linux; the container runtimes can run containers regardless of the distribution they were made for. In Linux, we often use orchestration frameworks like Portainer or Kubernetes to manage containers. In FreeBSD, there’s HashiCorp’s Nomad and Pot, but these are nowhere near as robust as Kubernetes. There’s also no thriving third-party ecosystem building on Pot or Nomad and providing an analog to Kubernetes operators or Helm charts. Indeed, these technologies may never exist in FreeBSD. Many UNIX systems administrators consider them a nuisance, and the thought of having to manage them in FreeBSD might keep them up at night.
So, where does this leave us? What’s a software engineer working on FreeBSD to do? I encourage developers to use FreeBSD for what it’s good at: bare-metal UNIX services (like databases and file servers), virtual machines (which we will get to in a second), and jails. And I encourage developers to use Linux for what it’s good at: containerized workloads, container orchestration, and hardware (e.g., GPU) support. I tend to mix and match Linux and BSD in my work, which is what we are going to explore in this paper.
Methods
Virtual Machines
FreeBSD ships with a virtualization hypervisor called bhyve. It’s included in the world build process as a component of the FreeBSD base operating system. It’s written in C, has direct links to the FreeBSD kernel (i.e., it operates as a kernel-backed virtual machine manager), and has been under active development since 2011 ([1, 2]). If you’ve worked with VirtualBox or VMware Fusion before, this is not that. Bhyve is opinionated about implementation and is more in keeping with VMware Workstation or Hyper-V, if you’ve ever worked with them.
Conda-forge
R and Python: reproducibility, collaboration, version-pinning
Ports
What Elements of the Data Science Stack Are Available Natively on BSD?
Architecture : Environments
The development environment should closely resemble staging / production. It can run entirely on BSD technology, but all libraries and code need to be portable, such that the workflow can be lifted-and-shifted to OCI containers and Kubernetes as our code matures. I am going to use a Python program I wrote that merges giant tables on a SQL database and constructs a standard view, with geometry data included in the view, for fast geospatial rendering. For testing GNU R, we are going to use an R script that fits a vanilla Generalized Linear Model to corn grain yield given mean annual precipitation and temperature across the continental United States. We are going to use pure FreeBSD for our development environment. We will use its native jails containerization platform and the version of Python the port maintainers keep up-to-date for us in the FreeBSD ports collection. We will then use Python’s standard-library venv module to create a sandboxed environment from the system install of Python. We will record the versions of the various packages we use so we can build actual OCI containers that can be deployed in staging. To build OCI containers that are compatible with Kubernetes (where all containers typically end up), I will use FreeBSD’s bhyve hypervisor to create a Debian VM that is resourced like an e2-medium instance on Google Cloud Platform. This way, when I run my test workloads in staging, I will be able to catch issues related to memory and CPU utilization before they land in production. I will use Podman on Debian to build my containers, and I will use Anaconda (Miniforge) with the conda-forge repository to manage my Python dependencies. Anaconda will be able to rebuild the exact versions of the Python dependencies we use in FreeBSD.
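The R/Python modeling workload itself is simple; in spirit it is k-fold cross-validation of an ordinary least-squares fit (a Gaussian-family GLM with identity link). A minimal sketch in pure NumPy follows, using synthetic data as a hypothetical stand-in for the corn yield ~ precipitation + temperature dataset (the real script uses R/sklearn, not this helper):

```python
import numpy as np


def kfold_glm_r2(X: np.ndarray, y: np.ndarray, k: int = 5, seed: int = 0) -> float:
    """Mean out-of-fold R^2 for an ordinary least-squares fit
    (a Gaussian-family GLM with identity link)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds (with an intercept column)
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # Score on the held-out fold
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ beta
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic stand-in for yield ~ precipitation + temperature
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 2))
    y = 3.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
    print(f"mean out-of-fold R^2: {kfold_glm_r2(X, y):.3f}")
```

The benchmark runs repeat this fit-and-score loop many times while the data is sourced either from disk (SQLite) or from a PostgreSQL database.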
First, let’s set up our dev environment in FreeBSD. I assume you already have the latest FreeBSD RELEASE installed on your machine [1] and you are probably staring at an empty shell (/bin/sh) with lots of room on your hard drive. You can set up a desktop environment on this workstation [1], or you can use your laptop to SSH into this system to get things set up. We will use Bastille to manage our jails. It has a Docker-like syntax that will make working with jails easier (coming from the Linux world). Here’s a boilerplate setup, but you can read the Bastille BSD website for a deeper dive on how it works [1]. This work is implemented as Ansible playbooks as well ([1, 2, 3]). But good old /bin/sh shell scripts are demonstrative enough.
Benchmarking Setup
- Expand Larabel’s 2021 Phoronix Benchmarks Against:
- Bare-metal (c3-standard-8; Intel(R) Core(TM) i7-6700T CPU @ 2.80GHz [8 cores] / 32 GB of Memory / 1 TB NVMe SSDs)
- Bhyve-hosted Virtual Machines for Debian and FreeBSD 14.0 (e2-medium; 2 CPUs / 4 GB of Memory / 1 TB NVMe SSDs)
- Single-instance Podman Containers Run From Debian VMs and Single-instance Jails Run from FreeBSD VMs (e2-medium; 2 CPUs / 4 GB of Memory / 1 TB NVMe SSDs)
- Expand Ryding & Johansson’s 2020 Benchmarks
- Amended to Use Telemetry and Actual Modeling Inference Workflows in R / Python
- Bhyve-hosted Virtual Machine for Debian 12.5 and Podman (e2-medium; 2 CPUs / 4 GB of Memory / 1 TB NVMe SSDs)
- K-fold Cross-validation of a Generalized Linear Model trained from on-disk data (R/Python)
- Runs with : 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 containers running simultaneously
- K-fold Cross-validation of a Generalized Linear Model trained from database-hosted data (R/Python)
- Runs with : 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 containers running simultaneously
- Bhyve-hosted Virtual Machine for FreeBSD 14.0 and Jails (e2-medium; 2 CPUs / 4 GB of Memory / 1 TB NVMe SSDs)
- K-fold Cross-validation of a Generalized Linear Model trained from on-disk data (R/Python)
- Runs with : 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 containers running simultaneously
- K-fold Cross-validation of a Generalized Linear Model trained from database-hosted data (R/Python)
- Runs with : 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 containers running simultaneously
Bastille for Jail Management
## 1. Configures a Base Install of FreeBSD to use Bastille for Managing Jails with a
## Docker-like Syntax
pkg install -y bastille && sysrc bastille_enable=YES
# Allow the use of LIMITS in bastille so we can provision containers similar
# to what is used by cloud providers
echo "kern.racct.enable=1" >> /boot/loader.conf
# If you have not created a ZFS filesystem for bastille, this will tack one on to zroot
# Change as-needed
zfs create zroot/bastille
# Let bastille know the zpool (and dataset prefix) we'd like it to use
sysrc -f /usr/local/etc/bastille/bastille.conf bastille_zfs_enable="YES"
sysrc -f /usr/local/etc/bastille/bastille.conf bastille_zfs_zpool="zroot"
sysrc -f /usr/local/etc/bastille/bastille.conf bastille_zfs_prefix="bastille"
# Determine the network interface and FreeBSD release we will bootstrap with
IF=$(route get 8.8.8.8 | grep interface | awk '{print $2}')
RELEASE=$(uname -r | cut -d '-' -f 1-2)
# Bootstrap the distfiles for the current version of FreeBSD
bastille bootstrap $RELEASE update
PostgreSQL in FreeBSD
Version tracking in ports. PostgreSQL extensions
## 1. Testing Database Installation On Bare-metal
pkg install -y postgresql15-server postgis34
# optional -- create a jailed connection pooler to handle high database traffic
#bastille create connection_pooler $RELEASE 0.0.0.0 $IF
#bastille pkg connection_pooler install -y pgbouncer
Python in FreeBSD
A reasonably up-to-date version of Python is included in FreeBSD ports. The version of Python (and of common packages) is often a few releases behind the official releases from the Python Software Foundation and package maintainers. For example, Python 3.11 is stable as of this writing, but Python 3.9 is what is current in FreeBSD ports. For 99% of developers, this is fine. My preference is for the security and stability of Python over new features. For reproducibility, we can record the version of Python and the installed packages used for inference and predictions and store them in a manifest file. When we are ready to move our implementation to production (e.g., on Kubernetes), we can build containers with Anaconda that re-create the exact environment used in testing. The code below gives an example of this.
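The same manifest can also be generated from inside Python. As a sketch (a hypothetical helper, using only the standard library’s importlib.metadata), recording the interpreter version and installed packages in a pip-style format might look like:

```python
import sys
from importlib import metadata


def write_manifest(path: str) -> list[str]:
    """Record the interpreter version and installed package versions in a
    pip-style 'name==version' manifest, for later environment reconstruction
    (e.g., with Anaconda/conda-forge)."""
    vi = sys.version_info
    lines = [f"# python=={vi.major}.{vi.minor}.{vi.micro}"]
    lines += sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip broken/nameless distributions
    )
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    print("\n".join(write_manifest("/tmp/requirements.txt")[:5]))
```

The Bastille template below achieves the same thing with `pip list --format freeze` from inside the jail.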
## 2. Build a Python (3.9) Development Environment Container Via FreeBSD Jails
## Using Templating
# Create a Bastillefile template to work with (uses latest stable Python from ports)
mkdir -p /usr/local/bastille/templates/local/python-database-client
# Note: the quoted 'EOF' keeps the shell from expanding ${version}, which is
# substituted by bastille (via ARG) when the template is applied
cat <<'EOF' > /usr/local/bastille/templates/local/python-database-client/Bastillefile
# bastille python template for dev environment (C)2024 Kyle Taylor
LIMITS memoryuse 4G
CMD mkdir -p /payload
ARG version=39
PKG python${version} python
PKG py${version}-tqdm
PKG py${version}-sqlalchemy20
PKG py${version}-pandas
CMD python -m venv --system-site-packages /payload/python-environment
# no need to 'activate' -- calling pip by its full path pins it to the venv
CMD /payload/python-environment/bin/pip --require-virtualenv list --format freeze >> /payload/requirements.txt
CMD python -VV
EOF
# create a jail and apply our template to it
bastille create python-sandbox $RELEASE 0.0.0.0 $IF
bastille template python-sandbox local/python-database-client
GNU R in FreeBSD
GNU R is available in FreeBSD ports and is actively maintained. However, as with Python, data scientists and statisticians will typically want to pin the version of R (and of package dependencies) when they are ready to share their code with their DevOps team for deployment in production. This is done so that inference results are consistent and reproducible between the data scientist’s fitted models and production. It is perfectly fine to grab the versions of R (and packages) installed from ports and CRAN and to store them in a manifest. Typically, the DevOps staff will use Anaconda to re-create the environment that is needed in production. The code below follows this pattern, but for our benchmarking workflows I did not use Anaconda; I used the version of GNU R installed from the operating system’s package repositories.
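On the DevOps side, translating that manifest for Anaconda is mechanical. As a sketch (a hypothetical helper, assuming conda-forge’s `r-<name>` package naming), the pip-style entries the jail produces can be rewritten into the `name=version` match specs that `conda create` accepts:

```python
def manifest_to_conda_specs(lines: list[str]) -> list[str]:
    """Translate 'r-name==1.2.3' manifest entries into conda match specs
    ('r-name=1.2.3'), the form accepted by `conda create` / `mamba create`."""
    specs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        name, _, version = line.partition("==")
        specs.append(f"{name}={version}" if version else name)
    return specs


if __name__ == "__main__":
    manifest = ["r-ggplot2==3.4.4", "r-rpostgresql==0.7", "# a comment"]
    print(manifest_to_conda_specs(manifest))
```

Whether every pinned version exists on conda-forge is something the DevOps team still has to verify; ports and conda-forge do not always track the same releases.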
## 2. Build a GNU R Development Environment Container Via FreeBSD Jails
## Using Templating
# Create a Bastillefile template to work with (uses latest stable GNU R from ports)
mkdir -p /usr/local/bastille/templates/default/gnu-r
# Note: the quoted 'EOF' keeps the shell from expanding the R variables below
cat <<'EOF' > /usr/local/bastille/templates/default/gnu-r/r_package_version_dump.r
ip <- as.data.frame(installed.packages()[,c(3:4)])
ip <- ip[is.na(ip$Priority),1:2,drop=FALSE]
ip <- paste(paste("r", tolower(rownames(ip)), sep="-"), ip$Version, sep="==")
writeLines(ip, file("/payload/requirements.txt"))
EOF
cat <<'EOF' > /usr/local/bastille/templates/default/gnu-r/Bastillefile
# bastille GNU R template for dev environment (C)2024 Kyle Taylor
LIMITS memoryuse 4G
CMD mkdir -p /payload
CP r_package_version_dump.r /payload/r_package_version_dump.r
PKG R R-cran-randomForest R-cran-ggplot2 R-cran-RSQLite R-cran-RPostgreSQL
CMD Rscript /payload/r_package_version_dump.r
CMD R --version
EOF
# create a jail and apply our template to it
bastille create gnu-r-sandbox $RELEASE 0.0.0.0 $IF
bastille template gnu-r-sandbox default/gnu-r
Performance Benchmarking
Bare-metal Testing
## Python/R Workflows, Disk Utilization, Memory, and Database Workflows
phoronix-test-suite batch-benchmark sqlite # DISK USAGE (and data)
phoronix-test-suite batch-benchmark stress-ng # CPU USAGE
phoronix-test-suite batch-benchmark osbench # OS EFFICIENCY
phoronix-test-suite batch-benchmark pybench # DATA : python
phoronix-test-suite batch-benchmark numpy # DATA : python
phoronix-test-suite batch-benchmark rbenchmark # DATA : gnu r
phoronix-test-suite batch-benchmark pgbench # DATA : postgresql
VM Testing
Container Environment Testing
Staging Environment : Bhyve and Podman
Production Environment : The Kubernetes Hand-off
Discuss the potential for managing production deployments for API services on FreeBSD directly (side-stepping the cloud). As well as hybrid on-prem and kubernetes integration using Tailscale. And discuss the inevitability of Kubernetes in our current IT climate.
Results
Discussion
What is missing?
Native FreeBSD builds for data science tooling for Conda-forge
PCI Pass-through with Bhyve
PCI pass-through support for GPU devices has been available in FreeBSD since 2021. NVIDIA GPUs are not currently supported, as the vendor prevents VM pass-through for the bulk of their cards by checking the CPU for virtualization (and preventing ROM export), but select AMD GPUs are supported [1, 2]. Pass-through currently requires extracting the PCI ROM for the graphics card [1] and passing the ROM to bhyve’s boot firmware for execution. The bus address for a dedicated card is passed to the VM. If your machine only has one graphics card, your FreeBSD host will go headless as pass-through is attempted. Windows and Linux guests have been tested by users in the wild, but presumably data scientists will be most interested in implementing PCI pass-through for Linux guests on bhyve.
Using ROCm and CUDA via FreeBSD’s Linux Kernel Emulation (Linuxulator)
Native support for ROCm and CUDA
A push for experimental ROCm driver support in FreeBSD (as part of graphics/drm-devel-kmod) began in 2018, but the work stalled in 2019 [1, 2].
Appendix
Benchmarking Source Code in R/Python
import sys
import os
import multiprocessing
import random
import sqlite3
import time
import uuid
from urllib.parse import quote as uri_quote, unquote as uri_unquote

import numpy as np
import pandas as pd
import sqlalchemy as sa
from sqlalchemy.pool import NullPool
from formulaic import Formula
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
def build_credentials_dict(
    database: str,
    host: str,
    password: str = os.environ.get("PG_PASS", ""),
    user: str = os.environ.get("PG_USER", ""),
) -> dict:
    """Wrapper for credentials parsing that psycopg2 understands"""
    return {
        "database": database,
        "host": host,
        "user": user,
        "password": password,
    }
def build_connection_string_sa(
    database: str,
    host: str = os.environ.get("PG_HOST", ""),
    user: str = os.environ.get("PG_USER", ""),
    password: str = os.environ.get("PG_PASS", ""),
    application: str = "python-benchmarking",
) -> str:
    """Returns a connection string to a user-specified postgres backend
    that SQL Alchemy understands"""
    credentials = build_credentials_dict(
        database=database, host=host, password=password, user=user
    )
    return (
        f"postgresql://{uri_quote(uri_unquote(credentials['user']))}"
        + f":{uri_quote(uri_unquote(credentials['password']))}@{host}"
        + f":5432/{database}?application_name={application}"
    )
def sqlite_read_df() -> pd.DataFrame:
    with sqlite3.connect(
        "/payload/corn_frequency_testing_data.sqlite"
    ) as connection:
        df = pd.read_sql(
            "SELECT * FROM corn_frequency_testing_data", connection
        )
    return df
def sqlite_write_df(df: pd.DataFrame) -> str:
    output_file = f"/payload/{uuid.uuid4()}.sqlite"
    with sqlite3.connect(output_file) as connection:
        df.to_sql("test_df", con=connection, if_exists="replace")
    return output_file
def postgres_read_df():
    database_engine = sa.create_engine(
        build_connection_string_sa(
            database="postgres",
            host=os.environ.get("PG_HOST"),
            user=os.environ.get("PG_USER"),
            password=os.environ.get("PG_PASS"),
        ),
        poolclass=NullPool,
    )
    with database_engine.connect() as connection:
        df = pd.read_sql(
            "SELECT * FROM public.corn_frequency_testing_data", connection
        )
    return df
def fit_evaluate_regression_models(_replicate: int = 0) -> pd.DataFrame:
    # read our table entry from the postgresql database
    dbread_start = time.time()
    training_df = postgres_read_df()
    dbread_stop = time.time()
    dbread_time = dbread_stop - dbread_start
    # read our table entry from the sqlite (file) database
    fread_start = time.time()
    training_df = sqlite_read_df()
    fread_stop = time.time()
    fread_time = fread_stop - fread_start
    # write our table entry back out to sqlite
    fwrite_start = time.time()
    outfile = sqlite_write_df(training_df)
    fwrite_stop = time.time()
    fwrite_time = fwrite_stop - fwrite_start
    try:
        os.unlink(outfile)
    except Exception:
        pass
    # training / testing split
    train_start = time.time()
    # shallow-learning RF regression trees
    # this is a large stack of predictors with a myriad of
    # potential interactions. Rather than explore the parameters
    # in detail, we'll lean on RF to capture the variation for us
    training_variables = [
        c for c in training_df.columns if c not in ["nass_frequency", "index"]
    ]
    X_train, X_test, Y_train, Y_test = train_test_split(
        training_df.loc[:, training_variables],
        training_df["nass_frequency"],
        test_size=round(0.1 * len(training_df)),
    )
    w = 1 / Y_train.value_counts()
    w = Y_train.replace(w)
    model_parameters = {
        "max_depth": 45,
        "min_samples_split": 3,
        "min_impurity_decrease": 0.05,
        "n_estimators": 30,
        "n_jobs": -1,
    }
    rf_regression_model = RandomForestRegressor(**model_parameters).fit(
        X_train[training_variables], Y_train
    )
    glm_training_df = X_train.copy()
    glm_training_df["p_nass_frequency"] = rf_regression_model.predict(X_train)
    glm_training_df["nass_frequency"] = Y_train
    # fit a shallow-boosted GLM for 'frequency of years grown'
    # to everything we've got (but focusing on key climate variables)
    w_step = w.copy()
    rsq_glm = -100.0
    for i in range(50):
        y, X = Formula(
            """nass_frequency ~
               poly(tmean_annual, 2) + poly(ppt_annual, 2) +
               poly(tmax_04, 2) + poly(tmax_05, 2) +
               poly(ppt_04, 2) + poly(ppt_05, 2) +
               p_nass_frequency + tmean_annual:ppt_annual"""
        ).get_model_matrix(glm_training_df)
        m_glm = LinearRegression().fit(X, y, sample_weight=w_step)
        residuals = abs(y - m_glm.predict(X)).iloc[:, 0]
        r_sq_step = m_glm.score(X, y)
        if round(r_sq_step, 5) > round(rsq_glm, 5):
            w_step = w_step + (
                (residuals - min(residuals)) /
                (max(residuals) - min(residuals))
            )
            rsq_glm = r_sq_step
        else:
            w_step = w_step + np.array(
                [random.uniform(0, 100) / 100 for _ in range(len(glm_training_df))]
            )
    train_stop = time.time()
    train_time = train_stop - train_start
    # evaluate
    eval_start = time.time()
    X_test = X_test.copy()
    X_test["p_nass_frequency"] = rf_regression_model.predict(
        X_test[training_variables]
    )
    # rebuild the design matrix for the hold-out set with the fitted model spec
    X_t = X.model_spec.get_model_matrix(X_test)
    predicted = np.asarray(m_glm.predict(X_t)).ravel()
    predicted[predicted < 0] = 0
    predicted[predicted > 10] = 10
    predicted = predicted.astype(int)
    # overall accuracy for all classes
    overall_accuracy = sum((Y_test.values - predicted) == 0) / len(Y_test)
    eval_stop = time.time()
    eval_time = eval_stop - eval_start
    return pd.DataFrame([{
        "dbread_time_sec": dbread_time,
        "fread_time_sec": fread_time,
        "fwrite_time_sec": fwrite_time,
        "training_time_sec": train_time,
        "evaluation_time_sec": eval_time,
        "r_squared": rsq_glm,
        "rf_r_squared": r2_score(Y_train, glm_training_df["p_nass_frequency"]),
        "overall_accuracy": overall_accuracy,
    }])
if __name__ == "__main__":
    if len(sys.argv) < 2:
        raise RuntimeError(
            "At least one argument must be supplied (backend label)."
        )
    # run our fitting tests in parallel across several configurations of threads
    # -- we'll use K replicates to calculate our statistics in aggregate across runs
    k = 200
    results = []
    for i in [1, 2, 4, 6, 8]:
        with multiprocessing.Pool(processes=i) as pool:
            r = pd.concat(pool.map(fit_evaluate_regression_models, range(k)))
        r["threads"] = i
        results.append(r)
    results = pd.concat(results).reset_index(drop=True)
    # total size read from sqlite database (in megabytes)
    sqlite_size_t = (
        os.path.getsize("/payload/corn_frequency_testing_data.sqlite") / 1e6
    )
    results["sqlite_total_mb_read_per_second"] = (
        sqlite_size_t / results["fread_time_sec"]
    )
    results["sqlite_total_mb_write_per_second"] = (
        sqlite_size_t / results["fwrite_time_sec"]
    )
    results["backend"] = sys.argv[1]
    results.to_csv("./results.csv", index=False)
    sys.exit(0)
And for our heavily telemetered, computationally expensive GLM and cross-validation workflow in R:
# this is ~10 megabytes of data
wget -O ./corn_frequency_testing_data.sqlite https://the-integral.dev/data/corn_frequency_testing_data.sqlite
bastille cp gnu-r-sandbox ./corn_frequency_testing_data.sqlite /payload/corn_frequency_testing_data.sqlite
cat <<'EOF' > ./r_regression_with_crossvalidation.r
require("RSQLite")
require("RPostgreSQL")
require("parallel")
require("randomForest")
sqlite_read_df <- function(path="/payload/corn_frequency_testing_data.sqlite") {
connection <- RSQLite::dbConnect(RSQLite::SQLite(), path)
df <- RSQLite::dbReadTable(connection, 'corn_frequency_testing_data')
RSQLite::dbDisconnect(connection)
return(df)
}
sqlite_write_df <- function(df) {
output_file <- paste("/payload/", system("uuidgen", intern=T), ".sqlite", sep="")
connection <- RSQLite::dbConnect(RSQLite::SQLite(), output_file)
RSQLite::dbWriteTable(conn=connection, name="test_df", value=df, overwrite=T)
RSQLite::dbDisconnect(connection)
return(output_file)
}
postgres_read_df <- function() {
connection <- RPostgreSQL::dbConnect(drv=RPostgreSQL::PostgreSQL(),
user=Sys.getenv("PG_USER"),
password=Sys.getenv("PG_PASS"),
host=Sys.getenv("PG_HOST"),
port=5432,
dbname="postgres"
)
df <- RPostgreSQL::dbGetQuery(
connection,
"SELECT * from public.corn_frequency_testing_data;"
)
RPostgreSQL::dbDisconnect(connection)
return(df)
}
fit_evaluate_regression_models <- function(x) {
# read our table entry from the postgresql database
dbread_start <- Sys.time()
training_df <- postgres_read_df()
dbread_stop <- Sys.time()
dbread_time <- as.numeric(difftime(dbread_stop,dbread_start, units='sec'))
# read our table entry from the sqlite (file) database
fread_start <- Sys.time()
training_df <- sqlite_read_df()
fread_stop <- Sys.time()
fread_time <- as.numeric(difftime(fread_stop,fread_start, units='sec'))
# write our table entry back out to sqlite
fwrite_start <- Sys.time()
outfile <- sqlite_write_df(training_df)
fwrite_stop <- Sys.time()
outfiles <- file.remove(outfile)
fwrite_time <- as.numeric(difftime(fwrite_stop,fwrite_start, units='sec'))
# training / testing split
train_start <- Sys.time()
training <- sample(nrow(training_df), 0.9*nrow(training_df))
testing <- training_df[ !(1:nrow(training_df) %in% training) , ]
training <- training_df[ training, ]
# shallow-learning RF regression trees
# this is a large stack of predictors with a myriad of
# potential interactions. Rather than explore the parameters
# in detail, we'll lean on RF to capture the variation for us
w <- as.vector(1/table(training$nass_frequency))
weights <- unlist( sapply(
1:length(w), function(i) {
as.vector(rep(w[i], times=sum(training$nass_frequency == i-1)))
} )
)
m_rf <- randomForest::randomForest(
nass_frequency ~ .,
weights=weights,
data=training,
ntree=30
)
# use the features as a label in our original training set
training$p_nass_frequency <- as.numeric(m_rf$predicted)
# fit a shallow-boosted GLM for 'frequency of years grown'
# to everything we've got (but focusing on key climate variables)
w_step <- weights
rsq_glm <- 0.0
for(i in 1:50){
m_glm <- glm(
  nass_frequency ~
    poly(tmean_annual, 2) + poly(ppt_annual, 2) +
    poly(tmax_04, 2) + poly(tmax_05, 2) +
    poly(ppt_04, 2) + poly(ppt_05, 2) +
    p_nass_frequency + tmean_annual:ppt_annual,
  weights=w_step,
  data=training
)
r_sq_step <- ( 1 - ( m_glm$deviance / m_glm$null.deviance ) )
if(r_sq_step > rsq_glm){
w_step <- w_step + (
  (as.vector(m_glm$residuals) - min(m_glm$residuals)) /
  (max(m_glm$residuals) - min(m_glm$residuals))
)
rsq_glm <- r_sq_step
} else {
w_step <- w_step + ( runif(min=0, max=100, n=nrow(training)) / 100 )
}
}
train_stop <- Sys.time()
train_time <- as.numeric(difftime(train_stop,train_start, units='sec'))
# evaluate
eval_start <- Sys.time()
testing$p_nass_frequency <- as.numeric(
predict(m_rf, newdata=testing, type="response"))
predicted <- as.vector(predict(m_glm, newdata=testing, type="response"))
predicted[ predicted < 0 ] <- 0
predicted[ predicted > 10 ] <- 10
predicted <- as.integer(predicted)
# overall accuracy for all classes
overall_accuracy <- sum((testing$nass_frequency - predicted) == 0) / nrow(testing)
eval_stop <- Sys.time()
eval_time <- as.numeric(difftime(eval_stop,eval_start,units='sec'))
return(data.frame(
dbread_time_sec=dbread_time,
fread_time_sec=fread_time,
fwrite_time_sec=fwrite_time,
training_time_sec=train_time,
evaluation_time_sec=eval_time,
r_squared=rsq_glm,
rf_r_squared=median(m_rf$rsq),
overall_accuracy=overall_accuracy)
)
}
if (!interactive()) {
args = commandArgs(trailingOnly=TRUE)
if (length(args)==0) {
stop("At least one argument must be supplied (backend label).", call.=FALSE)
}
# run our fitting tests in parallel across several configurations of threads
# -- we'll use K replicates to calculate our statistics in aggregate across runs
k=200
results <- list()
for (i in c(1,2,4,6,8)){
cluster <- parallel::makeCluster(i)
clusterExport(cluster, "sqlite_read_df")
clusterExport(cluster, "sqlite_write_df")
clusterExport(cluster, "postgres_read_df")
r <- do.call(
  "rbind",
  parLapply(cluster, 1:k, fit_evaluate_regression_models)
)
r$threads <- i
results[[i]] <- r
parallel::stopCluster(cluster)
}
results <- do.call("rbind", results)
# total size read from sqlite database (in megabytes)
sqlite_size_t <- file.info(
"/payload/corn_frequency_testing_data.sqlite")$size / 1e+6
results$sqlite_total_mb_read_per_second <-
sqlite_size_t / results$fread_time_sec
results$sqlite_total_mb_write_per_second <-
sqlite_size_t / results$fwrite_time_sec
results$backend <- args[1]
write.csv(results, "./results.csv", row.names=FALSE)
}
EOF