Tigraine

Daniel Hoelbling-Inzko talks about programming

Why I love Go error handling

One thing that Go almost forces you to do is to explicitly handle each and every error that any random part of the system might create. This has one very obvious side effect of making even simple code quite long and peppered with if err != nil statements:

func writeFile() error {
   fileName := "test.txt"
   // os.Open would open the file read-only, so open it for writing instead
   f, err := os.OpenFile(fileName, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644)
   if err != nil {
      return fmt.Errorf("failed to open file %s: %w", fileName, err)
   }
   defer f.Close()

   d1 := []byte("hello\ngo\n")
   _, err = f.Write(d1)
   if err != nil {
      return fmt.Errorf("unable to write to file %s: %w", fileName, err)
   }
   return nil
}

As you can see, a simple open-file-and-write requires 2 error checks that add almost 50% more lines of code to this rather simple function. Because of this, Java/C#/C++ people you show Go code to almost always react with aversion and distaste and never give the language another chance (although I think this is now changing gradually).

But I actually think this is Go's biggest strength and a boon to developers. By having errors be so "in your face", you have to do something about them. In languages where exceptions travel up the call stack it's all too easy to just expect someone up the food chain to catch your exception. All too often that doesn't happen, or if it does it's a very generic "catch-everything" block that can only log the problem without having any chance to actually recover from it.

Go in contrast makes you think about every error in detail and how it affects the current control flow. A classic example of this would be a for loop that calls some method. All too often I've seen bugs because people didn't put a try/catch inside the loop, so the first problem that arises (most likely a very rare one) stops execution of the loop and you are left wondering why you're missing half your data. If you have to think hard about each error, you're much more likely to also think about how it affects the code you're currently writing, so in Go I find myself writing continue and break a lot more frequently than I usually do in Java/Kotlin.
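To make the loop case concrete, here is a minimal sketch. The parseAll function and its inputs are invented for illustration; the point is only that the error check sits inside the loop, so one bad item is recorded and skipped instead of aborting everything:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseAll converts every input it can and collects the failures instead of
// aborting on the first bad item. (Hypothetical helper for this sketch.)
func parseAll(inputs []string) ([]int, []error) {
	var values []int
	var failed []error
	for _, in := range inputs {
		n, err := strconv.Atoi(in)
		if err != nil {
			// Conscious decision: one bad record must not stop the loop.
			failed = append(failed, fmt.Errorf("skipping %q: %w", in, err))
			continue
		}
		values = append(values, n)
	}
	return values, failed
}

func main() {
	values, failed := parseAll([]string{"1", "two", "3"})
	fmt.Println(values, len(failed)) // prints "[1 3] 1"
}
```

Whether to continue, break or return is then a decision you make per error, right where it happens.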

Another case where conscious error handling is very handy in my opinion is when making an application resilient to failures downstream (see my recent post on Bulkheads). Only if you have useful error handling in place on all levels of the application can you start building logic that responds to these errors (without having to go on an archaeological excavation of the whole call stack).

Obviously you have to have some discipline in your errors; just doing return err won't do you any favours here. But I find the way Go requires error handling also tends to promote more deliberate returning of errors that carry actually useful information up the call stack (because you need that info to handle them up there). If that's in place you can also make much better decisions on how to treat these errors in failure scenarios, for example when deciding if a CircuitBreaker should trip or not.

Filed under golang, errors

A bulkhead in Go is really just a semaphore

When looking to build a software system that's resilient in the face of failure there are a bunch of useful concepts and components that all need to work together to achieve that goal.

One of these tools is Bulkheading.

Bulkheads in traditional shipbuilding are a means to keep water that's entering the vessel in one compartment from flooding the whole ship and sinking it. Translated to software it's pretty similar: You try to compartmentalise the application so failures in one part don't adversely affect the rest of the application.

A classic example of why this is important would be a database that's acting up and starts responding slowly to queries.

By itself that would not be a problem - a slow request would run into a timeout and the application would gracefully handle that down the line. It does become a problem though if the clients continue hammering that service with more and more queries while the database is slow. The slow responses end up blocking resources in the application, and given high enough timeouts and enough incoming requests there is a real risk of the application running out of resources and crashing.

The other issue in such a scenario is that once the database starts becoming unstable/slow, adding more queries just equates to kicking someone that's already down. There is a high chance that the added queries will just make matters worse and cause a struggling database to shut down completely.

The solution to this is to introduce a maximum number of concurrent requests that the application is allowed to send to the database. Once the DB starts getting slow the incoming requests are not immediately submitted to the DB but actually have to wait until another active request is done. By putting a maximum wait time on this you can essentially limit the number of in-flight requests to a known quantity that will prevent your service from consuming all available resources and crashing. And you get to degrade the service gracefully.

Why not use a normal timeout? Timeouts are a static upper bound while latency is rarely uniform. Putting a timeout on an operation that during normal operation responds in anywhere between 5ms and 10s will usually call for a timeout of 15-20 seconds depending on how generous you are. With a 20 second timeout and a quite moderate 300 operations per second you end up at a respectable 6,000 in-flight requests (300 per second times 20 seconds) that tie up resources in your application. In Java-Land that would already spell doom for your application's threadpools. So in addition to maximum duration timeouts we need something more - and that something is a Bulkhead.

After having used the excellent Resilience4J library in Java to "failure-proof" a service that was having spotty collaborators we then moved on to some Go services to do the same. We expected to find a lot of libraries providing Bulkheading, but we couldn't really find one that's maintained and confidence inspiring.

So we looked at alternatives. Remembering that a Bulkhead isn't anything super fancy, we looked at what the Go project itself ships and hit gold in the golang.org/x/sync/semaphore package (strictly speaking not part of the standard library, but one of the Go team's supplementary x repositories). Specifically the Weighted semaphore implementation is essentially all you need for a Bulkhead. A bulkhead in Go is simply a semaphore, with all the relevant timeout features being enabled by the clever use of the context package. It doesn't come with monitoring out of the box like Resilience4J does - but that's easy to layer on top and the API ends up being very simple:

sem := semaphore.NewWeighted(5) // allow 5 concurrent calls
go func() {
        // wait at most 1 second for a free slot
        ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
        defer cancel() // always release the timer behind the context
        // Acquire the semaphore
        err := sem.Acquire(ctx, 1)
        if err != nil {
            // bulkhead is full and we timed out
            return
        }
        defer sem.Release(1)

        // do work
}()

As you can see, since semaphore supports context we can very easily add our maximum waiting time for the bulkhead via context.WithTimeout, and we've essentially implemented a Bulkhead with nothing but a package maintained by the Go team and quite straightforward, idiomatic Go syntax.

Filed under go, resilience

Debugging Go IOWait Hang: Sometimes it's really not your code

If something looks like a bug in the Language Runtime, Standard Library or the Operating System I tend to always approach it with caution: It's usually a bug in my code and I'm just not seeing it.

But sometimes it's not me - it's really the compiler and you spend a solid week debugging a Go program until you find out that cross-compiling from OSX to Linux leads to a stdlib Bug that manifests itself with the whole application just hanging in IOWait loops given enough concurrency.

Obviously the whole thing was really frustrating because:

  • The bug only happened on production servers (obviously - anything else would not be fun).
  • Could only be reproduced on a large dataset of 300 million items (so every test also takes quite a while)
  • I had to test if it works without concurrency (which took 2 days and yes it did)

But the important finding from this exercise was that you can print the full stacktrace of all running Goroutines as well as their status for a running/hanging program! You just have to send the SIGABRT signal to the process (kill -ABRT followed by the process id)! This is similar to what you see when a panic occurs and was massively helpful in hunting down this bug. Kudos to the Go team for that.

An example for this:

package main

func main() {
  for {}
}

The program will obviously hang in a busy loop, but if you send it SIGABRT via kill -ABRT you'll get something similar to this printed to stderr:

SIGABRT: abort
PC=0x1056d70 m=0 sigcode=0

goroutine 1 [running]:
main.main()
        /Users/tigraine/projects/test/main.go:4 fp=0xc00003c788 sp=0xc00003c780 pc=0x1056d70
runtime.main()
        /usr/local/Cellar/go/1.14.1/libexec/src/runtime/proc.go:203 +0x212 fp=0xc00003c7e0 sp=0xc00003c788 pc=0x102b3f2
runtime.goexit()
        /usr/local/Cellar/go/1.14.1/libexec/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003c7e8 sp=0xc00003c7e0 pc=0x10528f1
...
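
If you can't send a signal to the process, a similar (less detailed) dump can be produced from inside the program with runtime.Stack. This is only a sketch: dumpGoroutines is a made-up helper, the 1 MiB buffer is an arbitrary size, and the output format is not identical to the SIGABRT dump above:

```go
package main

import (
	"fmt"
	"runtime"
)

// dumpGoroutines returns the stack traces of all goroutines as text.
// (Hypothetical helper; a larger program may need a bigger buffer.)
func dumpGoroutines() string {
	buf := make([]byte, 1<<20)    // 1 MiB, arbitrary for this sketch
	n := runtime.Stack(buf, true) // true = include all goroutines, not just the caller
	return string(buf[:n])
}

func main() {
	block := make(chan struct{})
	go func() { <-block }() // a goroutine parked on a channel receive
	runtime.Gosched()       // give it a chance to start

	fmt.Print(dumpGoroutines())
}
```

You could hook this up to a debug HTTP endpoint or a timer to get the same kind of insight into a hanging program.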
Filed under golang, go, debugging

Upping Apache PoolingHttpClientConnectionManager pool limits

Imagine configuring an HTTP connection pool and setting setMaxTotal to 50. A reasonable assumption would be that henceforth up to 50 concurrent connections will be made by the HttpClient upstream.

Well not in Java-land - here you'll get exactly 2 connections going out - apparently regardless of what you set as maximum total connections.

Turns out there is a second setting on the PoolingHttpClientConnectionManager that's called maxPerRoute and that controls how many connections you can make to the same host/url combination. Since in our current setup we mostly query one endpoint over and over again the maxTotal setting is pretty useless and the limiting factor will be the maxPerRoute.

Thankfully there is a setDefaultMaxPerRoute which can be tweaked, or there is the ability to specify individual limits per upstream route with setMaxPerRoute.

The final code in question is:

PoolingHttpClientConnectionManager poolingConnectionManager = new PoolingHttpClientConnectionManager();
poolingConnectionManager.setMaxTotal(MAX_TOTAL_CONNECTIONS);
poolingConnectionManager.setDefaultMaxPerRoute(MAX_TOTAL_CONNECTIONS);

To debug the issue of slow responding upstream servers I also wrote a little Go web server called blackhole that does exactly what the name implies: It accepts any HTTP connection and swallows it for 100 seconds. This makes it easy to test your code against slow responding HTTP servers (like when under duress or if the system becomes unresponsive).

Filed under java, apache

Convert a millisecond precision unix timestamp to Time in go

It's no real secret that I do love the programming language Go. So I was really delighted to see that Go apparently does all the right things when it comes to its time package, which handles time zones etc. correctly by default as opposed to being something bolted on after the fact like in most other languages.

But for some unknown reason it's just way too complex to convert a millisecond resolution Unix timestamp to time.Time. The built-in time.Unix() function only supports second and nanosecond precision.

This means that you either have to multiply the millis to nanoseconds or split them into seconds and nanoseconds. So obviously my naive implementation was:

time.Unix(0, timestamp * int64(1000000))

But that code looked ugly to me - especially if you have to do this a few times around the codebase - so I wrote a function.

But for some reason I also decided to benchmark my function as I am working on a performance critical piece of code right now. And it turns out that the simple multiplication to turn millis into nanos is 2x slower than dividing the millis into seconds and then turning the remainder into nanos.

time.Unix(ms/int64(millisInSecond), (ms%int64(millisInSecond))*int64(nsInSecond))
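
Spelled out as complete functions, the two variants look like this. The function and constant names below are my own for this sketch; note that the factor applied to the remainder has to be the number of nanoseconds per millisecond (1,000,000). Since Go 1.17 the standard library also ships time.UnixMilli, which the sketch uses as a cross-check:

```go
package main

import (
	"fmt"
	"time"
)

const (
	millisPerSecond = 1000
	nanosPerMilli   = 1000000
)

// fromMillisMult converts by scaling the whole timestamp to nanoseconds.
func fromMillisMult(ms int64) time.Time {
	return time.Unix(0, ms*nanosPerMilli)
}

// fromMillisDiv splits the timestamp into whole seconds and leftover nanoseconds.
func fromMillisDiv(ms int64) time.Time {
	return time.Unix(ms/millisPerSecond, (ms%millisPerSecond)*nanosPerMilli)
}

func main() {
	ms := int64(1500000000123)
	a, b := fromMillisMult(ms), fromMillisDiv(ms)

	// Both variants agree with each other and with time.UnixMilli (Go 1.17+).
	fmt.Println(a.Equal(b), a.Equal(time.UnixMilli(ms))) // true true
	fmt.Println(b.UTC().Format(time.RFC3339Nano))
}
```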

Benchmark:

goos: darwin
goarch: amd64
pkg: github.com/tigraine/go-timemillis
BenchmarkMult-8         2000000000               0.50 ns/op
BenchmarkDiv-8          2000000000               0.25 ns/op

So I packaged my findings into a library which is now available on GitHub: go-timemillis.

Filed under golang, time

Golang hidden gems: testing.T.Log

One thing I love about Go is its build chain and overall ease of use. Some things take time to get used to, but the lightning fast builds and the convention-based testing Go offers are addicting right from the start.

Today I found another hidden Gem I think is just genius: testing.T.Log(). Ok I admit, not the most sexy method to get excited about - but bear with me for a moment. Imagine the following code.

func TestSomething(t *testing.T) {
  t.Log("Hello World")
}

What's the output? If you'd expect Hello World you are mistaken. With a plain go test run the output is exactly nothing :)

testing.T.Log() only prints something if the test failed (a testing.T.Error or testing.T.Fatal occurred) or if you explicitly ask for verbose output with go test -v. Brilliant! Nothing is more annoying than chatty test suites where your actual problem is buried in 2-3 megabytes of meaningless debug statements! And this solves the problem really elegantly. You can log as much debug info as you want and by default it will only surface if the test actually failed.

Filed under golang, go, testing

Golang: int is not a type

Today I ran into a very interesting compiler error in my Go program: int is not a type. And that although the same lines of code that now didn't compile had worked a few minutes earlier.

It took me 20 minutes to figure it out. I was already halfway through a forum post on the golangbridge forums and trying to put together a minimal example when I noticed the problem. Apparently I had a typo in my func init() and had inadvertently called it func int()!

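The Go compiler didn't even complain about me overriding one of its core types. That's because int is not a keyword but a predeclared identifier, so a package-level declaration with the same name simply shadows it.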
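Here is a small sketch of the mechanism. The commented-out line is the reconstruction of my typo; shadowLocally is a made-up function that shows the same kind of shadowing in a form that still compiles:

```go
package main

import "fmt"

// Uncommenting the next line reproduces the problem: a package-level function
// named int shadows the predeclared type int for the whole package.
// func int() {}

// shadowLocally shows the same mechanism in a scope where it still compiles.
func shadowLocally() string {
	len := "not the builtin anymore" // legal: len is an identifier, not a keyword
	return len
}

func main() {
	var n int = 3 // with func int() declared above, this line fails: int is not a type
	fmt.Println(n, shadowLocally())
}
```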

Filed under golang

Generating synthetic CPU load on Linux

While working on some alerting and metric collection about our infrastructure at Bitmovin I wanted to test out if the alerts I configured are actually triggered when a server experiences high CPU load.

I came across this beautiful Stack Overflow answer that did exactly what I needed:

seq 3 | xargs -P0 -n1 md5sum /dev/zero

This command will saturate 3 cores with 100% user load until you cancel the command with CTRL+C.

Filed under linux, devops, server

Compiling vim8 with python support on Ubuntu

Today I took a day off from work, so as always when I try some new stuff I ended up spending 2 hours on my Vim configuration before actually getting something done. Today's two hours were spent on getting Vim compiled with python3 support.

First off - do use Vim8 - it's awesome and do compile it from source. It's rather simple and saves you from outdated packages on Ubuntu :).

Now my issue today was that I tried enabling python2 and python3 support at the same time. For no apparent reason the following configuration did always result in a vim binary that thought it had python support - but didn't.

./configure --with-features=huge \
            --enable-multibyte \
            --enable-rubyinterp=yes \
            --enable-pythoninterp=yes \
            --with-python-config-dir=/usr/lib/python2.7/config-x86_64-linux-gnu \
            --enable-python3interp=yes \
            --with-python3-config-dir=/usr/lib/python3.5/config-3.5m-x86_64-linux-gnu \
            --enable-perlinterp=yes \
            --enable-luainterp=yes \
            --enable-cscope --prefix=/usr \
            --enable-fail-if-missing

Running vim --version resulted in +python/dyn and +python3/dyn so I thought - cool it's working.. Until I started vim and was greeted by:

Sorry, this command is disabled, the Python library could not be loaded.

To make things more interesting :echo has('python') did return 0 too - although the Vim was built with python support (and --enable-fail-if-missing is supposed to fail if python can't be linked).

So after trying around a bit and not getting anywhere I decided to just remove the python3 support from the configure line and voila - python is statically linked and working.. Yay!

./configure --with-features=huge \
            --enable-multibyte \
            --enable-rubyinterp=yes \
            --enable-pythoninterp=yes \
            --with-python-config-dir=/usr/lib/python2.7/config-x86_64-linux-gnu \
            --enable-perlinterp=yes \
            --enable-luainterp=yes \
            --enable-cscope --prefix=/usr \
            --enable-fail-if-missing

Filed under vim, python, tools

Configuring Kong health-checks in Kubernetes

The first rule of cloud computing should be: Always have a health check!

Why? Well - without them your cluster will not know if the application is actually up or still starting/terminating or anywhere in between. As long as there are livenessProbes and readinessProbes, Kubernetes can make sure no traffic gets routed to your app before it is really ready. And even more important: It will restart services and reschedule them once your health checks start going sideways.

But here is another insight into health checks: Do performance testing on them.

During the last couple of days I've had Kubernetes kill and restart perfectly healthy Kong API gateway pods because apparently the /status route in Kong does some pretty expensive queries on the backend. Kong apparently thinks it's cool to do a SELECT COUNT(*) on most of its tables to tell you how many consumers it has registered, how many oauth_tokens there are etc. All totally irrelevant information for a health check - but it's still the only endpoint I was able to hit that would actually terminate on Kong itself (anything else would also kill Kong if the upstream service is having a problem). And /status sounded like a reasonable endpoint for health-checking.

Now with Postgres those kinds of queries would not really be a terrible problem (still not good), but for Cassandra it's pretty catastrophic since it's not really meant to do aggregation queries without a partition key. Looking at the code reveals the problem: once there was some moderate pressure, the slow queries would time out, Kubernetes would think the Kong pod was dead (although it was still serving requests) and kill it. Yay!

So the solution here was to move away from an httpGet livenessProbe and readinessProbe to an exec probe. Exec probes are one of my favorite Kubernetes features - instead of doing network calls to check if something is up, Kubernetes will just do a docker exec and determine from the exit code of the executed program whether the pod is healthy or not.

And coincidentally Kong comes with a commandline utility called kong health that does exactly what it's named for - and is lightning fast with no database involved :).

Here is the relevant yaml configuration:

 readinessProbe:
   exec:
     command:
       - kong
       - health

Filed under kubernetes, devops
