Created on 05.06.2019

Inside stories from the developers: a debugging timeline

Bugs – they always catch us off guard and can be a real nuisance. Especially when dealing with new technologies, they are sometimes almost inevitable. Simon Vogt, a software engineer at PostFinance, has experience in this area – as his example shows.

The bug bursts in on the meeting.

It’s just coming up to midday on a Thursday. I’m in a meeting at our Mingerstrasse site in Berne, where some of our IT is located, when I get an e-mail from Jan, the product owner I’m working closely with on an innovation project. The subject line reads: “RE: No data”. The e-mail itself is just as brief: after the Apollo 13 quote “Houston, we have a problem!” there is a link to an evaluation. No data available. The evaluation in the (internal) production version of our application, called “brikks”, is empty, which shouldn’t be the case and has never been the case before. I’ve only just got over the initial shock when a message pops up in the chat window: “Simon, have you just deployed the software in the production environment?” No, of course not.

Was it a moment of carelessness?

My heart is pounding, my head is spinning. Maybe I did deploy it? But how could that have happened? Was I just being careless? It all seems so strange to me. After all, the production version went live a month ago and has been working just fine so far, without any major deployment issues. Admittedly, the current code base contains sensitive database migrations that remove obsolete data from the database. Maybe something went wrong there? That would be really embarrassing.

That’s impossible.

I message Jan in the chat saying I’ll call him back. I then leave the meeting early, get on my bike and head to Engehalde, my usual place of work. By this time I’m already on the phone with Jan. He asks me again whether I deployed the software to production this morning. The current state of the master branch is apparently live. I am now certain I did not make any changes, so I reply: “No, not that I know of! Have a look in the deployment log to see who deployed it.” Jan’s a step ahead of me. “According to the log, the last production deployment was two weeks ago.”

Our very first inkling

We both have a quick think. Jan tells me he received an error message from the orchestration platform Kubernetes that very morning, and he asks whether I might have had anything to do with it. Oh dear. This is when I start to suspect something, and I tell Jan we are still referring to the :latest tag of our Docker image in our Kubernetes set-up. Jan goes on: “You mean Kubernetes automatically downloaded the latest Docker image after the error messages from this morning?” We both think to ourselves: yes, that could be it!

“It’s not a bug, it’s a feature” ... ;-)

And it turns out to be correct. Not long after, when we’re back at work, the first thing we do is tackle the empty evaluation issue. Fortunately, and to our amazement, we are able to resolve the problem in the user interface because it was not directly related to the deployment issue. We then send a notice to our users, explaining that a new version of brikks has been deployed. “It’s not a bug, it’s a feature” being the general idea we go with... ;-)

Background information about the bug and the debugging process

At PostFinance, we work with new technologies, and this includes Kubernetes. Because we did not want to needlessly complicate our very first experiences with this orchestration platform, we deliberately kept our deployment set-up as simple as possible. We also wanted to go live as soon as possible in order to quickly collect customer feedback and interactions. In doing so, we knowingly accepted the risk described in the following official warning: “Note: You should avoid using the :latest tag when deploying containers in production as it is harder to track which version of the image is running and more difficult to roll back properly.”

We did record an improvement in the backlog, but we did not prioritize it. As a consequence, we ended up having to deal with a critical bug instead of the improvement we had planned. Maybe you’re now thinking: “What about the PSP, the potentially shippable product?” Exactly. For every change in the master branch, we essentially have a PSP that can be deployed. In this instance, however, we would have preferred to roll out the changes in a controlled manner, because we had also made various changes to our user interface that we would have liked to notify our users about in advance.

What we learnt from the bug

We now give our Docker images explicit tags, which allowed us to drop the :latest tag from the deployment configuration. As for why the Kubernetes pods restarted, that is something we didn’t manage to find out. Maybe the Chaos Monkey was up to its old tricks again. In effect, we had a poor configuration that we simply made do with at the beginning. It is fortunate we were still able to resolve the problem early on.
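
A likely reason a mere pod restart was enough to bring a new version live is the way Kubernetes chooses a default pull policy: if imagePullPolicy is omitted, it defaults to Always for an image tagged :latest (or with no tag at all) and to IfNotPresent for any other tag or a digest. The Python sketch below models only that defaulting rule. It assumes a deployment that leaves imagePullPolicy unset, which is not confirmed by the set-up described here, and the image names and registry host are hypothetical.

```python
from typing import Optional, Tuple


def split_image_reference(image: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split an image reference into (name, tag, digest)."""
    name, _, digest = image.partition("@")
    # Only a colon after the last slash marks a tag; an earlier colon
    # belongs to a registry port such as "registry.example.com:5000".
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    tag = None
    if colon > last_slash:
        name, tag = name[:colon], name[colon + 1:]
    return name, tag, digest or None


def default_pull_policy(image: str) -> str:
    """Model the pull policy Kubernetes applies when imagePullPolicy is omitted."""
    _, tag, digest = split_image_reference(image)
    if digest is not None:
        return "IfNotPresent"
    if tag is None or tag == "latest":
        # The image is pulled again on every container start, so a
        # restart can silently pick up a newer build.
        return "Always"
    return "IfNotPresent"


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    for image in ("brikks:latest", "brikks", "registry.example.com:5000/brikks:1.4.2"):
        print(image, "->", default_pull_policy(image))
```

With a fixed tag, a restarted pod starts the version named in the deployment configuration, and a roll-out only happens when someone changes that tag on purpose.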

As a developer at PostFinance, you work with a lot of personal responsibility in an agile environment and in a team that fosters a constructive approach to errors. Interested? We look forward to receiving your application.
