2017-02-05 Problem Found!

The main difficulty in identifying the cause of the hypervisor issue was that it was not reliably reproducible: most of the time the crash only happened after several weeks of usage. We have now found a setup that reproduces the freeze within one hour. Our working theory was that a so-called race condition occurred somewhere in the hypervisor kernel. Race conditions cause seemingly random crashes, as in our case on a timescale of a few weeks. The key is to generate enough load that the race condition is triggered much faster. We tried several things to provoke the race condition without success, but by using the I/O load generator fio we were finally able to reproduce the crash within one hour, which allowed us to test different kernels and settings.

The setup for identifying the race condition

Two VMs were set up:

  1. Alpine Linux 
  2. Windows 10

On the hypervisor itself we started the fio tool (https://github.com/axboe/fio) with the following command line:


Code Block
fio --name=test --rw=randrw --size=800M --numjobs=5 --time_based --runtime=8h --direct=1 --alloc-size=4096

In the Alpine VM we started another fio instance:

Code Block
fio --name=test --rw=randrw --size=300M --numjobs=5 --time_based --runtime=8h --direct=1 --alloc-size=512

This means five jobs generate random read/write accesses to the SSD, both inside a VM and outside in dom0, i.e. on the hypervisor itself. The Windows VM is just a test VM to verify that a second VM keeps running alongside the main VM.

Within at most one hour, this setup tended to produce a kernel panic, a freeze of the hypervisor, or both.

What is the problem now?

After trying out several things, we can finally say that the main problems are the use of swap and memory overcommitting, probably in combination with the fair (CFQ) I/O scheduler in the Linux kernel. After disabling swap and memory overcommitting and switching to the noop I/O scheduler, no crash occurred within 8 hours of the fio load described above.
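A minimal sketch of how these mitigations can be applied at runtime on the hypervisor is shown below; the device name sda and the chosen overcommit value are assumptions and have to be adapted to the actual system:

Code Block
# Disable swap on all swap devices (assumption: no other workload depends on swap)
swapoff -a
# Disallow memory overcommitting (vm.overcommit_memory=2 enforces strict accounting; assumed value)
sysctl -w vm.overcommit_memory=2
# Switch the I/O scheduler of the SSD to noop (assumption: the SSD is /dev/sda)
echo noop > /sys/block/sda/queue/scheduler

To make the changes persistent across reboots, the swap entries would additionally have to be removed from /etc/fstab and the sysctl and scheduler settings added to the boot configuration.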

2017-01-10 Current Hypervisor State is BETA

We are experiencing a major problem with the beroNet Hypervisor operating system: after several days, sometimes weeks, of production use the appliance ends up in a frozen state. The freeze is the result of a Linux kernel panic that taints the kernel; after a while the kernel locks up completely. Unfortunately we did not detect this issue during testing, as it does not happen immediately. The issue was not present in version 0.9.7 and below. We know this because all customers that use version 0.9.7 together with the beroNet Cloud have uptimes of more than 100 days, sometimes even more than 200 days.
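Whether the running kernel has already been tainted by such a panic can be checked on the hypervisor before the lock-up occurs; a non-zero value is a bitmask of taint flags (a sketch, the grep pattern is only an example):

Code Block
# 0 means the kernel is not tainted; any other value indicates taint flags are set
cat /proc/sys/kernel/tainted
# Look for panic, oops or taint messages in the kernel log
dmesg | grep -iE "panic|oops|taint"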

...