I’m not sure what is causing it, but my Haswell system has started to freeze up when running “wspy --config topdown.config”. This started after I updated the system and resumed running benchmarks following a month on the road.
Some additional diagnosis and items I’ve tried:
- Observed that the hangs also happened in a debugger while single-stepping, so I investigated how many single-steps it took before the hang. That may have been a false lead, as a single step in the middle of fopen(3C) suddenly jumps ahead. However, along the way I changed my strtok() calls to strtok_r() to make sure nothing strange was happening with the recursive open_config_file()/parse_command_line() calls. These were set up to be tail-recursive, so it shouldn’t matter, but I cleaned them up anyway.
- Next observed that the failure seemed to happen while setting up performance counters. Created a small test program that set up the same performance counters, and it didn’t hang.
- Looked through the logs in /var/log and didn’t see any smoking guns. The kernel completely locks up, even for other logged-in processes, so even if the program is faulty, the kernel is vulnerable to being locked up by it.
- Further analysis started with using grub to boot an older kernel. In the process, I discovered that I was running under the Xen hypervisor. Booting into bare metal fixed the problem. I’ve added a check to my 123.sh script that calls wspy. Not sure why this hung the system, but I don’t have virtual counters enabled, so the counters shouldn’t have worked under Xen in the first place. TODO: look at more robust error detection in wspy to avoid stumbling into this again.
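On the strtok()/strtok_r() change in the first item above: the sketch below is not the wspy code, just a minimal illustration of why the reentrant form is safer when one tokenizing loop calls into another. The function names count_words() and count_lines_and_words() are made up for this example. Each level keeps its own save pointer, so the inner parse cannot disturb the outer one, which is what would happen with the single hidden state that strtok() uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not from wspy: splits one line into
 * whitespace-separated words and returns how many were found. */
static int count_words(char *line)
{
    char *save = NULL;   /* parser state private to this call */
    int n = 0;

    assert(line != NULL);
    for (char *w = strtok_r(line, " \t", &save); w != NULL;
         w = strtok_r(NULL, " \t", &save))
        n++;
    return n;
}

/* Walks the lines of a config-like buffer and tokenizes each line again
 * inside the loop. With plain strtok() the inner call would overwrite
 * the outer position; with strtok_r() each level has its own state.
 * The buffer is modified in place. Returns the number of non-empty
 * lines and stores the total word count in *words. */
int count_lines_and_words(char *text, int *words)
{
    char *save = NULL;   /* parser state for the outer, per-line loop */
    int lines = 0;

    assert(text != NULL && words != NULL);
    *words = 0;
    for (char *l = strtok_r(text, "\n", &save); l != NULL;
         l = strtok_r(NULL, "\n", &save)) {
        lines++;
        *words += count_words(l);
    }
    return lines;
}
```

Note that the input has to be a writable buffer, since strtok_r() inserts terminators as it goes.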
Conclusion: running on bare metal fixed the issue.
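For the TODO on more robust detection, one possible approach is to have the program itself refuse to touch the counters when it finds itself under Xen. The sketch below is an assumption about how that might look, not what wspy does today. It relies on the Linux sysfs file /sys/hypervisor/type, which reports "xen" under Xen and is normally absent on bare metal. The function names hypervisor_is_xen() and counters_look_safe() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at `path` starts with "xen", 0 otherwise
 * (including when the file does not exist, as on bare metal).
 * The path is a parameter so the check can be tried on a test file. */
int hypervisor_is_xen(const char *path)
{
    char buf[32] = "";
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;
    if (fgets(buf, sizeof buf, fp) == NULL)
        buf[0] = '\0';
    fclose(fp);
    return strncmp(buf, "xen", 3) == 0;
}

/* Hypothetical guard to call before any performance counter setup.
 * Returns 1 if it looks safe to continue, 0 if running under Xen. */
int counters_look_safe(void)
{
    if (hypervisor_is_xen("/sys/hypervisor/type")) {
        fprintf(stderr,
                "running under Xen; not setting up performance counters\n");
        return 0;
    }
    return 1;
}
```

This only catches Xen. Other hypervisors would need their own checks, and it does not explain the hang itself; it just keeps the program from walking into it.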