Recently I encountered a bug that ground my .NET Core application to a halt.
There were no unusual log messages preceding the outage, so my only option was to use the .NET diagnostic tools. I want to share, step by step, how I found the answers, so that you can follow the whole diagnostic process, which is not straightforward without prior experience.
The app stopped
My application had been working fine for a long time, but at some point I noticed that it had stopped functioning. The log messages gave me no additional information; the app looked as if it was simply frozen. Before terminating the application process, I did some investigation.
I used the dotnet-counters tool to get a high-level picture of my application’s state.
I noticed a huge number of work items in the thread pool queue (3577) and a growing number of threads in the pool (the picture above shows 424 running threads at the moment of monitoring). Something was definitely wrong, so I decided to dig deeper into the internals with the dotnet-trace and dotnet-dump tools.
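For reference, a monitoring session like the one described above can be started roughly as follows. This is only a sketch: the helper name `watch_counters` is hypothetical, and it assumes dotnet-counters has been installed as a global tool.

```shell
# Hypothetical helper: watch the live runtime counters of one .NET process.
# Assumes the tool was installed beforehand with:
#   dotnet tool install --global dotnet-counters
watch_counters() {
  # $1 is the process ID; `dotnet-counters ps` lists candidate processes.
  # With no counter list given, the tool is expected to fall back to the
  # System.Runtime set, which includes the thread pool queue length and
  # the thread pool thread count.
  dotnet-counters monitor --process-id "$1"
}
```

Calling `watch_counters 12345` (12345 is a placeholder PID) keeps refreshing the values until you stop it, which makes a steadily growing queue length easy to spot.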
I ran the dotnet-trace tool to capture what the threads were doing. It samples each thread’s stack every millisecond for a user-defined period of time (I collected traces for 20 seconds) and exports this information to a trace.nettrace file, which can be analyzed further in PerfView.
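A 20-second capture like this one could be scripted along these lines. Treat it as a sketch rather than the exact command I ran: `capture_trace` is a hypothetical helper name, and the flags follow the current dotnet-trace documentation, so older versions of the tool may differ.

```shell
# Hypothetical helper: sample a running .NET process for 20 seconds.
# Assumes dotnet-trace is installed as a global tool.
capture_trace() {
  # $1 is the process ID of the target application.
  # --duration takes a dd:hh:mm:ss timespan, so this stops after 20 seconds.
  # --output names the file that is later opened in PerfView.
  dotnet-trace collect --process-id "$1" \
    --duration 00:00:00:20 \
    --output trace.nettrace
}
```

Without `--duration` the collection runs until you stop it manually, which also works for an ad-hoc investigation.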
Once I opened the thread time analysis view, I saw that 98% of the threads were stuck in the ConsoleSink.Emit method. I opened the source code of Serilog’s Console sink and suspected that this was a locking issue. To find out which thread was holding the lock and why it was not releasing it, I decided to collect a dump of the process with the dotnet-dump tool.
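Collecting the dump is a single command. In the sketch below, `collect_dump` and the output path in the usage note are hypothetical names chosen for illustration; it assumes dotnet-dump is installed as a global tool.

```shell
# Hypothetical helper: write a dump of a running .NET process to a file.
collect_dump() {
  # $1 is the process ID, $2 is the path of the dump file to create.
  dotnet-dump collect --process-id "$1" --output "$2"
}
```

For example, `collect_dump 12345 ./app.dmp` would write the dump next to the current directory. Note that the process is paused while the dump is being written.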
I suspected a locking issue, so I used the syncblk command to look at the synchronization block information.
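The syncblk command is normally typed into the interactive prompt that `dotnet-dump analyze` opens. It can also be run in one shot, roughly like this; the helper name `show_sync_blocks` is hypothetical.

```shell
# Hypothetical helper: print the sync block table of a dump and exit.
show_sync_blocks() {
  # $1 is the path to the dump file.
  # Each --command runs one SOS command on start, in the given order;
  # the final "exit" leaves the analyzer instead of opening the prompt.
  dotnet-dump analyze "$1" --command "syncblk" --command "exit"
}
```

The output lists each held lock together with information about the owning thread, which is what points at the thread to inspect next.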
I saw that a single object was locked by thread 256. To inspect this thread’s call stack, I selected it as the current thread for SOS commands and ran the clrstack command.
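These two steps can be chained in the same one-shot style. This is a sketch under assumptions: `show_thread_stack` is a hypothetical helper, and it assumes the number reported for the owning thread is the identifier that setthread expects. If it is not, running `threads` inside the analyzer shows the available thread list.

```shell
# Hypothetical helper: print the managed call stack of one thread in a dump.
show_thread_stack() {
  # $1 is the path to the dump file, $2 is the thread to select.
  # setthread makes that thread current for later SOS commands,
  # clrstack then prints its managed stack, and exit closes the analyzer.
  dotnet-dump analyze "$1" \
    --command "setthread $2" \
    --command "clrstack" \
    --command "exit"
}
```

In my case the call would have looked like `show_thread_stack ./app.dmp 256`, where the dump path is a placeholder.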
As I understood it, the thread that was handling a diff message asked Serilog’s console sink to log the data and got stuck in the WriteFile method. Since that thread was still holding the lock, every other thread that tried to log had to wait behind it, which is consistent with the growing thread pool queue. This explained why the app had halted and indicated that the issue was not in my code. I found a similar bug report on GitHub and attached these details to the issue to help the maintainers investigate the problem further.
So these were the diagnostic steps I took to figure out why my application had stopped working. Of course, this is not a recipe for every diagnostic investigation, since the steps depend heavily on the individual case. You can find more examples on the GitHub page. For a better understanding of .NET memory analysis and diagnostics, you may find the mem-doc useful.
Ultimately, I hope you got the basic idea behind the process and enjoyed the read. If you have any questions, feel free to ask!
What about Linux and Docker?
I did all the diagnostics against the app running on my local machine on Windows. The same tools and steps apply to .NET Core apps hosted on Linux.
You can also embed the tools into a Docker image to get the same diagnostic capabilities inside Docker containers.
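One possible way to do that is to install the tools into a fixed directory while the image is being built, for example from a build step that has the .NET SDK available, and then copy that directory into the final image. The sketch below is an assumption about how you might set this up, not a description of my own image: `install_diag_tools` and the `/tools` path are hypothetical.

```shell
# Hypothetical helper: install the diagnostic tools into one directory
# so that the directory can be copied into a container image.
install_diag_tools() {
  # $1 is the target directory, for example /tools.
  for tool in dotnet-counters dotnet-trace dotnet-dump; do
    # --tool-path installs the tool into the given directory
    # instead of the per-user global tools location.
    dotnet tool install --tool-path "$1" "$tool" || return 1
  done
}
```

After `install_diag_tools /tools`, the executables live under `/tools` and can be run from inside the container against the application process.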