I’ve recently been trying to contribute more to the Haskell ecosystem, my focus
currently being on Haskell Language Server. I still remember the good (ha!) ol’
days when my main method of navigating Haskell code was through sheer memory,
grep and hoogle, repeatedly running cabal build to continue on to the next
compile error. In comparison, HLS is like a refreshing glass of cold water
in a desert.

HLS’s test suite was, unfortunately, quite flaky. It’s now a bit better. I’ll
expand on two sources of flakiness I took down as part of this issue.
The errors I’d see while making PRs looked like the following. I’ve omitted the
concrete tests that failed, because these errors showed up intermittently all
over the test suite.

> Failure 1
Exception: Language server unexpectedly terminated
> Failure 2
hPutBuf: resource vanished (Broken pipe)
> Failure 3
Segmentation fault
Wait, how do I open this door?
The first error comes from lsp-test, the framework used to write tests
in HLS. It provides a minimal editor-like interface that can be used to send
sequences of LSP messages to an underlying language server implementation. In
HLS, an instance of the language server is spawned on a thread and communication
with lsp-test occurs through pipes.

    runSessionWithTestConfig ... =
      ...
      ((inR, inW), (outR, outW)) <- (,) <$> createPipe <*> createPipe
      server <- async $ defaultMain arguments { argsHandleIn = pure inR, argsHandleOut = pure outW }
      result <- runSessionWithHandles inW outR ...
      hClose inW
      timeout 3 (wait server) >>= ... cancel server

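To make the plumbing concrete, here is a minimal sketch of the same two-pipe arrangement with a toy server standing in for HLS. The names toyServer and pipeDemo are hypothetical and only illustrate the shape of the setup; this is not how lsp-test or HLS are implemented.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.Char (toUpper)
import System.IO
  ( BufferMode (LineBuffering), Handle, hClose, hGetLine, hPutStrLn, hSetBuffering )
import System.Process (createPipe)

-- Hypothetical stand-in for the language server: read one line from its
-- input handle and reply with the same line in upper case.
toyServer :: Handle -> Handle -> IO ()
toyServer hIn hOut = do
  request <- hGetLine hIn
  hPutStrLn hOut (map toUpper request)

-- The "test framework" side: one pipe per direction, server on its own thread.
pipeDemo :: IO String
pipeDemo = do
  (inR, inW) <- createPipe    -- client writes to inW, server reads from inR
  (outR, outW) <- createPipe  -- server writes to outW, client reads from outR
  -- Pipe handles are block-buffered by default; flush on every newline instead.
  mapM_ (`hSetBuffering` LineBuffering) [inW, outW]
  done <- newEmptyMVar
  _ <- forkIO (toyServer inR outW >> putMVar done ())
  hPutStrLn inW "hello"
  reply <- hGetLine outR
  takeMVar done
  mapM_ hClose [inR, inW, outR, outW]
  pure reply
```

Running pipeDemo from GHCi or from a main function should return "HELLO". Every handle here is closed explicitly at the very end, which is exactly the step that is easy to get wrong once the server lives on another thread.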
The first two failures both actually have the same origin. They occur when
lsp-test tries to read an LSP message from the language server, but discovers
the pipe’s been closed (or symmetrically, the write-end discovers the read-end’s
been closed). I spent quite a bit of time looking for something that doesn’t
exist, an explicit call to close the write-end of the handle. Internally, HLS
spawns ghcide on a thread, which correctly doesn’t close the handle it’s been
passed.

The fun bit that didn’t occur to me is what’s written as a footnote in the
documentation of Handle. (It’s so normal to see withFile* functions or explicit
hClose that it’s easy to forget what happens if you don’t use those functions.)
In GHC, a handle with no references is closed when it is GC’d. This opens up the
possibility of a race condition during LSP shutdowns. Consider the following
timeline of events. lsp-test initiates the shutdown sequence, and ghcide confirms
and sends the notification that it’s exiting. The server thread then exits, and
both the thread and the handles given to it become eligible for GC, which
closes the handles. If lsp-test hasn’t stopped reading from its end of the pipe
yet, it reads an unexpected EOF and crashes.
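The reader's side of that timeline can be shown deterministically. The sketch below is not a reproduction of the GC race: it closes the write end with an explicit hClose as a stand-in for the finalizer, purely to show what the reader observes afterwards. The name readAfterWriterClosed is hypothetical.

```haskell
import Control.Exception (try)
import System.IO (hClose, hGetLine)
import System.IO.Error (isEOFError)
import System.Process (createPipe)

-- Returns True when reading from a pipe whose write end is already closed
-- fails with an end-of-file error.
readAfterWriterClosed :: IO Bool
readAfterWriterClosed = do
  (readEnd, writeEnd) <- createPipe
  -- Assumption for illustration: an explicit close plays the role of the
  -- implicit close performed when an unreferenced Handle is collected.
  hClose writeEnd
  result <- try (hGetLine readEnd)
  hClose readEnd
  pure (either isEOFError (const False) result)
```

In the real failure nobody calls hClose; the close happens whenever the collector gets around to it, which is why the error only shows up some of the time.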
The fix here is the infamous GHC touch# function, or its more modern equivalent,
keepAlive#. Both act as signals to the GHC RTS that a value should be considered
alive at that point in the code.

    runSessionWithTestConfig ... =
      ...
      pipes@((inR, inW), (outR, outW)) <- (,) <$> createPipe <*> createPipe
      keepAlive pipes $ do
        server <- async $ defaultMain arguments { argsHandleIn = pure inR, argsHandleOut = pure outW }
        result <- runSessionWithHandles inW outR ...
        hClose inW
        timeout 3 (wait server) >>= ... cancel server

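For readers who haven't met the primop before, a lifted wrapper around keepAlive# might look roughly like the sketch below. This is an assumption about how such a helper could be written, not necessarily the keepAlive that HLS uses, and it needs a GHC recent enough to export keepAlive# from GHC.Exts. The keepAliveDemo function is hypothetical; it uses a weak pointer to an IORef to observe that the value survives a major GC inside the protected block.

```haskell
{-# LANGUAGE MagicHash #-}

import Data.IORef (mkWeakIORef, newIORef)
import Data.Maybe (isJust)
import GHC.Exts (keepAlive#)
import GHC.IO (IO (..))
import System.Mem (performMajorGC)
import System.Mem.Weak (deRefWeak)

-- Sketch of a lifted wrapper: the first argument is kept alive for as long
-- as the given IO action is running.
keepAlive :: a -> IO r -> IO r
keepAlive x (IO action) = IO (\s -> keepAlive# x s action)

-- Force a major GC while the IORef is protected and check, through a weak
-- pointer, that it has not been collected.
keepAliveDemo :: IO Bool
keepAliveDemo = do
  ref <- newIORef (0 :: Int)
  weak <- mkWeakIORef ref (pure ())
  keepAlive ref $ do
    performMajorGC
    isJust <$> deRefWeak weak
```

keepAliveDemo should return True, since the IORef is guaranteed to be reachable for the whole block. Wrapping the pipes the same way means their finalizers cannot run until the session has finished with them.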
The implicit close was quite confusing, so I created a similar reproducing
example in the appendix that results in the same implicit handle closes.

gdb? Funsie
Segfaults in Haskell are a pretty scary thing to see. The good thing is that I
could reproduce these locally. Using a small bash script to repeatedly run the
failing test reproduces the segfault pretty quickly. (I only had to run 2
repetitions to get the segfault when writing this post.)

    ghcide
    constructor hover (#2904)
    Constructors.hs
    ...
    E: line 22: 145581 Segmentation fault (core dumped) "$@"

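I used bash for the loop, but the same rerun-until-it-breaks idea can be sketched in Haskell. The functions runUntilFailure and rerunCommand below are hypothetical helpers, and the command you pass in is whatever runs your failing test; nothing here is specific to the HLS test suite.

```haskell
import System.Exit (ExitCode (..))
import System.Process (system)

-- Run an action up to maxRuns times and stop at the first unsuccessful exit
-- code, reporting which attempt failed. Nothing means every run succeeded.
runUntilFailure :: Int -> IO ExitCode -> IO (Maybe (Int, ExitCode))
runUntilFailure maxRuns run = go 1
  where
    go n
      | n > maxRuns = pure Nothing
      | otherwise = do
          code <- run
          case code of
            ExitSuccess -> go (n + 1)
            failure -> pure (Just (n, failure))

-- Convenience wrapper: rerun a shell command (passed to the system shell).
rerunCommand :: Int -> String -> IO (Maybe (Int, ExitCode))
rerunCommand maxRuns cmd = runUntilFailure maxRuns (system cmd)
```

For example, rerunCommand 100 with your test command as the second argument returns the number of the first failing attempt, which gives a rough feel for how flaky a test really is.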
You may need to compile with -g3 to get debug symbols. Pointing gdb at the
core dump immediately gives a lead, showing a trace full of references to
sqlite.

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0 0x00000000062ea359 in findElementWithHash ()
    [Current thread is 1 (LWP 207720)]
    (gdb) bt
    #0 0x00000000062ea359 in findElementWithHash ()
    #1 0x000000000632ebe5 in sqlite3FindTable ()
    #2 0x00000000063b0620 in sqlite3LocateTable ()
    ...
    #11 0x000000000639932a in sqlite3_prepare_v2 ()

My first intuition was that a connection was probably being used after it was
freed, as I’d worked with the direct-sqlite library before and remembered
needing to manually clean up connections. I was somewhat familiar with how HLS
uses sqlite, as I’d contributed performance improvements related to indexing.

A bit of searching in the codebase for initialization gives the following
with-wrapper that handles cleanup of resources, including sqlite connections in
hieDb, via runWithDb.

    runWithWorkerThreads ... = evalContT $ do
      (WithHieDbShield hieDb, indexQueue) <- runWithDb ...
      restartQueue <- withWorkerQueueSimple ...
      loaderQueue <- withWorkerQueueSimple ...
      liftIO $ f hieDb (ThreadQueue indexQueue restartQueue loaderQueue)

It’s hard to spot the error without another bit of information. The continuation
given to runWithWorkerThreads, the f, uses background processing to
asynchronously handle LSP requests. Specifically, this function does not shut
down the shake session, leaving that background work alive to keep using the
sqlite connection in hieDb, even after the connection has been cleaned up.

The fix is to register the session shutdown as a cleanup step right after the
database is acquired:

    runWithWorkerThreads ... = evalContT $ do
      (WithHieDbShield hieDb, indexQueue) <- runWithDb ...
      -- note that we're in ContT, shutdown happens bottom->up
      ContT $ \action -> action () `finally` shutdownSession
      restartQueue <- withWorkerQueueSimple ...
      loaderQueue <- withWorkerQueueSimple ...
      liftIO $ f hieDb (ThreadQueue indexQueue restartQueue loaderQueue)

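The bottom-to-top ordering is easy to check in isolation. Here is a minimal sketch using ContT from transformers, with a hypothetical withResource helper that only logs when it opens and closes. The resource names are made up for illustration and none of this is HLS code.

```haskell
import Control.Exception (finally)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Cont (ContT (..), evalContT)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Record an event; the log is kept newest-first and reversed when read.
logEvent :: IORef [String] -> String -> IO ()
logEvent ref msg = modifyIORef' ref (msg :)

-- Hypothetical scoped resource: log on acquire, log again on release.
withResource :: IORef [String] -> String -> ContT r IO ()
withResource ref name = ContT $ \action -> do
  logEvent ref ("open " ++ name)
  action () `finally` logEvent ref ("close " ++ name)

-- Acquire two resources in order, do some work, and return the event log.
cleanupOrderDemo :: IO [String]
cleanupOrderDemo = do
  ref <- newIORef []
  evalContT $ do
    withResource ref "db"
    withResource ref "session"
    liftIO (logEvent ref "work")
  reverse <$> readIORef ref
```

cleanupOrderDemo returns the events "open db", "open session", "work", "close session", "close db", in that order. Whatever is acquired later in the ContT block is released first, so a cleanup step placed below runWithDb runs before the database connection is closed.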
With the extra cleanup step in runWithWorkerThreads, the shake session is shut down before the sqlite connection is
closed, resolving the segfault. While debugging race conditions is tricky,
there’s something deeply satisfying about solving them.