Description
Hi,
My team bumped into failed to begin subsegment
panics when using SQLContext
together with setting limits on connection numbers and durations.
With aws-xray-sdk-go v1.8.0
the problem still occurs, but logs instead of panics (which is great!). But I wanted to check if we are misusing the SDK in anyway, and give you a heads up if there is some kind of race condition here.
Issue
When using a database connection created like:
instance, err = xray.SQLContext("postgres", url)
...
instance.SetMaxOpenConns(10)
instance.SetConnMaxLifetime(30 * time.Second)
Executing queries with instrumented contexts occasionally caused panics like:
"LOG panic: failed to begin subsegment named 'xyz@localhost': segment cannot be found."
From what I can tell, there is a small chance that database/sql
will try to open a fresh conn using context.Background()
, instead of using the caller context, or re-using an existing connection.
Reprod
I managed to come up with this slightly contrived program (tons of connections and very low lifetimes) that reliably triggers the failure on my machine:
package main
import (
"context"
"database/sql"
"log"
"net"
"time"
"github.com/aws/aws-xray-sdk-go/xray"
// "sqlite" database driver.
_ "modernc.org/sqlite"
)
func initDB() (*sql.DB, error) {
db, err := xray.SQLContext("sqlite", "tmp.sqlite")
if err != nil {
return nil, err
}
db.SetMaxOpenConns(10)
db.SetConnMaxLifetime(time.Millisecond)
return db, nil
}
func oneRequest(db *sql.DB) error {
var err error
ctx, seg := xray.BeginSegment(context.Background(), "main")
defer seg.Close(err)
_, err = db.ExecContext(ctx, "select sqlite_version();")
if err != nil {
return err
}
return nil
}
func xraySink() {
pc, err := net.ListenPacket("udp", ":2000")
if err != nil {
log.Fatalf("listen: %s", err)
}
defer pc.Close()
buf := make([]byte, 1024)
for {
pc.ReadFrom(buf)
}
}
func main() {
// Give X-ray something to log to.
go xraySink()
db, err := initDB()
if err != nil {
log.Fatalf("open DB: %s", err)
}
start := make(chan struct{})
for i := 0; i < 10_000; i++ {
go func(i int) {
<-start
err := oneRequest(db)
if err != nil {
log.Fatalf("oneRequest failed: %s", err)
}
}(i)
}
close(start)
time.Sleep(time.Second)
}
The stacktrace at the failed subsegment is:
0 0x00000001029a3e90 in github.com/aws/aws-xray-sdk-go/xray.BeginSubsegment
at /Users/viktor/go/pkg/mod/github.com/aws/[email protected]/xray/segment.go:283
1 0x000000010299db84 in github.com/aws/aws-xray-sdk-go/xray.Capture
at /Users/viktor/go/pkg/mod/github.com/aws/[email protected]/xray/capture.go:19
2 0x00000001029ae85c in github.com/aws/aws-xray-sdk-go/xray.(*driverConnector).Connect
at /Users/viktor/go/pkg/mod/github.com/aws/[email protected]/xray/sql_go110.go:59
3 0x0000000102267f58 in database/sql.(*DB).openNewConnection
at /Users/viktor/local/go/src/database/sql/sql.go:1238
4 0x0000000102267ec0 in database/sql.(*DB).connectionOpener
at /Users/viktor/local/go/src/database/sql/sql.go:1228
5 0x0000000102265a00 in database/sql.OpenDB.func1
at /Users/viktor/local/go/src/database/sql/sql.go:792
6 0x00000001021a47e4 in runtime.goexit
at /Users/viktor/local/go/src/runtime/asm_arm64.s:1172
So its the connectionOpener
goroutine that gets a request for another connection, and tries to do it with context.Background()
.