Finding Ticketbleed

Ticketbleed (CVE-2016-9244) is a software vulnerability in the TLS stack of certain F5 products that allows a remote attacker to extract up to 31 bytes of uninitialized memory at a time, which can contain any kind of random sensitive information, like in Heartbleed.

If you suspect you might be affected by this vulnerability, you can find details and mitigation instructions at ticketbleed.com (including an online test) or in the F5 K05121675 article.

ticketbleed

In this post we'll talk about how Ticketbleed was found, verified and reported.

JIRA RG-XXX

It all started with a bug report from a customer using Cloudflare Railgun.

rg-listener <> origin requests fail with "local error: unexpected message"

A PCAP of the rg-listener <> origin traffic is attached and shows a TLS alert being triggered during the handshake.

Worth noting the customer is using an F5 Load Balancer in front of the Railgun and the Origin Web Server:
visitor > edge > cache > rg-sender > F5 > rg-listener > F5 > origin web server

Matthew was unable to replicate by using a basic TLS.Dial in Go so this seems tricky so far.

A bit of context on Railgun: Railgun speeds up requests between the Cloudflare edge and the origin web site by establishing a permanent optimized connection and performing delta compression on HTTP responses.

railgun

The Railgun connection uses a custom binary protocol over TLS, and the two endpoints are Go programs: one on the Cloudflare edge and one installed on the customer servers. This means that the whole connection goes through the Go TLS stack, crypto/tls.

That connection failing with local error: unexpected message means that the customer’s side of the connection sent something that confused the Go TLS stack of the Railgun running on our side. Since the customer is running an F5 load balancer between their Railgun and ours, this points towards an incompatibility between the Go TLS stack and the F5 one.

However, when my colleague Matthew tried to reproduce the issue by connecting to the load balancer with a simple Go crypto/tls.Dial, it succeeded.

PCAP diving

Since Matthew sits at a desk opposite of mine in the Cloudflare London office, he knew I've been working with the Go TLS stack for our TLS 1.3 implementation. We quickly ended up in a joint debugging session.

Here's the PCAP we were staring at.

So, there's the ClientHello, right. The ServerHello, so far so good. And then immediately a ChangeCipherSpec. Oh. Ok.

A ChangeCipherSpec is how TLS 1.2 says "let's switch to encrypted". The only way a ChangeCipherSpec can come this early in a 1.2 handshake, is if session resumption happened.

And indeed, by focusing on the ClientHello we can see that the Railgun client sent a Session Ticket.

A Session Ticket carries some encrypted key material from a previous session to allow the server to resume that previous session immediately instead of negotiating a new one.

resumption

To learn more about session resumption in TLS 1.2, watch the first part of the Cloudflare Crypto Team TLS 1.3 talk, read the transcript, or the "TLS Session Resumption" post on the Cloudflare blog.

After that ChangeCipherSpec both Railgun and Wireshark get pretty confused (HelloVerifyRequest? Umh?). So we have reason to believe the issue is related to Session Tickets.

In Go you have to explicitly enable Session Tickets on the client side by setting a ClientSessionCache. We verified that indeed Railgun uses this functionality and wrote this small test:

package main

import (
    "crypto/tls"
)

func main() {
    conf := &tls.Config{
        InsecureSkipVerify: true,
        ClientSessionCache: tls.NewLRUClientSessionCache(32),
    }

    conn, err := tls.Dial("tcp", "redacted:443", conf)
    if err != nil {
        panic("failed to connect: " + err.Error())
    }
    conn.Close()

    conn, err = tls.Dial("tcp", "redacted:443", conf)
    if err != nil {
        panic("failed to resume: " + err.Error())
    }
    conn.Close()
}

And sure enough, local error: unexpected message.

crypto/tls diving

Once I had it reproduced in a local crypto/tls it became a home game. crypto/tls error messages tend to be short of details, but a quick tweak allows us to pinpoint where they are generated.

Every time a fatal error occurs, setErrorLocked is called to record the error and make sure that all following operations fail. That function is usually called from the site of the error.

A well placed panic(err) will drop a stack trace that should show us what message is unexpected.

diff --git a/src/crypto/tls/conn.go b/src/crypto/tls/conn.go
index 77fd6d3254..017350976a 100644
--- a/src/crypto/tls/conn.go
+++ b/src/crypto/tls/conn.go
@@ -150,8 +150,7 @@ type halfConn struct {
 }

 func (hc *halfConn) setErrorLocked(err error) error {
-       hc.err = err
-       return err
+       panic(err)
 }

 // prepareCipherSpec sets the encryption and MAC states

panic: local error: tls: unexpected message

goroutine 1 [running]:
panic(0x185340, 0xc42006fae0)
	/Users/filippo/code/go/src/runtime/panic.go:500 +0x1a1
crypto/tls.(*halfConn).setErrorLocked(0xc42007da38, 0x25e6e0, 0xc42006fae0, 0x25eee0, 0xc4200c0af0)
	/Users/filippo/code/go/src/crypto/tls/conn.go:153 +0x4d
crypto/tls.(*Conn).sendAlertLocked(0xc42007d880, 0x1c390a, 0xc42007da38, 0x2d)
	/Users/filippo/code/go/src/crypto/tls/conn.go:719 +0x147
crypto/tls.(*Conn).sendAlert(0xc42007d880, 0xc42007990a, 0x0, 0x0)
	/Users/filippo/code/go/src/crypto/tls/conn.go:727 +0x8c
crypto/tls.(*Conn).readRecord(0xc42007d880, 0xc400000016, 0x0, 0x0)
	/Users/filippo/code/go/src/crypto/tls/conn.go:672 +0x719
crypto/tls.(*Conn).readHandshake(0xc42007d880, 0xe7a37, 0xc42006c3f0, 0x1030e, 0x0)
	/Users/filippo/code/go/src/crypto/tls/conn.go:928 +0x8f
crypto/tls.(*clientHandshakeState).doFullHandshake(0xc4200b7c10, 0xc420070480, 0x55)
	/Users/filippo/code/go/src/crypto/tls/handshake_client.go:262 +0x8c
crypto/tls.(*Conn).clientHandshake(0xc42007d880, 0x1c3928, 0xc42007d988)
	/Users/filippo/code/go/src/crypto/tls/handshake_client.go:228 +0xfd1
crypto/tls.(*Conn).Handshake(0xc42007d880, 0x0, 0x0)
	/Users/filippo/code/go/src/crypto/tls/conn.go:1259 +0x1b8
crypto/tls.DialWithDialer(0xc4200b7e40, 0x1ad310, 0x3, 0x1af02b, 0xf, 0xc420092580, 0x4ff80, 0xc420072000, 0xc42007d118)
	/Users/filippo/code/go/src/crypto/tls/tls.go:146 +0x1f8
crypto/tls.Dial(0x1ad310, 0x3, 0x1af02b, 0xf, 0xc420092580, 0xc42007ce00, 0x0, 0x0)
	/Users/filippo/code/go/src/crypto/tls/tls.go:170 +0x9d

Sweet, let's see where the unexpected message alert is sent, at conn.go:672.

 670     case recordTypeChangeCipherSpec:
 671         if typ != want || len(data) != 1 || data[0] != 1 {
 672             c.in.setErrorLocked(c.sendAlert(alertUnexpectedMessage))
 673             break
 674         }
 675         err := c.in.changeCipherSpec()
 676         if err != nil {
 677             c.in.setErrorLocked(c.sendAlert(err.(alert)))
 678         }

So the message we didn't expect is the ChangeCipherSpec. Let's see if the higher stack frames give us an indication as to what we expected instead. Let's chase handshake_client.go:262.

 259 func (hs *clientHandshakeState) doFullHandshake() error {
 260     c := hs.c
 261
 262     msg, err := c.readHandshake()
 263     if err != nil {
 264         return err
 265     }

Ah, doFullHandshake. Wait. The server here is clearly doing a resumption (sending a Change Cipher Spec immediately after the Server Hello), while the client... tries to do a full handshake?

It looks like the client offers a Session Ticket, the server accepts it, but the client doesn't realize and carries on.

RFC diving

At this point I had to fill a gap in my TLS 1.2 knowledge. How does a server signal acceptance of a Session Ticket?

RFC 5077, which obsoletes RFC 4507, says:

When presenting a ticket, the client MAY generate and include a
Session ID in the TLS ClientHello. If the server accepts the ticket
and the Session ID is not empty, then it MUST respond with the same
Session ID present in the ClientHello.

So a client that doesn't want to guess whether a Session Ticket is accepted or not will send a Session ID and look for it to be echoed back by the server.

The code in crypto/tls, clear as always, does exactly that.

func (hs *clientHandshakeState) serverResumedSession() bool {
    // If the server responded with the same sessionId then it means the
    // sessionTicket is being used to resume a TLS session.
    return hs.session != nil && hs.hello.sessionId != nil &&
        bytes.Equal(hs.serverHello.sessionId, hs.hello.sessionId)
}

Session IDs diving

Something must be going wrong there. Let's practice some healthy print-based debugging.

diff --git a/src/crypto/tls/handshake_client.go b/src/crypto/tls/handshake_client.go
index f789e6f888..2868802d82 100644
--- a/src/crypto/tls/handshake_client.go
+++ b/src/crypto/tls/handshake_client.go
@@ -552,6 +552,8 @@ func (hs *clientHandshakeState) establishKeys() error {
 func (hs *clientHandshakeState) serverResumedSession() bool {
        // If the server responded with the same sessionId then it means the
        // sessionTicket is being used to resume a TLS session.
+       println(hex.Dump(hs.hello.sessionId))
+       println(hex.Dump(hs.serverHello.sessionId))
        return hs.session != nil && hs.hello.sessionId != nil &&
                bytes.Equal(hs.serverHello.sessionId, hs.hello.sessionId)
 }

00000000  a8 73 2f c4 c9 80 e2 ef  b8 e0 b7 da cf 0d 71 e5  |.s/...........q.|

00000000  a8 73 2f c4 c9 80 e2 ef  b8 e0 b7 da cf 0d 71 e5  |.s/...........q.|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Ah. The F5 server is padding the Session ID to its maximum length of 32 bytes, instead of returning it as the client sent it. crypto/tls in Go uses 16 byte Session IDs.

From there the failure mode is clear: the server thinks it told the client to use the ticket, the client thinks the server started a new session, and things get unexpected.

In the TLS space we have seen quite some incompatibilities like this. Notoriously, ClientHellos have to be either shorter than 256 bytes or longer than 512 not to clash with some server implementations.

I was about to write this up as just another real world TLS quirk when...

00000000  79 bd e5 a8 77 55 8b 92  41 e9 89 45 e1 50 31 25  |y...wU..A..E.P1%|

00000000  79 bd e5 a8 77 55 8b 92  41 e9 89 45 e1 50 31 25  |y...wU..A..E.P1%|
00000010  04 27 a8 4f 63 22 de 8b  ef f9 a3 13 dd 66 5c ee  |.'.Oc".......f\.|

Uh oh. Wait. Those are not zeroes. That's not padding. That's... memory?

At this point the impression of dealing with a Heartbleed-like vulnerability got pretty clear. The server is allocating a buffer as big as the client's Session ID, and then sending back always 32 bytes, bringing along whatever unallocated memory was in the extra bytes.

Browser diving

I had one last source of skepticism: how could this not have been noticed before?

The answer is banal: all browsers use 32-byte Session IDs to negotiate Session Tickets. Together with Nick Sullivan I checked NSS, OpenSSL and BoringSSL to confirm. Here's BoringSSL for example.

  /* Generate a session ID for this session based on the session ticket. We use
   * the session ID mechanism for detecting ticket resumption. This also fits in
   * with assumptions elsewhere in OpenSSL.*/
  if (!EVP_Digest(CBS_data(&ticket), CBS_len(&ticket),
                  session->session_id, &session->session_id_length,
                  EVP_sha256(), NULL)) {
    goto err;
  }

BoringSSL uses a SHA256 hash of the Session Ticket, which is exactly 32 bytes.

(Interestingly, from speaking to people in the TLS field, there was an idle intention to switch to 1-byte Session IDs but no one had tested it widely yet.)

As for Go, it’s probably the case that client-side Session Tickets are not enabled that often.

Disclosure diving

After realizing the security implications of this issue we compartmentalized it inside the company, made sure our Support team would advise our customer to simply disable Session Tickets, and sought to contact F5.

After a couple misdirected emails that were met by requests for Serial Numbers, we got in contact with the F5 SIRT, exchanged PGP keys, and provided a report and a PoC.

The report was escalated to the development team, and confirmed to be an uninitialized memory disclosure limited to the Session Ticket functionality.

It's unclear what data might be exfiltrated via this vulnerability, but Heartbleed and the Cloudflare Heartbleed Challenge taught us not to make assumptions of safety with uninitialized memory.

In planning a timeline, the F5 team was faced with a rigid release schedule. Considering multiple factors, including the availability of an effective mitigation (disabling Session Tickets) and the apparent triviality of the vulnerability, I decided to adhere to the industry-standard disclosure policy adopted by Google's Project Zero: 90 days with 15 days of grace period if a fix is due to be released.

By coincidence today coincides with both the expiration of those terms and the scheduled release of the first hotfix for one of the affected versions.

I'd like to thank the F5 SIRT for their professionalism, transparency and collaboration, which were in pleasant contrast with the stories of adversarial behavior we hear too often in the industry.

The issue was assigned CVE-2016-9244.

Internet diving

When we reported the issue to F5 I had tested the vulnerability against a single host, which quickly became unavailable after disabling Session Tickets. That meant having both low confidence in the extent of the vulnerability, and no way to reproduce it.

This was the perfect occasion to perform an Internet scan. I picked the toolkit that powers Censys.io by the University of Michigan: zmap and zgrab.

zmap is an IPv4-space scanning tool that detects open ports, while zgrab is a Go tool that follows up by connecting to those ports and collecting a number of protocol details.

I added support for Session Ticket resumption to zgrab, and then wrote a simple Ticketbleed detector by having zgrab send a 31-byte Session ID, and comparing it with the one returned by the server.

diff --git a/ztools/ztls/handshake_client.go b/ztools/ztls/handshake_client.go
index e6c506b..af098d3 100644
--- a/ztools/ztls/handshake_client.go
+++ b/ztools/ztls/handshake_client.go
@@ -161,7 +161,7 @@ func (c *Conn) clientHandshake() error {
                session, sessionCache = nil, nil
                hello.ticketSupported = true
                hello.sessionTicket = []byte(c.config.FixedSessionTicket)
-               hello.sessionId = make([]byte, 32)
+               hello.sessionId = make([]byte, 32-1)
                if _, err := io.ReadFull(c.config.rand(), hello.sessionId); err != nil {
                        c.sendAlert(alertInternalError)
                        return errors.New("tls: short read from Rand: " + err.Error())
@@ -658,8 +658,11 @@ func (hs *clientHandshakeState) processServerHello() (bool, error) {

        if c.config.FixedSessionTicket != nil {
                c.resumption = &Resumption{
-                       Accepted:  hs.hello.sessionId != nil && bytes.Equal(hs.serverHello.sessionId, hs.hello.sessionId),
-                       SessionID: hs.serverHello.sessionId,
+                       Accepted: hs.hello.sessionId != nil && bytes.Equal(hs.serverHello.sessionId, hs.hello.sessionId),
+                       TicketBleed: len(hs.serverHello.sessionId) > len(hs.hello.sessionId) &&
+                               bytes.Equal(hs.serverHello.sessionId[:len(hs.hello.sessionId)], hs.hello.sessionId),
+                       ServerSessionID: hs.serverHello.sessionId,
+                       ClientSessionID: hs.hello.sessionId,
                }
                return false, FixedSessionTicketError
        }

By picking 31 bytes I ensured the sensitive information leakage would be negligible.

I then downloaded the latest zgrab results from the Censys website, which thankfully included information on what hosts supported Session Tickets, and completed the pipeline with abundant doses of pv and jq.

After getting two hits in the first 1,000 hosts from the Alexa top 1m list in November, I interrupted the scan to avoid leaking the vulnerability and postponed to a date closer to the disclosure.

While producing this writeup I completed the scan, and found between 0.1% and 0.2% of all hosts to be vulnerable, or 0.4% of the websites supporting Session Tickets.

For more details visit the F5 K05121675 article or ticketbleed.com, where you'll find a technical summary, affected versions, mitigation instructions, a complete timeline, scan results, IPs of the scanning machines, and an online test.

Otherwise, you might want to follow me on Twitter.