# C Data, software, and mathematical derivations

In this appendix we draw the reader’s attention to our respective websites

www.uva.nl/profile/f.j.g.m.klaassen

and

where we provide background material. This material includes the complete data set underlying most of our analyses; the computer program, called *Richard*, which we use to calculate probabilities and importances; and also derivations of some of the mathematical results.

# Data

The principal (but not the only) data set used in this book consists of point-by-point data of men’s and women’s singles matches at Wimbledon 1992–1995. These data were given to us by IBM UK, and they were supplemented by ranking data, which we received from the ITF. The data set is described and applied in Chapter 5. Chapters 6–9 use match data, aggregated from these point data. The point data themselves are used again in Chapters 10–12.

The complete data set is presented as an Excel file. This file also contains point-by-point data of the three matches (Federer-Nadal, Clijsters-Williams, Djokovic-Nadal) investigated in Chapters 3 and 4. Included in the Excel file is a separate sheet where the user can easily calculate most of the simple estimates presented in the book, that is, the estimates that do not rely on the GMM estimation method. GMM estimation requires external statistical software.

As an example, consider the relative frequency of winning a point on service for the men. The overall estimate of 64.4% and its standard error of 0.2%, presented on page 71, are replicated in the file. The output automatically provides the breakdown of this frequency, based on whether the server and receiver are seeded or non-seeded players, as reported in Table 7.1. If the user wishes to analyze only a subsample, for example excluding points in a tiebreak and after a break in the previous game within the same set (as used in Table 12.3), he or she can enter this request as well. Precise instructions and explanations are provided in the file.
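To make the kind of calculation concrete, the following is a minimal Python sketch of a relative frequency and its standard error. It assumes the usual binomial standard error for independent points, which may differ in detail from the calculation in the Excel file. The function name `proportion_with_se` and the counts are hypothetical and are not taken from the data set.

```python
import math


def proportion_with_se(points_won, points_served):
    """Return the relative frequency and its binomial standard error."""
    if points_served <= 0:
        raise ValueError("points_served must be positive")
    p_hat = points_won / points_served
    # Binomial standard error, assuming independent points (an assumption).
    se = math.sqrt(p_hat * (1.0 - p_hat) / points_served)
    return p_hat, se


# Hypothetical counts, chosen only for illustration.
p_hat, se = proportion_with_se(points_won=644, points_served=1000)
print(f"estimate = {p_hat:.3f}, standard error = {se:.4f}")
```

A subsample analysis amounts to counting only the points that satisfy the chosen condition before calling the function.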

At several places in the book we also use other data. Summary statistics of the grand slam tournaments for 1992–2010 were provided by IBM UK and the ITF. Data from OnCourt are used in analyzing upsets and the Isner-Mahut match in Chapter 2, and in performing sensitivity analyses for two of the three matches in Chapter 3. The bookmakers’ odds at the beginning of each of the three matches were downloaded from Tennis-data.

More tennis data can be obtained from various sources. The websites of the grand slam tournaments provide data and illustrations using IBM SlamTracker. The ATP and WTA websites contain data on many tournaments and rankings. Other websites publish data as well, and their number is growing.

# Software: program *Richard*

*Richard* is a computer program. It computes the probabilities of winning the game (tiebreak), set, and match at any point of a tennis match. The program also produces functions of these probabilities, such as the importance of a game in the set or the importance of a point in the match.

The only required inputs are the service point winning probabilities *p*_{i} and *p*_{j} of the two players $\mathcal{I}$ and $\mathcal{J}$, the current score, the current server, and the rules of the tournament (say, best-of-five sets with a tiebreak in all sets except the final set). The program is fast and accurate, has a convenient ‘matrix’ setup, is flexible for studying rule changes, and is freely available.

*Richard* was born in 1994. It was first written in a computer language called Turbo Pascal. Since Turbo Pascal is not much used any more, we present the program in Matlab, which is more common and more user-friendly.

The program, including all source code, can be downloaded from our websites. Also available on the websites is the associated ‘inversion’ program, described on page 34, which is important for in-play forecasting and betting. The inversion program produces robust values of *p*_{i} and *p*_{j}, based on two inputs: the match-winning probability at the beginning of the match (obtained from an outside source) and the sum *p*_{i} + *p*_{j}.
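The idea of the inversion can be illustrated with a minimal Python sketch. This is not the code of the inversion program itself: it assumes iid points, a tiebreak at 6-6 in every set (including the final set), the same set-winning probability in every set, and a simple bisection on the difference *p*_{i} − *p*_{j}. All function names (`game`, `tiebreak`, `tennis_set`, `match`, `invert`) are hypothetical.

```python
from functools import lru_cache


def game(p):
    """Probability that the server wins a game, given point probability p."""
    q = 1.0 - p
    return p**4 * (1 + 4*q + 10*q*q) + 20 * p**3 * q**3 * p*p / (1 - 2*p*q)


def tiebreak(pi, pj):
    """Probability that I wins a first-to-seven tiebreak, I serving first."""
    @lru_cache(maxsize=None)
    def win(a, b):
        if a >= 7 and a - b >= 2:
            return 1.0
        if b >= 7 and b - a >= 2:
            return 0.0
        if a == 6 and b == 6:
            # From 6-6 each player serves one of every two points.
            up, down = pi * (1 - pj), (1 - pi) * pj
            return up / (up + down)
        n = a + b                          # points played so far
        i_serves = ((n + 1) // 2) % 2 == 0  # order I, J J, I I, J J, ...
        w = pi if i_serves else 1 - pj      # probability that I wins this point
        return w * win(a + 1, b) + (1 - w) * win(a, b + 1)
    return win(0, 0)


def tennis_set(pi, pj):
    """Probability that I wins a tiebreak set, I serving the first game."""
    gi, gj = game(pi), game(pj)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a >= 6 and a - b >= 2:
            return 1.0
        if b >= 6 and b - a >= 2:
            return 0.0
        if a == 6 and b == 6:
            return tiebreak(pi, pj)
        w = gi if (a + b) % 2 == 0 else 1 - gj  # servers alternate by game
        return w * win(a + 1, b) + (1 - w) * win(a, b + 1)
    return win(0, 0)


def match(pi, pj, sets_to_win=3):
    """Probability that I wins the match (same probability assumed per set)."""
    s = tennis_set(pi, pj)

    @lru_cache(maxsize=None)
    def win(a, b):
        if a == sets_to_win:
            return 1.0
        if b == sets_to_win:
            return 0.0
        return s * win(a + 1, b) + (1 - s) * win(a, b + 1)
    return win(0, 0)


def invert(match_prob, p_sum, sets_to_win=3, tol=1e-10):
    """Find (pi, pj) with pi + pj = p_sum and match(pi, pj) = match_prob."""
    half_width = min(p_sum, 2.0 - p_sum) - 1e-9
    lo, hi = -half_width, half_width        # bounds on d = pi - pj
    while hi - lo > tol:
        d = 0.5 * (lo + hi)
        if match((p_sum + d) / 2, (p_sum - d) / 2, sets_to_win) < match_prob:
            lo = d                          # I needs a larger advantage
        else:
            hi = d
    d = 0.5 * (lo + hi)
    return (p_sum + d) / 2, (p_sum - d) / 2


# Round trip with illustrative values: compute a match probability, then invert it.
target = match(0.68, 0.62)
pi, pj = invert(target, 0.68 + 0.62)
print(round(target, 4), round(pi, 4), round(pj, 4))
```

Bisection works here because, for a fixed sum, the match-winning probability of $\mathcal{I}$ increases with the difference *p*_{i} − *p*_{j}.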

We emphasize three features of *Richard*. First, it is based on the assumption that service points are independent and identically distributed (iid), so that two probabilities (*p*_{i} and *p*_{j}) suffice. Since tennis has a hierarchical structure (points form games and tiebreaks, which form sets, which form matches), games, tiebreaks and sets are also iid. This allows for a step-by-step calculation of the probability of winning a match. First calculate the game (tiebreak) winning probability, then the set probability, and finally the match probability.
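As a worked illustration of the final step, suppose every set is played under the same rules, so that $\mathcal{I}$ wins each set with the same probability $s$, independently of the other sets. The symbols $s$ and $m$ are introduced here only for illustration. The match-winning probability at the start of the match is then

$$
m_{\text{best of 3}} = s^{2}\bigl[1 + 2(1-s)\bigr], \qquad
m_{\text{best of 5}} = s^{3}\bigl[1 + 3(1-s) + 6(1-s)^{2}\bigr].
$$

The bracketed terms count the ways in which the opponent can win zero, one, or two sets before $\mathcal{I}$ wins the deciding one. For $s = 1/2$ both expressions equal $1/2$, as they should.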

A second distinguishing feature of *Richard* concerns the calculation of the winning probabilities at each step. For example, consider a game served by player $\mathcal{I}$. At score 0-0, $\mathcal{I}$ can win or lose the point, giving 15-0 or 0-15, with probabilities *p*_{i} and 1 − *p*_{i}, respectively. After 15-0 the score can be 30-0 or 15-15, with the same transition probabilities, and similarly after 0-15. Special scores are deuce and advantage. Deuce is equivalent to 30-30, and advantage server (receiver) is equivalent to 40-30 (30-40). The game continues until either $\mathcal{I}$ or $\mathcal{J}$ has won the game. We thus obtain seventeen different scores (fifteen scores within the game plus the two final outcomes), and at each score the probability of the next score is constant: either *p*_{i} or 1 − *p*_{i}. This type of structure is known as a ‘finite Markov chain’ (Kemeny and Snell, 1960, pp. 161–167), which has many important and useful properties. One of these properties is that the whole process can be represented in one square of numbers, called a ‘matrix’. This matrix has seventeen rows and seventeen columns, and each cell contains the probability of going from one score to the other. Given this matrix, it is then easy to calculate the probability of winning the game from each score. The matrix approach makes the program fast: it takes a few hundredths of a second to compute one match-winning probability. The total computing time for all 413 points in the Federer-Nadal Wimbledon 2008 final is only two seconds on a standard desktop computer.
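The matrix idea for a single game can be sketched in a few lines of Python. This is an illustration under the assumptions stated in the text (seventeen scores, constant point probability), not the Matlab code of *Richard*. The state labels and the function names `transition_matrix` and `win_probabilities` are hypothetical, and the linear system is solved with a plain Gauss-Jordan elimination to keep the sketch free of external libraries.

```python
STATES = ["0-0", "15-0", "0-15", "30-0", "15-15", "0-30", "40-0", "30-15",
          "15-30", "0-40", "40-15", "15-40", "deuce", "adv-server",
          "adv-receiver", "game-server", "game-receiver"]

# For each non-final score: (next score if the server wins the point,
#                            next score if the server loses the point).
NEXT = {
    "0-0": ("15-0", "0-15"),
    "15-0": ("30-0", "15-15"),
    "0-15": ("15-15", "0-30"),
    "30-0": ("40-0", "30-15"),
    "15-15": ("30-15", "15-30"),
    "0-30": ("15-30", "0-40"),
    "40-0": ("game-server", "40-15"),
    "30-15": ("40-15", "deuce"),              # 30-30 is equivalent to deuce
    "15-30": ("deuce", "15-40"),
    "0-40": ("15-40", "game-receiver"),
    "40-15": ("game-server", "adv-server"),   # 40-30 is advantage server
    "15-40": ("adv-receiver", "game-receiver"),
    "deuce": ("adv-server", "adv-receiver"),
    "adv-server": ("game-server", "deuce"),
    "adv-receiver": ("deuce", "game-receiver"),
}


def transition_matrix(p):
    """Return the 17 x 17 matrix of one-step transition probabilities."""
    idx = {s: k for k, s in enumerate(STATES)}
    n = len(STATES)
    P = [[0.0] * n for _ in range(n)]
    for s, (won, lost) in NEXT.items():
        P[idx[s]][idx[won]] = p
        P[idx[s]][idx[lost]] = 1.0 - p
    for s in ("game-server", "game-receiver"):
        P[idx[s]][idx[s]] = 1.0               # final outcomes are absorbing
    return P


def win_probabilities(p):
    """Probability that the server wins the game from each non-final score."""
    P = transition_matrix(p)
    t = len(NEXT)                             # non-final scores come first in STATES
    win_col = STATES.index("game-server")
    # Augmented system (I - Q) x = r, where Q holds the transitions between
    # non-final scores and r the one-step probabilities of winning the game.
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(t)] + [P[i][win_col]]
         for i in range(t)]
    for c in range(t):                        # Gauss-Jordan with partial pivoting
        piv = max(range(c, t), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(t):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return {STATES[i]: A[i][t] / A[i][i] for i in range(t)}


probs = win_probabilities(0.6)
print(round(probs["0-0"], 4), round(probs["deuce"], 4))
```

The same construction, with larger matrices, can be applied to the tiebreak, the set, and the match.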

Third, the program is not only fast, but also flexible. Winning probabilities under non-standard rules (match tiebreak, four-game set, no-ad rule), as discussed on page 27, can be obtained by just changing one single number. Because the full source code is available, the user can adjust the code, if needed, to investigate other, more exotic, rule changes.

# Mathematical derivations

For almost all mathematical results in the book we have explained how they were derived and why we needed them. For the remaining formulas we provide full derivations on our websites. These remaining formulas are the following.

## Chapter 2

(a) In Section ‘From point to game’ we consider a game where player $\mathcal{I}$ is serving with constant probability *p*_{i} of winning a point. The probability that he or she wins the game is expressed by the formula for *g*_{i} on page 15.

(b) Section ‘Long matches: Isner-Mahut 2010’ (pages 24–27) contains a three-step derivation. This derivation is complete, but the reader may wish to see the explicit expressions (and their derivations) of $\ell (5,5)$ and $\ell (a,b)$. These are provided on the websites.

(c) In Section ‘Rule changes: the no-ad rule’ we consider the situation where the traditional scoring system at deuce is replaced by the no-ad rule, so that only one deciding point is played at deuce, and we state that the probability that $\mathcal{I}$ wins the game would change from *g*_{i} to ${\tilde{g}}_{i}$ on page 28.
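For orientation, the standard closed forms under iid points are reproduced here. They are stated under our own notation, with $q_i = 1 - p_i$ introduced for brevity, and are expected to agree with the formulas on pages 15 and 28, although the presentation there may differ.

$$
g_i = p_i^{4}\bigl(1 + 4q_i + 10q_i^{2}\bigr) + 20\,p_i^{3}q_i^{3}\,\frac{p_i^{2}}{1 - 2p_iq_i},
\qquad
\tilde{g}_i = p_i^{4}\bigl(1 + 4q_i + 10q_i^{2}\bigr) + 20\,p_i^{3}q_i^{3}\,p_i .
$$

The first term is the probability of winning the game before deuce is reached, and $20\,p_i^{3}q_i^{3}$ is the probability of reaching deuce. The two formulas differ only in the last factor: under traditional scoring the server wins from deuce with probability $p_i^{2}/(1 - 2p_iq_i)$, while under the no-ad rule a single deciding point is won with probability $p_i$.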

## Chapter 4

In Section ‘Big points in a game’ (pages 50–52) we consider again a game where player $\mathcal{I}$ is serving with probability *p*_{i} of winning a service point. Then we present formulas for *imp*_{pg}(40, 30), the importance of the score 40-30, and for *imp*_{pg}(30, 40), the importance of the score 30-40.
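As a sketch of how such formulas arise, assume that the importance of a point in the game is the probability that the server wins the game if he or she wins the point, minus the same probability if he or she loses the point. This definition and the notation $q_i = 1 - p_i$ are assumptions made here for illustration. Since both 40-30 and 30-40 lead either to deuce or to the end of the game, and the server wins from deuce with probability $p_i^{2}/(p_i^{2} + q_i^{2})$, we obtain

$$
imp_{pg}(40,30) = 1 - \frac{p_i^{2}}{p_i^{2} + q_i^{2}} = \frac{q_i^{2}}{p_i^{2} + q_i^{2}},
\qquad
imp_{pg}(30,40) = \frac{p_i^{2}}{p_i^{2} + q_i^{2}} - 0 = \frac{p_i^{2}}{p_i^{2} + q_i^{2}} .
$$

The two importances sum to one, and for $p_i > 1/2$ the score 30-40 is more important than 40-30.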