STAT-244-SC

Final Project Report

May 8, 2025
Eonbi Choi, Yerim Oh

F1 Miami Grand Prix 2024 Pit Stop Analysis

Abstract

In sports, data is collected to record athletes’ performances, which helps in making decisions to improve outcomes. In Formula 1 and other motorsports, tire management plays an important role in race strategy. Tires directly impact car performance as they are in contact with the track and transfers all the car’s power and movements to the track surface. Tire degradation influences major decisions such as pit stop timing during a race, where minor misjudgments can result in serious consequences in a fast-paced sports like Formula 1. Our study evaluates the relationship between pit stop laps and tire and lap information though employing linear regression models, cross validation, k-fold cross validation, and logistic regression. The objective is to improve pit stop timing prediction to facilitate effective strategic planning for optimizing pit stop timing, ultimately giving drivers a decisive advantage in race management.

Datasets

The data used in this study were obtained from the f1dataR R package that accesses Formula 1 data via the FastF1 Python library. The dataset includes lap-by-lap session data from the 2024 Miami Grand Prix and comprise 1,111 laps and 32 variables. These variables include driver details, lap times, pit in/out times, tire information, and track status. More detailed information about this package can be found in its API documentation.

Last accessed date/time: April 28, 2025 16:51 PM

# A tibble: 6 × 32
   time driver driver_number lap_time lap_number stint pit_out_time pit_in_time
  <dbl> <chr>  <chr>            <dbl>      <dbl> <dbl>        <dbl>       <dbl>
1 3438. VER    1                 94.3          1     1          NaN         NaN
2 3531. VER    1                 93.1          2     1          NaN         NaN
3 3624. VER    1                 93.1          3     1          NaN         NaN
4 3717. VER    1                 93.5          4     1          NaN         NaN
5 3810. VER    1                 92.8          5     1          NaN         NaN
6 3903. VER    1                 92.9          6     1          NaN         NaN
# ℹ 24 more variables: sector1time <dbl>, sector2time <dbl>, sector3time <dbl>,
#   sector1session_time <dbl>, sector2session_time <dbl>,
#   sector3session_time <dbl>, speed_i1 <dbl>, speed_i2 <dbl>, speed_fl <dbl>,
#   speed_st <dbl>, is_personal_best <list>, compound <chr>, tyre_life <dbl>,
#   fresh_tyre <lgl>, team <chr>, lap_start_time <dbl>, lap_start_date <dttm>,
#   track_status <chr>, position <dbl>, deleted <lgl>, deleted_reason <chr>,
#   fast_f1generated <lgl>, is_accurate <lgl>, session_type <chr>

Variables

quantitative variable:

lap_time: recorded time to complete a lap (seconds)
lap_number: lap number from which the telemetry data was recorded (number of laps)
tyre_life: number of laps completed on a set of tires (number of laps)

categorical variable:

compound: type of tire used (SOFT, MEDIUM, HARD)
pit_in: whether a driver made a pit stop during a lap (binary: 0 = no pit stop, 1 = pit stop occured)

Clean/Rearrange Data

Rearrange data to consist only the variables we are interested in

Check missing values

Data for lap_time are missing 5 values which are less than 0.1% of the entire observation

Finalize data we are going to use

# A tibble: 6 × 5
  lap_time lap_number compound tyre_life pit_in
     <dbl>      <dbl> <fct>        <dbl>  <dbl>
1     94.3          1 MEDIUM           1      0
2     93.1          2 MEDIUM           2      0
3     93.1          3 MEDIUM           3      0
4     93.5          4 MEDIUM           4      0
5     92.8          5 MEDIUM           5      0
6     92.9          6 MEDIUM           6      0

Why certain data points are missing

Out of 5 missing lap time records four records have a track status code of 41. However, no description of this code value is provided in the API. Thus, we assume that either the track was not fully cleared or conditions were not suitable for racing. The other missing record was due to a driver failing to complete a lap due to collision.

Data Visualization

Our response variable:pit_in

Number of pit stops for each lap - This plot shows the frequency of pit stops across laps during the 2024 Miami Grand Prix. It helps visualize when teams tend to stop during the race. Many teams pitted to change tires during the first half of the race and the most common pit stop occurring around lap 28. This race was unique in that some drivers performed a one-stop strategy, while others went for a two-stop approach. These decisions were influenced by various factors such as track position, gaps to nearby drivers, tire condition, and more. Pit stops in the later stages of the race likely reflect either a two-stop strategy or an attempt to set the fastest lap and earn an extra point.

Density plot for lap time of each team - This plot shows the distribution of lap times for each team during the race. We can compare the performance and variability in lap times across different teams. For most teams, the lap times are generally under 100 seconds, with some laps approaching 110 seconds. These patterns are largely consistent across teams, though some, such as Mercedes and Williams, show a few outliers on the higher end, which indicates occasional slower laps. - A smaller lap time in Formula 1 means that the driver completed a lap more quickly. In racing, lower lap times are better because they indicate higher performance.

Box plot of tire life for each compound - The plot shows the distribution of tire life (measured in laps) for each tire compound used in the race. On average, hard tires lasted very slightly longer than medium tires. Since hard and medium compounds were the most commonly used in this race, we have limited data on soft tires, roughly a quarter as much. This resulted in a narrower distribution for the soft compound. The tire compound directly affects tire life, with a trade-off between performance (speed and grip) and durability. As a result, softer compounds tend to wear out more quickly than harder ones.

Number of tires for each compound - Since the tire life is shorter for the softer compounds, but has a better performance, the drivers preferred MEDIUM tire to balance the trade-off.

Linear Regression Model

Research questions

Were drivers more likely to make pit stops when their lap time was longer and their tires were older compared to when their lap time was shorter and their tires were less used?
Were drivers more likely to make pit stops when their lap times were longer, their tires were older, and considering the type of tires they were using and their progress in the race?

Linear models considering based on the research question

Model 1: \[ \mathbb{E}(pit\_in \mid lap\_time,\ tyre\_life) = \beta_0 + \beta_1(lap\_time) + \beta_2(tyre\_life) \]
Model 2: \[ \begin{aligned} \mathbb{E}(pit\_in \mid lap\_time, \ lap\_number, \ compound, \ tyre\_life) &= \beta_0 + \beta_1(lap\_time) \\ &+ \beta_2(lap\_number) + \beta_3(compound) \\ &+ \beta_4(tyre\_life) \end{aligned} \]

# STEP 1: Model Specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: Model estimation
# first linear model
pit_lm1 <- lm_spec %>%
  fit(pit_in ~ lap_time + tyre_life, data = miami2024)
pit_lm1 %>% tidy()

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -0.422    0.0543       -7.77 1.74e-14
2 lap_time     0.00429  0.000537      7.98 3.63e-15
3 tyre_life    0.00243  0.000509      4.79 1.94e- 6

# second linear model
pit_lm2 <- lm_spec %>%
  fit(pit_in ~ lap_time + lap_number + compound + tyre_life, data = miami2024)
pit_lm2 %>% tidy()

# A tibble: 6 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    -0.446    0.0544       -8.21 6.32e-16
2 lap_time        0.00468  0.000534      8.76 7.32e-18
3 lap_number     -0.00214  0.000387     -5.54 3.88e- 8
4 compoundMEDIUM  0.0117   0.00936       1.25 2.10e- 1
5 compoundSOFT    0.0312   0.0240        1.30 1.93e- 1
6 tyre_life       0.00519  0.000698      7.44 1.97e-13

The regression results show that drivers were slightly more likely to make pit stops when their lap times were longer and their tires were older. In the extended model, lap time and tire age remained strong predictors and suggested that there are fewer stops later in the race with lap number having a slight negative effect. Tire compound had a small and non-significant effect. This indicates that tire compound did not meaningfully influence pit stop decisions when other factors were considered.

Cross Validation

Cross-validation is a statistical method used to evaluate how well a model performs by splitting the data into multiple subsets to train the model on some subsets and validate it on the remaining subsets.

Goal: Provide a more reliable and unbiased estimate of a model’s performance predicting new data, in order to detect overfitting and improve model generalization

Dividing data into test set and training set

k-fold CV: We can use k-fold cross-validation to estimate the typical error in our model predictions for new data:
- Divide the data into \(k\) folds (or groups) of approximately equal size.
- Repeat the following procedures for each fold \(j = 1,2,...,k\):
  - Remove fold \(j\) from the data set.
  - Fit a model using the data in the other \(k-1\) folds (training).
  - Use this model to predict the responses for the \(n_j\) cases in fold \(j\): \(\hat{y}_1, ..., \hat{y}_{n_j}\).
  - Calculate the MAE/MSE for fold \(j\) (testing):
- Combine this information into one measure of model quality

Error metric to use

Mean absolute error (MAE) of an estimator measures the absolute difference between the predicted values and the actual values in the dataset.
- \(\text{MAE}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} |y_i - \hat{y}_i|\)
- \(\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^k \text{MAE}_j\)
Mean squared error (MSE) of an estimator measures the average squared difference between the predicted values and the actual values in the dataset.
- \(\text{MSE}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} (y_i - \hat{y}_i)^2\)
- \(\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^k \text{MSE}_j\)

MAE vs. MSE

The advantage of using MAE is that it’s more robust to outliers, giving equal weight to all errors. Thus, it’s more suitable when outliers are not a significant concern.

On the other hand, MSE gives more weight to larger errors than smaller ones, making it highly sensitive to outliers. MSE is more suitable when the risk of prediction mistakes is crucial and the goal is to minimize the risk of errors.

Since outliers are less of a concern for us as they don’t lead to any life threatening or other major issues, we prioritize models that are directly interpretable. Our data is less common and less familiar to many people, so we decided to choose a model based on MAE.

pit_lm1 %>% augment(new_data = miami2024) %>%
  mae(truth = pit_in, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard      0.0505

pit_lm2 %>% augment(new_data = miami2024) %>%
  mae(truth = pit_in, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard      0.0587

k-fold CV implementation for different values of k

k=5

Model 1

# set seed for reproducibility
set.seed(123)

pit_lm1_k5 = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + tyre_life,
    resamples = vfold_cv(miami2024, v = 5),
    metrics = metric_set(mae, rmse)
  )
pit_lm1_k5 %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0507     5 0.00159 Preprocessor1_Model1
2 rmse    standard   0.152      5 0.00536 Preprocessor1_Model1

# get fold-by-fold results
pit_lm1_k5 %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 5 × 7
  splits            id    .metric .estimator .estimate .config          .notes  
  <list>            <chr> <chr>   <chr>          <dbl> <chr>            <list>  
1 <split [884/222]> Fold1 mae     standard      0.0477 Preprocessor1_M… <tibble>
2 <split [885/221]> Fold2 mae     standard      0.0560 Preprocessor1_M… <tibble>
3 <split [885/221]> Fold3 mae     standard      0.0528 Preprocessor1_M… <tibble>
4 <split [885/221]> Fold4 mae     standard      0.0486 Preprocessor1_M… <tibble>
5 <split [885/221]> Fold5 mae     standard      0.0485 Preprocessor1_M… <tibble>

Based on the random folds above, MAE was best for fold 1 (0.048) and worst for fold 2 (0.056).

Model 2

# set seed for reproducibility
set.seed(123)

pit_lm2_k5 = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + lap_number + compound + tyre_life,
    resamples = vfold_cv(miami2024, v = 5),
    metrics = metric_set(mae, rmse)
  )
pit_lm2_k5 %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0592     5 0.00161 Preprocessor1_Model1
2 rmse    standard   0.150      5 0.00490 Preprocessor1_Model1

# get fold-by-fold results
pit_lm2_k5 %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 5 × 7
  splits            id    .metric .estimator .estimate .config          .notes  
  <list>            <chr> <chr>   <chr>          <dbl> <chr>            <list>  
1 <split [884/222]> Fold1 mae     standard      0.0533 Preprocessor1_M… <tibble>
2 <split [885/221]> Fold2 mae     standard      0.0621 Preprocessor1_M… <tibble>
3 <split [885/221]> Fold3 mae     standard      0.0617 Preprocessor1_M… <tibble>
4 <split [885/221]> Fold4 mae     standard      0.0585 Preprocessor1_M… <tibble>
5 <split [885/221]> Fold5 mae     standard      0.0606 Preprocessor1_M… <tibble>

Based on the random folds above, MAE was best for fold 1 (0.053) and worst for fold 2 (0.062).

# 5-fold CV MAE and sd
pit_lm1_k5 %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean      sd
   <dbl>   <dbl>
1 0.0507 0.00356

pit_lm2_k5 %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean      sd
   <dbl>   <dbl>
1 0.0592 0.00360

In-sample and 5-fold CV MAE and standard deviation for both models.

Model	In-sample MAE	5-fold CV MAE	In-sample SD	5-fold CV SD
`model_1`	0.05045	0.05073	0.15247	0.00356
`model_2`	0.05975	0.05922	0.15035	0.00360

5-fold cross-validation was used to assess the performance of two models predicting pit stops. Model 1, using only lap time and tire life, achieved a mean MAE of 0.05073 with a low standard deviation (0.00356). Model 2, which adds lap number and tire compound, had a higher mean MAE of 0.05922 with a similar standard deviation (0.00360).

Although Model 2 includes more predictors, it performed slightly worse than Model 1 in both cross-validation and in-sample metrics. This suggests that the additional variables do not improve prediction. Model 1 is therefore more accurate and efficient for predicting pit stops.

k=10

Model 1

# set seed for reproducibility
set.seed(123)

pit_lm1_cv = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + tyre_life,
    resamples = vfold_cv(miami2024, v = 10),
    metrics = metric_set(mae, rmse)
  )
pit_lm1_cv %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0510    10 0.00294 Preprocessor1_Model1
2 rmse    standard   0.150     10 0.0109  Preprocessor1_Model1

# get fold-by-fold results
pit_lm1_cv %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 10 × 7
   splits            id     .metric .estimator .estimate .config        .notes  
   <list>            <chr>  <chr>   <chr>          <dbl> <chr>          <list>  
 1 <split [995/111]> Fold01 mae     standard      0.0368 Preprocessor1… <tibble>
 2 <split [995/111]> Fold02 mae     standard      0.0544 Preprocessor1… <tibble>
 3 <split [995/111]> Fold03 mae     standard      0.0614 Preprocessor1… <tibble>
 4 <split [995/111]> Fold04 mae     standard      0.0472 Preprocessor1… <tibble>
 5 <split [995/111]> Fold05 mae     standard      0.0379 Preprocessor1… <tibble>
 6 <split [995/111]> Fold06 mae     standard      0.0602 Preprocessor1… <tibble>
 7 <split [996/110]> Fold07 mae     standard      0.0600 Preprocessor1… <tibble>
 8 <split [996/110]> Fold08 mae     standard      0.0434 Preprocessor1… <tibble>
 9 <split [996/110]> Fold09 mae     standard      0.0505 Preprocessor1… <tibble>
10 <split [996/110]> Fold10 mae     standard      0.0581 Preprocessor1… <tibble>

Based on the random folds above, the MAE was best for fold 1 with an MAE of approximately 0.037 and worst for fold 3 with an MAE of 0.061 approximately.

Model 2

# set seed for reproducibility
set.seed(123)

pit_lm2_cv = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + lap_number + compound + tyre_life,
    resamples = vfold_cv(miami2024, v = 10),
    metrics = metric_set(mae, rmse)
  )
pit_lm2_cv %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0594    10 0.00262 Preprocessor1_Model1
2 rmse    standard   0.148     10 0.0104  Preprocessor1_Model1

# get fold-by-fold results
pit_lm2_cv %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 10 × 7
   splits            id     .metric .estimator .estimate .config        .notes  
   <list>            <chr>  <chr>   <chr>          <dbl> <chr>          <list>  
 1 <split [995/111]> Fold01 mae     standard      0.0436 Preprocessor1… <tibble>
 2 <split [995/111]> Fold02 mae     standard      0.0616 Preprocessor1… <tibble>
 3 <split [995/111]> Fold03 mae     standard      0.0698 Preprocessor1… <tibble>
 4 <split [995/111]> Fold04 mae     standard      0.0566 Preprocessor1… <tibble>
 5 <split [995/111]> Fold05 mae     standard      0.0518 Preprocessor1… <tibble>
 6 <split [995/111]> Fold06 mae     standard      0.0658 Preprocessor1… <tibble>
 7 <split [996/110]> Fold07 mae     standard      0.0655 Preprocessor1… <tibble>
 8 <split [996/110]> Fold08 mae     standard      0.0521 Preprocessor1… <tibble>
 9 <split [996/110]> Fold09 mae     standard      0.0601 Preprocessor1… <tibble>
10 <split [996/110]> Fold10 mae     standard      0.0671 Preprocessor1… <tibble>

Based on the random folds above, MAE was best for fold 1 (0.044) and worst for fold 3 (0.070).

# 10-fold CV MAE and sd
pit_lm1_cv %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean      sd
   <dbl>   <dbl>
1 0.0510 0.00931

pit_lm2_cv %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean      sd
   <dbl>   <dbl>
1 0.0594 0.00829

In-sample and 10-fold CV MAE and standard deviation for both models.

Model	In-sample MAE	10-fold CV MAE	In-sample SD	10-fold CV SD
`model_1`	0.05045	0.05100	0.15247	0.00931
`model_2`	0.05975	0.05939	0.15035	0.00829

With 10-fold cross-validation, Model 1 had a mean MAE of 0.0510, while Model 2 had a slightly higher MAE of 0.0594. Both models showed low standard deviations approximately 0.009. As in the 5-fold case, Model 1 remained slightly more accurate and stable than Model 2.

k = 20

Model 1

# set seed for reproducibility
set.seed(123)

pit_lm1_k20 = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + tyre_life,
    resamples = vfold_cv(miami2024, v = 20),
    metrics = metric_set(mae, rmse)
  )
pit_lm1_k20 %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0509    20 0.00399 Preprocessor1_Model1
2 rmse    standard   0.140     20 0.0142  Preprocessor1_Model1

# get fold-by-fold results
pit_lm1_k20 %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 20 × 7
   splits            id     .metric .estimator .estimate .config        .notes  
   <list>            <chr>  <chr>   <chr>          <dbl> <chr>          <list>  
 1 <split [1050/56]> Fold01 mae     standard      0.0451 Preprocessor1… <tibble>
 2 <split [1050/56]> Fold02 mae     standard      0.0519 Preprocessor1… <tibble>
 3 <split [1050/56]> Fold03 mae     standard      0.0509 Preprocessor1… <tibble>
 4 <split [1050/56]> Fold04 mae     standard      0.0658 Preprocessor1… <tibble>
 5 <split [1050/56]> Fold05 mae     standard      0.0439 Preprocessor1… <tibble>
 6 <split [1050/56]> Fold06 mae     standard      0.0412 Preprocessor1… <tibble>
 7 <split [1051/55]> Fold07 mae     standard      0.0602 Preprocessor1… <tibble>
 8 <split [1051/55]> Fold08 mae     standard      0.0429 Preprocessor1… <tibble>
 9 <split [1051/55]> Fold09 mae     standard      0.0402 Preprocessor1… <tibble>
10 <split [1051/55]> Fold10 mae     standard      0.0263 Preprocessor1… <tibble>
11 <split [1051/55]> Fold11 mae     standard      0.0266 Preprocessor1… <tibble>
12 <split [1051/55]> Fold12 mae     standard      0.0585 Preprocessor1… <tibble>
13 <split [1051/55]> Fold13 mae     standard      0.0704 Preprocessor1… <tibble>
14 <split [1051/55]> Fold14 mae     standard      0.0274 Preprocessor1… <tibble>
15 <split [1051/55]> Fold15 mae     standard      0.0302 Preprocessor1… <tibble>
16 <split [1051/55]> Fold16 mae     standard      0.0825 Preprocessor1… <tibble>
17 <split [1051/55]> Fold17 mae     standard      0.0591 Preprocessor1… <tibble>
18 <split [1051/55]> Fold18 mae     standard      0.0429 Preprocessor1… <tibble>
19 <split [1051/55]> Fold19 mae     standard      0.0611 Preprocessor1… <tibble>
20 <split [1051/55]> Fold20 mae     standard      0.0901 Preprocessor1… <tibble>

Based on the random folds above, MAE was best for fold 10 (0.026) and worst for fold 20 (0.090).

Model 2

# set seed for reproducibility
set.seed(123)

pit_lm2_k20 = lm_spec %>%
  fit_resamples(
    pit_in ~ lap_time + lap_number + compound + tyre_life,
    resamples = vfold_cv(miami2024, v = 20),
    metrics = metric_set(mae, rmse)
  )
pit_lm2_k20 %>% collect_metrics()

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 mae     standard   0.0593    20 0.00398 Preprocessor1_Model1
2 rmse    standard   0.139     20 0.0134  Preprocessor1_Model1

# get fold-by-fold results
pit_lm2_k20 %>% unnest(.metrics) %>%
  filter(.metric == "mae")

# A tibble: 20 × 7
   splits            id     .metric .estimator .estimate .config        .notes  
   <list>            <chr>  <chr>   <chr>          <dbl> <chr>          <list>  
 1 <split [1050/56]> Fold01 mae     standard      0.0508 Preprocessor1… <tibble>
 2 <split [1050/56]> Fold02 mae     standard      0.0623 Preprocessor1… <tibble>
 3 <split [1050/56]> Fold03 mae     standard      0.0564 Preprocessor1… <tibble>
 4 <split [1050/56]> Fold04 mae     standard      0.0755 Preprocessor1… <tibble>
 5 <split [1050/56]> Fold05 mae     standard      0.0596 Preprocessor1… <tibble>
 6 <split [1050/56]> Fold06 mae     standard      0.0535 Preprocessor1… <tibble>
 7 <split [1051/55]> Fold07 mae     standard      0.0652 Preprocessor1… <tibble>
 8 <split [1051/55]> Fold08 mae     standard      0.0492 Preprocessor1… <tibble>
 9 <split [1051/55]> Fold09 mae     standard      0.0474 Preprocessor1… <tibble>
10 <split [1051/55]> Fold10 mae     standard      0.0324 Preprocessor1… <tibble>
11 <split [1051/55]> Fold11 mae     standard      0.0347 Preprocessor1… <tibble>
12 <split [1051/55]> Fold12 mae     standard      0.0630 Preprocessor1… <tibble>
13 <split [1051/55]> Fold13 mae     standard      0.0818 Preprocessor1… <tibble>
14 <split [1051/55]> Fold14 mae     standard      0.0362 Preprocessor1… <tibble>
15 <split [1051/55]> Fold15 mae     standard      0.0405 Preprocessor1… <tibble>
16 <split [1051/55]> Fold16 mae     standard      0.0838 Preprocessor1… <tibble>
17 <split [1051/55]> Fold17 mae     standard      0.0652 Preprocessor1… <tibble>
18 <split [1051/55]> Fold18 mae     standard      0.0537 Preprocessor1… <tibble>
19 <split [1051/55]> Fold19 mae     standard      0.0724 Preprocessor1… <tibble>
20 <split [1051/55]> Fold20 mae     standard      0.102  Preprocessor1… <tibble>

Based on the random folds above, MAE was best for fold 10 (0.032) and worst for fold 20 (0.101).

# 20-fold CV MAE and sd
pit_lm1_k20 %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean     sd
   <dbl>  <dbl>
1 0.0509 0.0178

pit_lm2_k20 %>% unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean = mean(.estimate), sd = sd(.estimate))

# A tibble: 1 × 2
    mean     sd
   <dbl>  <dbl>
1 0.0593 0.0178

In-sample and 20-fold CV MAE and standard deviation for both models.

Model	In-sample MAE	20-fold CV MAE	In-sample SD	20-fold CV SD
`model_1`	0.05045	0.05086	0.15247	0.01785
`model_2`	0.05975	0.05925	0.15035	0.01781

In the 20-fold CV setup, Model 1 performed better than Model 2 with a lower mean MAE, 0.05086 and 0.05925, respectively. Even with increased fold, the simpler model generalized better across the dataset.

Compare different values of k


Model	In-sample MAE	5-fold CV MAE	10-fold CV MAE	20-fold CV MAE
`model_1`	0.05045	0.05073	0.05100	0.05086
`model_2`	0.05975	0.05922	0.05939	0.05925

Final model based on the smallest CV error

Across all cross-validation settings (5, 10, and 20-fold), Model 1 consistently showed lower MAE than Model 2. The differences were small but consistent and this suggests that Model 1 is a better model than Model 2 in predicting pit-stops.

Therefore, our final model based on the smallest CV error is:

\[ \mathbb{E}(pit\_in \mid lap\_time,\ tyre\_life) = \beta_0 + \beta_1(lap\_time) + \beta_2(tyre\_life) \]

Logistic Regression Model

Variables of interest

Predictors

lap_time: recorded time to complete a lap (seconds)
lap_number: lap number from which the telemetry data was recorded (number of laps)
tyre_life: number of laps completed on a set of tires (number of laps)
compound: type of tire used (SOFT, MEDIUM, HARD)

Response variable

pit_in: whether a driver made a pit stop during a lap (binary: 0 = no pit stop, 1 = pit stop occurred)

\[\begin{align*} Y_i &= \begin{cases} 1 & \text{ if a driver pitted on a lap } \\ 0 & \text{ otherwise (i.e., the driver did not pit on lap)} \end{cases} \end{align*}\]

Our logistic regression model

We are interested in determining the probability of making a pit stop during the 2024 Miami Grand Prix, considering factors such as lap time, track progress, tire age, and the type of tire used.

\[ \begin{aligned} \log(odds(pit\_in \mid lap\_time, \ lap\_number, \ tyre\_life, \ compound)) &= \beta_0 + \beta_1 (lap\_time) \\ &+ \beta_2(lap\_number) + \beta_3 (tyre\_life) \\ &+ \beta_4 \ I(compound = MEDIUM) \\ &+ \beta_5 \ I(compound = SOFT) \end{aligned} \]

# factor `pit_in` for logistic regression analysis
miami2024_glm <- miami2024 %>%
  mutate(pit_in_fac = as.factor(pit_in))

# logistic regression model
logistic_fit <- train(
  form   = pit_in_fac ~ lap_time + lap_number + tyre_life + compound,
  data   = miami2024_glm,
  family = "binomial", # this is an argument to glm; response is 0 or 1, binomial
  method = "glm",      # method for fit; "generalized linear model"
  trControl = trainControl(method = "none")
)

summary(logistic_fit$finalModel)


Call:
NULL

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -18.48162    2.27137  -8.137 4.06e-16 ***
lap_time         0.13739    0.02007   6.846 7.58e-12 ***
lap_number      -0.16001    0.03630  -4.408 1.04e-05 ***
tyre_life        0.27508    0.04580   6.006 1.91e-09 ***
compoundMEDIUM   0.49495    0.49718   0.996    0.319    
compoundSOFT     1.86135    1.17923   1.578    0.114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 261.16  on 1105  degrees of freedom
Residual deviance: 176.46  on 1100  degrees of freedom
AIC: 188.46

Number of Fisher Scoring iterations: 8

Interpretation of exponentiated \(\hat{\beta}\) coefficients

exp(logistic_fit$finalModel$coefficients)

   (Intercept)       lap_time     lap_number      tyre_life compoundMEDIUM 
  9.408757e-09   1.147280e+00   8.521343e-01   1.316630e+00   1.640419e+00 
  compoundSOFT 
  6.432386e+00

\(\exp(\beta_0)\): The odds of a driver making a pit stop during a lap, when lap time is 0 seconds, lap number is 0, 0 laps have been completed on the current set of tires, and the HARD compound is, is approximately \(9.4088 \times 10^{-9}\).
\(\exp(\beta_1)\): For every of 1 second increase in lap time, the odds of a driver pitting increase by a factor of 1.1473.
\(\exp(\beta_2)\): For every additional lap (i.e., increase of 1 in the lap number), we expect the odds of a driver pitting to increase by a factor of 0.8521.
\(\exp(\beta_3)\): For each additional lap completed on the current set of tires, the odds of a driver pitting increase by a factor of 1.3166.
\(\exp(\beta_4)\): When using MEDIUM compound tires instead of HARD, the odds of a driver pitting increase by a factor of 1.6404, holding all other variables constant.
\(\exp(\beta_5)\): When using SOFT compound tires instead of HARD, we expect the odds of a driver pitting to increase by a factor of 6.4324, holding all other variables constant.

Mathematically derive \(\exp(\beta_1)\)

\[ \begin{aligned} &\log(odds(pit\_in \mid lap\_time = a)) = -18.4816 + 0.1374a \\ \\ &\log(odds(pit\_in \mid lap\_time = a+1)) = -18.4816 + 0.1374(a+1) \end{aligned} \]

\[ \begin{aligned} & \log\left( \frac{odds(pit\_in \mid lap\_time = a+1)}{odds(pit\_in \mid lap\_time = a)} \right)\\ &= \log(odds(pit\_in \mid lap\_time = a+1)) - \log(odds(pit\_in \mid lap\_time = a)) \\ &= (-18.4816 + 0.1374(a+1)) - (-18.4816 + 0.1374) \\ &= 0.1374 \\ &= \hat{\beta_1} \end{aligned} \]

Therefore, \(\exp(\beta_1) = e^{0.1374} = 1.1473\)

Predicting High Probability of a Pit Stop

To predict a probability of a driver making a pit stop that is very close to 1, we need to input extreme values for the predictors. Based on the five-number summary of our data, we use the following scenario: a lap time of 148.74 seconds, lap number 57, SOFT compound, and a tire age of 45 laps.

# miami2024_glm %>% 
#   ggplot(aes(x=lap_time)) +
#     geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

summary(miami2024_glm)

    lap_time        lap_number      compound     tyre_life    
 Min.   : 90.63   Min.   : 1.00   HARD  :500   Min.   : 1.00  
 1st Qu.: 92.38   1st Qu.:14.00   MEDIUM:562   1st Qu.: 7.00  
 Median : 93.28   Median :28.00   SOFT  : 44   Median :13.50  
 Mean   : 96.00   Mean   :28.62                Mean   :14.78  
 3rd Qu.: 94.29   3rd Qu.:43.00                3rd Qu.:22.00  
 Max.   :148.74   Max.   :57.00                Max.   :45.00  
     pit_in        pit_in_fac
 Min.   :0.00000   0:1078    
 1st Qu.:0.00000   1:  28    
 Median :0.00000             
 Mean   :0.02532             
 3rd Qu.:0.00000             
 Max.   :1.00000

log_prid_fst <- predict(logistic_fit$finalModel,
            newdata = data.frame(lap_time = 148.74,
                                 lap_number = 57,
                                 tyre_life = 45,
                                 compoundMEDIUM = 0,
                                 compoundSOFT = 1),
            type = "response")
    
odds_pitting_fst = exp(log_prid_fst)
(prob_pitting_fst = odds_pitting_fst/(1+odds_pitting_fst))

        1 
0.7308921

Using our logistic regression model, we estimate the probability of a pit stop under these conditions to be approximately 0.731. This indicates a high likelihood of a pit stop given these extreme race conditions.

Predicting Pit Stops with our Logistic Regression Model

Estimate the probability of a driver making a pit stop on a lap with the following conditions: 96.00 seconds lap time, 28th lap, 14.78 laps completed on a set of HARD tires.

log_prid_hard <- predict(logistic_fit$finalModel,
        newdata = data.frame(lap_time = 96,
                             lap_number = 28,
                             tyre_life = 14.78,
                             compoundMEDIUM = 0,
                             compoundSOFT = 0),
        type = "response")

odds_pitting_hard = exp(log_prid_hard)
(prob_pitting_hard = odds_pitting_hard/(1+odds_pitting_hard))

        1 
0.5008283

There is approximately a 50.08% probability that a driver will make a pit stop on this lap when using HARD tires, holding all other variables constant.

Estimate the probability of a driver making a pit stop on a lap under the same conditions as above but using a set of MEDIUM tires.

log_prid_med <- predict(logistic_fit$finalModel,
        newdata = data.frame(lap_time = 96,
                             lap_number = 28,
                             tyre_life = 14.78,
                             compoundMEDIUM = 1,
                             compoundSOFT = 0),
        type = "response")

odds_pitting_med = exp(log_prid_med)
(prob_pitting_med = odds_pitting_med/(1+odds_pitting_med))

        1 
0.5013559

With MEDIUM tires, the probability of making a pit stop increases to 50.14%.

Estimate the probability of a driver making a pit stop on a lap under the same conditions as above but using a set of SOFT tires.

log_prid_soft <- predict(logistic_fit$finalModel,
        newdata = data.frame(lap_time = 96,
                             lap_number = 28,
                             tyre_life = 14.78,
                             compoundMEDIUM = 0,
                             compoundSOFT = 1),
        type = "response")

odds_pitting_soft = exp(log_prid_soft)
(prob_pitting_soft = odds_pitting_soft/(1+odds_pitting_soft))

        1 
0.5052337

With SOFT tires, the probability increases slightly to 50.52%.

While all the other variables stay the same, we predict that the probability a driver to made a pit stop is higher if the driver is on a set of SOFT tires compared to other compounds.

Pros/Cons of logistic regression vs. regular linear regression

Logistic Regression

Pros

Since logistic regression is based on a Bernoulli/binomial likelihood, it is a natural model for binary outcomes.

Coefficients are interpretable in terms of odds ratios (with log-odds as the linear predictor).

Cons

The relationship between predictors and the probability is not linear.

Linear Regression

Pros

Straightforward linear regression

Easy to interpret the coefficients

Cons

Cannot gaurantee that the predicted probabilities to be between 0 and 1.

Conclusion

The linear and logistic regression models provide a practical method for predicting pit-stop laps based on multiple factors, while k-fold cross-validation offers an approach for selecting the most optimized models. However, the prediction results from the logistic model were somewhat limited, with predictions averaging around 50% even with extreme parameters, making it challenging to accurately predict pit-stop laps. This limitation is understandable because predicting pit-stop timing in real races is one of the most crucial and difficult decisions in race strategy.

For future studies, we could consider additional factors that influence car speeds or tire management, as well as apply methods that are capable of handling more complex models to observed more detailed and comprehensive analyses.