-
Notifications
You must be signed in to change notification settings - Fork 8
Expand file tree
/
Copy pathexploration.Rmd
More file actions
577 lines (377 loc) · 23.6 KB
/
exploration.Rmd
File metadata and controls
577 lines (377 loc) · 23.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
---
title: "Data Exploration Project"
author: "Mohit Sainani"
date: "April 23, 2019"
output: pdf_document
urlcolor: cyan
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = T, warning = F)
```
```{r, echo=FALSE, message=FALSE}
library(jsonlite)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
library(ggplot2)
library(grid)
library(igraph)
```
# Prologue
[GSMArena](http://gsmarena.com/) is a gadget review website with a focus on cellular and mobile devices. I believe to have come across the site a long time ago through the now defunct iGoogle start-page. It was among the first few tech blogs among the likes of [How-To Geek](https://www.howtogeek.com/), [XDA Developers](http://xda-developers.com/), [Android Police](http://androidpolice.com/), and [AnandTech](https://www.anandtech.com/) that had invoked in me the sort of enthusiasm for gadgets and devices that I have carried on ever since.
# Introduction
Over the years, GSMArena has compiled a rich database of mobile device specifications. The decision to scrape off their website - despite there being numerous data sets out in the wild to run exploratory analysis upon - was based upon multiple factors. Some of them are:
- an excuse to demonstrate (read : hone, improve) my data mining skills
- experiment with alternative looping techniques in R using the [purrr](https://cran.r-project.org/package=purrr) library
- most importantly, the nostalgia attached with working with such a dataset
You can find the code used for scraping device specifications and the generated raw data [here](https://github.com/cigarplug/scrape-gsma) on my GitHub. The exploration code and rmarkdown file are also included in above repository.
The questions aimed to be answered at the end of this project are:
1. Observe the trend in screen-to-body ratio of devices over the years
2. Trends in battery capacity of devices vs overall weight of devices
3. How the CPU chipmakers are related through OEMs / device manufacturers
4. Trends in adoption of multiple camera lenses and fingerprint sensors
5. Number of devices launched with each Android over the years
This project has been entirely performed in the R (ver 3.5.3) programming language.
# Data Wrangling
We start off with reading the exported csv file (created after data scrape-ing) into R and converting it into R's bread and butter data.frame structure. Each data frame element from the list is changed into the wide format. After this operation, each row represents specifications of one device.
```{r read_data, warning=FALSE, cache=T, echo=F}
gsm <- read.csv("gsm.csv")
gsm[gsm == ""] <- NA
```
```{r fix_column_names, echo=F}
colnames(gsm) <- colnames(gsm) %>% str_replace( "_\\Z", "") %>%
str_replace_all(" ", "_") %>% tolower()
```
The initial dataset comprises of 8980 rows x 86 columns
Comparing this dataset with the GSMArena repository, we find that we managed to capture data for 93.5% (8980 out of 9601) devices listed on the website.
We restrict the dataset to about 24 columns which fall under the scope of the proposed questions.
```{r helper_fnxs, echo=F}
get_dimensions <- function(x){
x %>% str_split("x", n = 3) %>% map( ~str_trim(. ,side = "left")) %>%
map(~str_extract(., "\\d+(.\\d+)?(?= )")) %>% map(~as.numeric(.)) %>%
unlist() %>% paste(., collapse = ",")
}
get_price <- function(x){
x %>% str_replace("About ", "") %>% str_split(" ") %>% unlist() %>% paste(., collapse = ",")
}
get_ram <- function(x) {
ram <- x %>% str_extract("\\d+(.\\d+)? (GB|MB)(?= RAM)")
if (grepl("MB", ram, ignore.case = F)){
(str_extract(ram, "\\d+") %>% as.numeric())/1024
} else {
str_extract(ram, "\\d+") %>% as.numeric()
}
}
get_os_type <- function(raw_df) {
if (nrow(raw_df[raw_df$sub_type == "OS",]) == 0) {
return (NA)
} else {
os_info <- raw_df %>% filter(sub_type == "OS") %>% select(val) %>%
str_split(pattern = ",|;", simplify = T) %>% extract(1)
return(os_info)
}
}
get_android_version <- function(x){
x %>% str_extract("Android \\d+(.\\d+)?")
}
sbr <- function(resolution, length, breadth, display_size){
# note: industry standard for reporting display size is the measure of diagonal
# phone screens are predominantly rectangular in shape (duh).
res <- strsplit(resolution, 'x') %>% unlist() %>% as.numeric()
# display size converted to millimeters from inches
mm <- 25.4*(display_size)
# using pythagoras theorem and ratio-proportions
x <- mm/(1 + (res[2]/res[1])^2)**0.5
screen.body.ratio <- (x*(res[2]/res[1])*x) / (length*breadth)
return(screen.body.ratio)
}
safe_sbr <- safely(sbr, otherwise = NA)
fix_andr_v <- function(x){
if (!is.na(x)){
ver <- x %>% str_extract("\\d+(.\\d+)?")
if (str_detect(x, "\\.")){
x
} else {
paste0("Android ", ver, ".0")
}
} else NA
}
```
Next, we'll clean some of the data columns that we know will be required for answering the suggested questions. Regex is used for pattern matching, and the R tidyr and dplyr libraries are used for wrangling purposes. The tasks performed are:
1. Device weight in grams
The *body_weight* column is a string which has information in both grams and ounces. We extract the weight in grams and cast it as a numeric value
2. Body dimensioms
Extract the length, breadth, and width (in millimeters) of each device from the the *body_dimensions* column
3. Display size
Extract the display size (diagonal) of the device in inches
4. Battery capacity
Extract the battery capacity (in mili ampere hours, mAh)
5. Display resolution
Extract the screen resolution as horizontal and vertical pixel counts, and save them in respective fields
6. Price
Separate the price amount and price currency into different variables
7. RAM
Derive the RAM from *memory_internal* field and convert it into gigabytes of memory
8. Camera lense count
Collpase the *main_camera* fields into a new column which indicates the count of main camera lense(s)
9. Year announced
Obtain the year in which the device was first announced
10. Screen-to-body Ratio
Although GSMArena provides this metric for devices announced in the last few years, we’d prefer to obtain this number for all devices. It may be calculated based on the diagnoal screen size, the display resolution of the screen, and length and width of the phone. Simple use of Pythagoras' theorem yields the desired value.
# Data Checking
Our dataset consists of all sorts of devices including tablets and smartwatches. These were discovered using filters based on screen size, available cellular networks, and year of announcement. We know that the first modern smartwatch was announced in 2011^[https://en.wikipedia.org/wiki/Smartwatch#2010s]. Also, tablets usually tend to be over 7 inches in screen size^[https://en.wikipedia.org/wiki/Tablet_computer#Mini_tablet].
### Removing smartwatches and tablet devices
Since we are only concerned with smart (or dumb/feature) _phones_ we can filter for devices that match certain criteria:
- Remove devices with *No cellular connectivity*
- Remove devices that have screens over 6.9 inches (predominantly tablets)
- Remove devices that were announced after 2011 and have screens smaller than 1.7 inches (mostly smartwatches)
\leavevmode\newline
### Fix Android version number
We also notice that the android versions for some of the devices are stripped down to an integer value. For example, Android 4.0 is written as Android 4 for a small subset of the devices. For all such values, we can fix them by appending a *0* _(zero)_ if the version number is an integer
```{r wrangle_clean, echo=F, warning=F}
gsm_df <- gsm %>%
mutate(weight = str_extract(body_weight, "\\d+(.\\d+)?(?= g)") %>% as.numeric()) %>%
rowwise() %>%
mutate(tmp_dim = get_dimensions(body_dimensions)) %>%
separate(tmp_dim, into = c("dim_length", "dim_breadth", "dim_width"), sep = ",") %>%
mutate(display_size_inches =
str_extract(display_size, ".*(?= inches)") %>% as.numeric()) %>%
mutate(display_tech = str_extract(display_type, "[A-Z\\s]{3,}") %>% str_trim(side = "both")) %>%
mutate(display_res = str_extract(display_resolution, ".*x.*(?= pixel)") ) %>%
mutate(battery_cap = str_extract(battery, "\\d+(?= mAh)") %>% as.numeric()) %>%
rowwise() %>%
mutate(price = get_price(misc_price)) %>%
separate(price, into = c("price_amount", "price_currency"), sep = ",") %>%
mutate(price_amount = as.numeric(price_amount)) %>%
mutate(bluetooth_v = str_extract(comms_bluetooth, "\\d+(.\\d+)?") %>% as.numeric()) %>%
rowwise() %>%
mutate(ram_gb = get_ram(memory_internal), android_ver = get_android_version(platform_os)) %>%
mutate(cam = case_when(!is.na(main_camera_single) ~ "single",
!is.na(main_camera_dual) ~ "dual",
!is.na(main_camera_triple) ~ "triple",
!is.na(main_camera_quad) ~ "quad",
!is.na(main_camera_five) ~ "five",
!is.na(main_camera_dual_or_triple) ~ "dual/triple",
TRUE ~ "none"
)
) %>%
mutate(year = str_extract(launch_announced, "\\d{4}")) %>%
mutate(sensors = features_sensors %>% str_remove_all("\\(.*?\\)") %>% str_remove_all(" ")) %>%
select(oem, model, year, weight, dim_length, dim_breadth, dim_width, ram_gb,
display_size_inches, display_tech, display_res, battery_cap,
price_amount, price_currency, android_ver, bluetooth_v, cam, sensors, sound_3.5mm_jack,
network_technology, platform_chipset
) %>%
filter(display_size_inches < 6.9) %>% # remove tablet-sized devices
filter(network_technology != "No cellular connectivity") %>% # remove non-gsm devices
filter(!(display_size_inches < 1.7 & year > 2011)) %>% # remove smartwatches
separate(display_res, c("pixels_horz", "pixels_vert"), " x ", remove = F) %>%
mutate_at(.vars = c("pixels_horz", "pixels_vert", "dim_length",
"dim_breadth", "dim_width", "year"), .funs = as.numeric) %>%
rowwise() %>%
mutate(sbr = safe_sbr(display_res, dim_length, dim_breadth, display_size_inches)$result)
```
```{r android_fix, echo = F}
gsm_df <- gsm_df %>% rowwise() %>%
mutate(android_vers = fix_andr_v(android_ver))
```
# Exploration
## Part 1
### Say No to Bezels
Bezel-less display is one of the many buzz words surrounding the mobile phone industry these days. The term bezel refers to the portion of a device’s front surface that is not screen estate. Most mobile phones today look more or less the same with their ubiquitous rectangular profile and a display ingrained on the front. However this wasn’t always the case when companies were trying to innovate on the design and usability fronts. Some of the popular designs that have gone extinct are flip-phones, phones with slide-out or front-facing hardware keyboards, devices with track-balls, and phones resembling portable gaming consoles.
\leavevmode\newline
```{r sbr, echo=F}
gsm_df %>% filter(!is.na(year) & year > 2000) %>%
ggplot(aes(x = factor(year), sbr)) +
geom_boxplot() +
labs(title = "Screen-to-Body Ratio vs Year Announced ", x = "Year", y = "Ratio") +
theme(axis.text=element_text(size=10, angle = 90),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(size=16, face = "bold")
)
```
The boxplot depicts a general trend of increasing screen estate over the years.
Notice how the rectangular boxes keep getting smaller in lenght as we move ahead the x-axis. This shift represents a homegenity in design which the industry seems to have embraced following in the footsteps of the first monolithic iPhone^[Citation Pending]. Outliers below the first quartile also witness a drop in number, signalling a decrease in feature-phone announcements, as these are typically the devices that have both a display and hardware keypad, and thus a lower screen-body-ratio.
## Part 2
### Heavier doesn't always imply higher battery capacity
How long a phone can last on a single charge is an important choice factor for many people.
We explore the trends in battry capacity of devices and how it is affected by their overall weight.
```{r battery_weight, echo=F}
gsm_df %>% filter(weight < 300 & battery_cap > 0 & year > 2002) %>%
ggplot(aes(y = weight, x = battery_cap)) +
geom_point(size = 0.2) +
facet_wrap(~year) +
labs(title = "Trends in Weight vs Battery Capacity over the Years", x = "battery capacity (mAh)",
y= "Weight"
) +
geom_smooth(method = "loess")
```
In the above plot, a few trends stand out:
- The battery capacity of devices was limited to 2000 mAh until the year 2010.
- In the earlier years - until 2012 - the weight of a device rose rapidly along with the battry capacity. That is to say, the icline of the model-fit is much steeper
- From 2015 onwards, a major chunk of the devices habe battery capacities between 2000 and 4000 mAh
- Nowadays, there is a much more gradual increase in weigh as the battery capacity rises
- A majority of the devices weigh under 200 grams
\leavevmode\newline
\newline
\newline
## Part 3
### SoC Provider Network
In this section, we try to identify the relationship amongst mobile SoC (system-on-chip) manufacturers based on the their clients: viz the OEMs. While some OEMs are large enough to design their own CPUs (like Samsung and Apple), most other OEMs buy their CPU hardware from other companies. Notable major mobile SoC manufacturers are Qualcomm, Texas Instruments (TI) and Mediatek^[https://www.anandtech.com/show/8389/state-of-the-part-soc-manufacturers].
```{r igraph, echo=F, include=F, eval=F}
chips <- gsm_df %>% filter(!is.na(platform_chipset) & oem != "Apple"
) %>%
select(oem, platform_chipset) %>%
mutate(chipset = word(platform_chipset) %>% tolower()) %>%
select(-platform_chipset) %>%
group_by(chipset, oem) %>% tally() %>%
filter(n > 10)
m <- chips %>% spread( chipset, n, fill = 0) %>%
tibble::column_to_rownames("oem") %>%
as.matrix(.)
# m[m>1] <- 1
links2 <- t(m)
# net2 <- graph_from_incidence_matrix(links2)
net2 <- graph_from_incidence_matrix(links2)
lay = layout.reingold.tilford(net2)
plt <- plot(net2,
vertex.color=c("magenta","cyan")[V(net2)$type+1],
edge.width = E(net2)$weight,
edge.label=E(net2)$weight,
edge.label.cex=4,
vertex.label.color = "black",
vertex.size = 25,
vertex.label.cex = 1.5,
# vertex.label = NA,
# arrow.mode = 3,
layout=layout.fruchterman.reingold
)
## used tkplot and took a screenshot, since it scaled better
```
```{r screenshot, echo=FALSE, fig.cap="OEM - SoC Provider Network", out.width = '100%'}
# 
knitr::include_graphics("~/www/scrape-gsma/igraph.png")
```
The above graph (Figure 1) was generated using the [igraph](https://cran.r-project.org/package=igraph) R library.
Here we see the relationship between SoC providers and OEMs:
- Exynos being Samsung's in-house brand is only (yet) used by Samsung itself
- Asus is the only OEM to have ever used Intel chipsets
- Mediatek and Qualcomm are the SoC providers for a majority of OEMs. While some OEMs source their chipsets from both of them, some are loyal to either of the two.
\leavevmode\newline
\newline
## Part 4
```{r adoption, echo=F, message=F}
adoption <- gsm_df %>%
rowwise() %>%
mutate(has_fingerprint = str_detect(sensors, regex("fingerprint", ignore_case = T)),
multi_lens = case_when(cam %in% c("single", "none") ~ "no",
TRUE ~ "yes"
)
) %>%
select(year, has_fingerprint, multi_lens) %>%
group_by(year, has_fingerprint, multi_lens) %>%
tally(name = "device_count") %>%
filter(!is.na(year))
multi_lens <- adoption %>%
select(year, multi_lens, device_count) %>%
ungroup() %>% group_by(year) %>%
mutate(perc = device_count/sum(device_count)*100) %>%
ggplot() +
geom_bar(aes(x = factor(year), y = perc, fill = multi_lens), position = "stack", stat = "identity") +
scale_fill_manual(values = c("grey75", "olivedrab3")) +
labs(title = "% of Devices with Multiple Main Camera Lenses",
x = "year", y = "percentage", fill = "Multi Lens"
) +
theme(axis.text=element_text(size=10, angle = 90),
axis.title=element_text(size=12,face="plain"),
plot.title = element_text(size=16, face = "bold"),
text = element_text(size = 10, face = "plain"),
legend.position = "bottom"
)
finger <- adoption %>%
select(year, has_fingerprint, device_count) %>%
ungroup() %>% group_by(year) %>%
mutate(perc = device_count/sum(device_count)*100) %>%
ggplot() +
geom_bar(aes(x = factor(year), y = perc, fill = has_fingerprint),
position = "stack", stat = "identity") +
scale_fill_manual(values = c("grey75", "olivedrab3")) +
labs(title = "% of Devices with Fingerprint Sensor",
x = "year", y = "percentage", fill = "fingerprint sensor"
) +
theme(axis.text=element_text(size=10, angle = 90),
axis.title=element_text(size=12,face="plain"),
plot.title = element_text(size=16, face = "bold"),
text = element_text(size = 10, face = "plain"),
legend.position = "bottom"
)
```
In terms of senosrs and features, mobile phones have become extremely versatile devices. What started as a device meant for serving the purpose of calling people remotely and sending text messages, has evolved into a feature-packed gadget that comes with camera(s) for photography, GPS for navigation, fingerprint readers for secure access, and accelerometers and gyro that help in motion sense gaming. A few mobile phones even come equipped with health-related sensors like heart-rate monitors, or are IP certified to be used in extereme conditions, such as underwater.
Some of these features (or trends) have become more-or-less standardized and we will take look at two of them: multiple main camera lenses, and fingerprint readers
```{r fingerprint, echo=F, fig.height=4}
finger
```
While fingerprint sensors first appeared on mobile devices in the year 2006, they weren't particularly popular. The technology behind them wasn't much involved either. Adoption of fingerprint sensors started gaining momentum from the year 2014.
In 2016, almost half the devices had a fingerprint reader.
In 2018 and 2019, over 75% mobile phones shipped with the sensor.
```{r multi_lens, echo=F, fig.height=4}
multi_lens
```
The momentum around multiple camera lenses witnesses a similar exponential growth: within 4 years of the first multi-camera setups being announced in 2015, nearly 75% devices boast of the feature in 2019. The benefits of multi-lens systems aren't particularly well laid-out, but that's just how the industry seems to operate. No OEM wants to feel left out of the trends.
## Part 5
The first android device (according GSMArena's database) was released in the year 2008. It was T-Mobile's G1 running Android 1.6 (Donut). It probably marked the beginning of a new era for smartphone operating systems. Since then, Android has become wildly dominant and perhaps the only OS (apart from Apple's iOS) to survive in today's market. The likes of Symbian, Microsoft's Windows Mobile, BlackBerry OS, Palms OS, and Samsung's Bada OS have all become extinct now.
We take a look at the ratio of Android and non-Android devices announced in different years:
```{r android_dominance, echo = F}
# gsm_df %>%
# filter(!is.na(year)) %>%
# mutate(is_android = ifelse(is.na(android_vers), "non-android", "android")) %>%
# select(is_android, year) %>%
# group_by(is_android, year) %>% tally(name = "dev_count") %>%
# ggplot() +
# geom_bar(aes(x = factor(year), y = dev_count, fill = is_android),
# position = "stack", stat = "identity") +
# theme(axis.text=element_text(size=10, angle = 90),
# axis.title=element_text(size=12,face="bold"),
# plot.title = element_text(size=16, face = "bold")
# ) +
# labs(title = "Number of Android and Non-Android Device launches",
# x = "year", y = "count of devices", fill = "is android?"
# )
gsm_df %>%
filter(!is.na(year)) %>%
mutate(is_android = ifelse(is.na(android_vers), "non-android", "android")) %>%
select(is_android, year) %>%
group_by(is_android, year) %>% tally(name = "dev_count") %>%
ungroup() %>% group_by(year) %>%
mutate(perc = dev_count/sum(dev_count)*100) %>%
ggplot() +
geom_bar(aes(x = factor(year), y = perc, fill = is_android), position = "stack", stat = "identity") +
annotate("segment", x = 9, y = 107,
xend = 9, yend = 101, arrow = arrow()
) +
geom_text(aes(label = "First Android phone", x = 9, y = 110)) +
labs(title = "% of Android and Non-Android Device launches",
x = "year", y = "percentage", fill = "is android?"
) +
theme(axis.text=element_text(size=10, angle = 90),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(size=16, face = "bold"),
text = element_text(size = 8, face = "plain")
) +
scale_y_continuous(breaks = c(0,25,50,75,100))
```
The above chart clearly displays the sort of dominance that Android has managed to gain in the smartphone market. In the last few of years, an overwhelming majority of mobile phones were running Android. The other devices are either running iOS, or some community-driven/ enthusiast software such as the Kai OS.
# Conclusion
The questions proposed initially were tweaked slightly. This was done in order to allow for experimenting with more interesting plots (such as the SoC network graph) rather than the typical bars and lines. All the modified questions have been reasonably explored in this report.
I learnt from this project that web-scraping is a challenging task, esp. when we're dealing with thousands of iterations over web pages.
I learnt that trend adoption happens very quickly in the mobile world - take fingerprint readers and multiple camera lenses for example. It must be acknowledged that Apple may not be the first to introduce a particular technology, but it has been the trend setter when it comes to touchscreens, fingerprint sensors, and notched displays.
I also realized how dominant Android has become as a smartphone OS. The relationship between mobile SoC manufacturers is also worth pondering upon.
Overall, the mobile device industry has come a long way in the roughly 2 decades of time that it has existed for. Trends are dropped as easily as they are adopted, and staying a pioneer is difficult. This is evident from Nokia's demise and re-surgence^[Note: Nokia's case wasn't explored in the report, but serves as an example].
# Reflection
While I managed to obtain the tabular-like data from the website, I could also have included more information, such as the device poplarity and number of page hits of each device. It would have allowed for some more interesting results involving only the most popular devices.
The 93.5% success rate of scraping is nice to have but it would have been better to store the error logs for failed scraping instances. A majority of these errors, I suspect, were due to page encoding (or XML) related issues.
The dataset itself is pretty huge, and deeper analysis was possible. Most of the exploration performed here is at an year (time) level. Other grouping variables such as OEMs could have been used. I also realized that the information about a device having notched display is not available in the GSMArena tables.
# References
1. Gsmarena.com. (2019). GSMArena.com - mobile phone reviews, news, specifications and more.... [online] Available at: https://www.gsmarena.com/
2. Technocracy. (2019). Mobile Device Specifications from GSMArena. [online] Available at: http://cigarplug.ml/blog/mobile-device-specifications-from-gsmarena/