|
1 | | -(ns book.chapter-2-input-output.2-1-loading-data) |
| 1 | +(ns book.chapter-2-input-output.2-1-loading-data |
| 2 | + (:require |
| 3 | + [tablecloth.api :as tc])) |
2 | 4 |
|
3 | 5 | ;; # 2.1 How to get data into the notebook |
4 | 6 |
|
|
21 | 23 | ;; TODO: Link to useful explainer on lazy seqs |
22 | 24 |
|
23 | 25 | ;; #### With tablecloth |
| 26 | + |
24 | 27 | ;; For most work involving tabular/columnar data, you'll use tablecloth, Clojure's go-to data |
25 | 28 | ;; wrangling library. These all return a `tech.ml.dataset Dataset` object. The implementation |
26 | 29 | ;; details aren't important now, but `tech.ml.dataset` is the library that allows for efficient |
|
69 | 72 |
|
70 | 73 | ;; ##### Specify file encoding |
71 | 74 |
|
72 | | -;; |
| 75 | +;; TODO: Does this really matter? Test out different file encodings.
73 | 76 |
|
74 | 77 | ;; ##### Normalize values into consistent formats and types |
75 | 78 |
|
76 | | -;; Tablecloth makes it easy to apply arbitrary transformations to all values in a given column: |
| 79 | +;; Tablecloth makes it easy to apply arbitrary transformations to all values in a given column.
| 80 | + |
| 81 | +;; We can inspect the column metadata with tablecloth: |
| 82 | + |
| 83 | +(-> dataset |
| 84 | + (tc/info :columns)) |
| 85 | + |
| 86 | +;; Certain types are built in (tablecloth knows how to convert them), e.g. numbers:
| 87 | + |
| 88 | +(-> dataset |
| 89 | + (tc/convert-types "CO2" :double) |
| 90 | + (tc/info :columns)) |
| 91 | + |
| 92 | +;; The full list of keywords representing the types tablecloth supports comes from the underlying
| 93 | +;; `tech.ml.dataset` library: |
| 94 | +(require '[tech.v3.datatype.casting :as casting]) |
| 95 | +@casting/valid-datatype-set |
| 96 | + |
| 97 | +;; More details on [supported types here](https://github.com/techascent/tech.ml.dataset/blob/master/topics/supported-datatypes.md). |
| 98 | + |
| 99 | +;; You can also process multiple columns at once, either by specifying a map of columns to data types: |
| 100 | + |
| 101 | +(-> dataset |
| 102 | + (tc/convert-types {"CO2" :double |
| 103 | + "adjusted CO2" :double}) |
| 104 | + (tc/info :columns)) |
| 105 | + |
| 106 | +;; Or by changing all columns of a certain type to another: |
| 107 | + |
| 108 | +(-> dataset |
| 109 | + (tc/convert-types :type/numerical :double) |
| 110 | + (tc/info :columns)) |
| 111 | + |
| 112 | +;; The supported types of columns are: |
| 113 | + |
| 114 | +;; :type/numerical - any numerical type |
| 115 | +;; :type/float - floating point number (:float32 and :float64) |
| 116 | +;; :type/integer - any integer |
| 117 | +;; :type/datetime - any datetime type |
| 118 | + |
| 119 | +;; There is also a `:!type` qualifier, which selects the complement set -- all columns that
| 120 | +;; are _not_ of the specified type.
| 121 | + |
| 122 | +;; For other types you need to provide a casting function yourself, e.g. when parsing strings:
| 123 | +(-> dataset |
| 124 | + ;; (tc/convert-types "Date" :local-date-time) |
| 125 | + (tc/info :columns)) |
| 126 | + |
| 127 | +;; For full details on all the possible options for type conversion of columns see the |
| 128 | +;; [tablecloth API docs](https://scicloj.github.io/tablecloth/index.html#Type_conversion) |