Add data frame RFC #3

Status: Closed

Diff: 252 additions, 0 deletions — `text/0003-data-frames.md`
- Feature Name: data_frames
- Start Date: 2020-08-05
- RFC PR: [nushell/rfcs#0003](https://github.com/nushell/rfcs/pull/3)
- Nushell Issue: [nushell/nushell#0000](https://github.com/nushell/nushell/issues/0000)

# Summary

[summary]: #summary

This RFC merges the Row and Table Value types into a single new value type: Frame. Data frames take inspiration from data processing systems like R and Pandas. Data frames will play the fundamental role of modelling data in Nu and will have enough descriptive power to describe all forms of structure, including streaming tables, lists, and objects.
**Contributor:** I'm completely unfamiliar with R or pandas, and I've never heard the term 'data frame'. Maybe a little more detail or an example could make this summary more accessible?

**Author:** Yeah, will definitely fill that out. The way I'm using the term here is that it's a 2-dimensional block of data. There are some columns, and these are uniform across all the rows in the block. I think technically data frames are a bit more configurable than that, but I wanted to start with a slightly more restricted definition and adjust from there.


# Motivation

[motivation]: #motivation

The current system has a few unexpected limitations:
**Contributor:** The inlining of nested tables is a limitation right now too, correct? If the nested table is incredibly large, we could easily run out of memory since it doesn't get streamed. Arguably we could solve this without data frames, but it seems like what's being proposed here will potentially solve that problem?

**Author:** I'll add that to the list. Yes, this protocol lets us stream inner tables also, so you could get the initial structure, remember where the inner tables are, and then read the contents of those inner tables from the stream.


- The top-level rows represent a table of rows, but it's unclear how to represent a top-level list of strings vs a stream of strings.
- A similar ambiguity exists between an "object" (a data structure denoted by key/value pairs) and a table of one row.
- Inner-tables are modelled differently than top-level tables, leading to confusion.
- There is currently no way to represent a matrix.
**Contributor:** Arguably, this could be a matrix: `echo [[1 2] [3 4]]`. Not saying it'd be easy to work with, but I think it's representable 🙂

**Author:** lol, true. I guess a real matrix vs a list of lists. I could call that out.

- As rows are streamed instead of tables, it's not possible to predict how to display this information. This is currently mitigated by buffering some number of rows and treating them as one "table"
- Likewise, since rows are streamed instead of tables, it's unclear what the user should expect if they request a column that is not present, as this column may appear in the following row instead.
- When table data is serialized, there is a large amount of duplication, as columns are repeated with each row sent.
- Additionally, there is currently no way to represent rows using a row literal. We propose a frame literal that will fill this role.

# Guide-level explanation

[guide-level-explanation]: #guide-level-explanation

Data frame representation:

```rust
struct DataFrame {
    headers: Option<Vec<String>>,
    rows: Vec<Vec<Value>>,
    partial_frame_id: Option<Uuid>,
    is_object: bool,
}
```

**Contributor** (on the struct): Since I'm not familiar with nu's current representations, repeating that here for a quick comparison could help.

**@obust** (Aug 12, 2020, on `rows`): Pandas dataframes are stored as lists of columns, each of which is an array, for column-based arithmetic efficiency. Maybe more insightful is a document of a hypothetical pandas 2.0 design, if pandas were rewritten: https://dev.pandas.io/pandas2/

**Author:** Cool, thanks for the heads up! Will definitely check it out.

## Tables

A self-contained table would look like this:

```rust
let frame = DataFrame {
    headers: Some(vec!["name", "age"]),
    rows: vec![
        vec![Value::from("Bob"), Value::from(30)],
        vec![Value::from("Sally"), Value::from(43)],
    ],
    partial_frame_id: None,
    is_object: false,
};
```

The above code could be created using this Nu syntax:

```sh
[name: [Bob, Sally], age: [30, 43]]
```

**Comment:** How would you express this in nu syntax, and/or the above in rust syntax, if you were to state the shape without data? For instance, to say that Windows `ls` will always have 4 columns and n rows and Windows `ls -l` will always have 8 columns and n rows. Alternatively, is there a constructor that says this df is 5 rows and 6 columns? At some point, I expect we'll have variables that can hold a dataframe. It's hard for me to visualize how this will work in a streaming environment where things are built up and torn down in a pipeline.

**Author:** @fdncred - for the first question, I think you're asking "how do you write types in Nu?" We'll probably need a separate RFC for that, as types will be their own topic. Or maybe you're asking how we handle matrices and how this differs from a list of lists?

**Comment:** I was asking about initializing a dataframe with a predetermined shape, as `ls` may have. `ls -l` on Windows will have a predetermined number of columns. One could think of making a dataframe with 2 columns and 3 rows as an empty dataframe with only column names, and then, as the pipeline progresses, updating the information in those rows. In order to do this, some type of initialization of the df would have to take place. Maybe the term is "dataframe literal". I think this is what you've created here with `[name: [Bob, Sally], age: [30, 43]]`, but this one is fully populated. Can I do `[name: [], age: []]` and then populate it later in the pipeline?

**Author:** @fdncred - ah, I think I got it. There isn't a way to fill in a dataframe, though we could think of creating some API around that like we do for TaggedDictBuilder and related. Not sure what you mean by populate it later in the pipeline. Since we're passing values through, you'd create a new value. But maybe these helpers would be able to take in a shape and let you fill it in? Seems doable.

**Comment:** Yes, take in a shape and fill it. This may not be exactly functional, but once we get to scripts I can easily see initializing a dataframe variable (assuming we have variables) and populating it with various pipelines.

  1. Define a shape with just columns: `df --define columns 3 name size sum`
  2. Now populate it: `| update sum { ls | get size + accumulator }` (bad syntax but hopefully you get the point)
  3. @andrasio probably has examples because he's frequently doing `| default wassup 0 | blah | blah | update wassup wassam`

**Contributor:** I think @fdncred's question from below is a better fit here:

  If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like `[columns: [name, level], rows: [[Thomas, 12], [Fred, 15], [Mark, 3]]]`. Maybe not that exact syntax but you get my meaning.

The data frame keeps each row separate, but the proposed table syntax groups by column. That's surprising and maybe not enough. Maybe the column names can use the argument syntax from `alias`? With new lines:

  [
   {name, age},
   [Thomas, 35],
   [Fred, 15]
  ]

Single line: `[{name, age},[Thomas, 35],[Fred, 15]]`

**Author:** Potentially, yeah. Something I'm not sure of is whether we should be row-major or column-major inside of the data frame. In practice, we probably filter by column more than row, so grouping column values together internally might make the most sense. If so, perhaps we reflect that in the syntax. This feels like something we'll need to actually experiment with to see how it feels in Nu.

**Contributor:** I can imagine that there are going to be two syntaxes to specify tables, both row-major and column-major, while the internal representation should be more predictable. But yeah, some experiments make sense. Since I want to learn Rust, I might try to build a tiny "table" parser myself. Nothing to wait for 😅

## Lists and matrices

A self-contained list would look like this:

```rust
let list = DataFrame {
    headers: None,
    rows: vec![
        vec![Value::from(1), Value::from(2), Value::from(3)]
    ],
    partial_frame_id: None,
    is_object: false,
};
```

**Contributor** (on `is_object`): What if `is_object` were true here? If that's allowed, how should it be interpreted? If that's not allowed without headers, perhaps we should have some enum to capture the different styles – object, non-object with headers, non-object without headers – so as to prevent developer error.

  [EDIT] Keeping this comment here, but you discuss it in the note below. If we don't take the enum approach, we should at least mention `is_object` should only be considered if `headers.is_some()` (I think?).

**Author:** I believe so.

The above code could be created using this Nu syntax:

```sh
[1, 2, 3]
```

## Objects (aka hash tables)
**Contributor:** aka dictionary, map? 'hash table' sounds rather implementation specific to me.


A self-contained object would look like this:

```rust
let obj = DataFrame {
    headers: Some(vec!["name", "level"]),
    rows: vec![
        vec![Value::from("Thomas"), Value::from(12)]
    ],
    partial_frame_id: None,
    is_object: true,
};
```

**Note:** we use the boolean in the data frame rather than an enumeration because all processing on the frame remains uniform regardless of whether the frame is a single row with headers or an object. This simplifies algorithms to only have to work with the data directly, and we can later render and/or serialize this data in a way that maintains the user's model.
**Contributor:**

  we use the boolean in the table

What 'table' is this referring to?

**Author:** Should be 'data frame'. I'm trying to say here that using a boolean rather than making an enum of the contents allows commands to ignore the object vs row distinction and focus on the data. It's a pretty minor point, admittedly.


The above code could be created using this Nu syntax:

```sh
[name: Thomas, level: 12]
```

**Comment:** If I have 100 rows of data, do I have to repeat the column names for each row? It may be nice to consider something like `[columns: [name, level], rows: [[Thomas, 12], [Fred, 15], [Mark, 3]]]`. Maybe not that exact syntax but you get my meaning.

**Author:** I give an example above for how to write a dataframe. This example is about "objects", or hash tables, so we only have one value per column.

**Comment:** I think we must be talking past each other, because I do understand what you're saying, but I think you didn't understand what I was saying. I'm just showing a possible way of creating a dataframe literal without repeating the column names for every row. I define the column names one time with `[columns: [name, level]` and then add the rows with `[rows: [[Thomas, 12], [Fred, 15], [Mark, 3]]]`. `[Thomas, 12]` is one row, `[Fred, 15]` is another row, and `[Mark, 3]` is the last row.

**Author:** Sorry, you're right. I totally missed what the example was saying. Yeah, we could do some kind of tagging like that to differentiate the headers from the rows. If we go this route, how would it look when there aren't header values?

**Comment:** With no header values, I think we'd just use indexes like 0 and 1 for a two-column table and be able to do `df | get 0` to get the first column's data. If we want to leave a column blank and did not previously define the columns, using the example above, I'd do something like this: `[rows: [[,12], [,15], [,3]]]`. That creates a 2-column, 3-row table. The columns are named 0 and 1 (indexes), the first column is blank, and the second column is filled in with 12, 15, 3.

## Streaming

An important part of Nu is being able to work with potentially endless streams of data. As is often the case when working with external commands, there's no guarantee that the output will terminate.

We need to be able to represent the results of processing a stream of unprocessed data as a stream of data frames.

To be able to output a data frame as a stream, we need to know two key elements: that the data frame is incomplete as-is, and a unique identifier that allows us to later stream additional data for this frame.

To accomplish this streaming, we also introduce an `EndFrame`. Frame and EndFrame work together to allow a frame to be streamed as a multi-part frame, ending once the corresponding EndFrame has been read.

As an example, let's say we were processing some content and wanted to output the first row and later the second row of this table:

| tag | length |
| ---- | ------ |
| head | 1024 |
| body | 8192 |

To do this, we would send the two separate frames, both marked as partial frames.
**Contributor:** This could add, ", ending the stream with an EndFrame with the same frame id."


```rust
let frame_id = Uuid::new_v4();
output.send(UntaggedValue::DataFrame {
    headers: Some(vec!["tag", "length"]),
    rows: vec![
        vec![Value::from("head"), Value::from(1024)]
    ],
    partial_frame_id: Some(frame_id),
    is_object: false,
}.into_value());

// ... time passes

output.send(UntaggedValue::DataFrame {
    headers: Some(vec!["tag", "length"]),
    rows: vec![
        vec![Value::from("body"), Value::from(8192)]
    ],
    partial_frame_id: Some(frame_id),
    is_object: false,
}.into_value());

output.send(UntaggedValue::EndFrame(frame_id).into_value());
```

**Contributor** (on `partial_frame_id`): If we stream nested frames, will different ids be interleaved? Will the onus be on commands to track that? Will there be helper methods/structs for dealing with that? Maybe a light discussion on that.

**Author:** Yeah, we'd probably want some helper methods. Will have to think about that more.
**Contributor:** It might be good to have some code sample in the above section showing what it would look like on the receiving end. I'm particularly interested in the more complex cases, like frames being nested in frames.


# Reference-level explanation

[reference-level-explanation]: #reference-level-explanation

Much of the implementation is part of the explanation in the previous section. In this section, we'll explore more of the impact of making this change.

`UntaggedValue` will have `Table` and `Row` replaced with `DataFrame` (possibly just called `Frame` for brevity) and `EndFrame`.

All commands that operate on Row today will need to be updated to work with the data frame instead.

Rather than operating on a single row, commands will need to be updated to handle a frame at a time. Here, the mapping should largely be the same, though an additional inner loop to process the rows will be necessary. This processing can be done serially or in parallel, and may be done synchronously or asynchronously.

Commands that worked over inner tables should be able to migrate to data frames, as there is a strong overlap in functionality.

Commands that filter will work similarly to before, and may opt to output streams, which are then flattened into the output stream. This allows them to return no frame at all if no rows in the frame met the requirements of the filter.

# Drawbacks

[drawbacks]: #drawbacks

Some drawbacks come to mind:
**Contributor:** These are mostly drawbacks of implementing data frames. Are there no drawbacks or limitations to frames themselves, similar to the drawbacks you listed for rows/tables above?

**Author:** Yeah, I should list those out. Some that come to mind:

  • data frames (at least in the proposal) are rigid in terms of structure, which means that it becomes harder to describe heterogeneous data where columns may only be present on some but not all rows.
  • data frames have the potential to grow their own complexity. In addition to row- and column-major layouts, we may also be tempted to explore optimizations like unboxing columns when we know all values are the same type, allowing the whole frame to lay in memory unboxed, etc. This type of thing opens us up to more speed at the cost of additional complexity.

**Author:** Just thought of a potentially rather large one. Let's say we go column-based internally, how do we handle data coming from sources like JSON?

`echo '[{"name": "Bob"}, {"name": "Sally"}]' | from json`

Today, the above gives you a table just fine, and you can immediately start using that. It'd be a shame if we landed on a design where it stopped working.


- This is a large, non-trivial amount of work. Getting this landed, updating the commands to use the new model, and thoroughly testing will take time.
**Contributor:** Any plan for transitioning slowly, or does this work have to be done as a single unit? No need to document the plan here, but the ability to iterate on this is not immediately obvious to me, so it may be worth describing if (and perhaps how) that would be possible.

**Author:** One thing we could do is to document how to transition code from one style to another. We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

**Contributor:**

  We could also support both Row and Frame for a time, allowing people to transition off the old protocol while we roll onto the new.

I think that'd be the way to go in the future. We would need an RFC process with a clear timeline for backwards-incompatible protocol changes. We'll also need to figure out how to communicate that to the nushell community 🙂

- This will break most, if not all, third-party plugins
**Contributor:** Hopefully we don't have to do this again anytime soon, but I'm thinking this would be a good opportunity for us to think about deprecations in our protocol. How do we version the plugins and know what version of the protocol they want? How long do we keep old Value types around before fully deprecating them?

**Author:** We'll definitely want to add that to the plugin protocol. I don't think it's currently part of it.


# Rationale and alternatives

[rationale-and-alternatives]: #rationale-and-alternatives

## Everything is a frame

One alternative is to require everything to live inside of a frame. There are some advantages here: this is seemingly more uniform, but at the risk of overloading the data frame concept.
**Contributor:** I don't understand this paragraph. What "everything" isn't included in the proposal, for this to be an alternative?

**Author:** Here "everything" would mean all of the data primitives. In practice, this largely changes what data type would be streamed between commands. Commands would interact with each other firstly with a data frame, so that each step would start with a frame first. I'm not sure if, in practice, this buys us much simplification, but I wanted to at least mention it.


# Prior art
**Contributor:** Since streaming seems to be a big motivator for this, I wonder if there's other prior art regarding streams.

**Comment:** As a total outsider, I'd take a look at Apache Arrow here. A lot of their messaging/docs are focused on efficient columnar storage (which I assume is not relevant here), but they have two features that are probably interesting for Nu to learn from:

**Author:** @alanhdu - thanks for the tip, I'll definitely check these out.

**Comment:** High Level API docs on Apache Arrow for Rust: https://docs.rs/arrow/1.0.1/arrow/ — seems to have most of the relevant stuff needed for generating ideas on how to move forward.

**Comment:** https://github.com/nevi-me/rust-dataframe/blob/master/notes/update-01__04-04-2020.md — some more thoughts on dataframes in rust using arrow and a dataframe package.


[prior-art]: #prior-art

Other data processing systems and languages have a data frame concept. The R language and the `pandas` library for Python use it as a model for working with data in tabular format.

## R data frame

```r
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
```

which outputs:

```
  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27
```

## Pandas data frame

Below is an example of the pandas data frame:

**Comment:** If anyone is interested, this is where pandas defines the DataFrame class. Lots of code here but interesting: https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py

```python
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
```

# Unresolved questions

[unresolved-questions]: #unresolved-questions

- Are there syntactic ambiguities with the proposed syntax? This will require that we support parsing data frames, which includes colons and commas at the end of bare words.
**Contributor:** Arguably that could break things for some users, but the idea of not being 1.0 yet is that we're still trying to figure things out. I doubt it will break much. That being said, I've been wanting to put together a description of our grammar. Calling out what you're adding and what would break in a grammar would make this super clear 🙂

**Author:** I'll add a section about this.

- How do we want to handle partial inner data frames? That is, a data frame that is inside of another data frame.
**@thegedge** (Contributor, Aug 5, 2020): This seems like a big one (e.g., `sys`). Is the idea to answer that here after thinking about it a bit more, or is it something to explore during implementation?

  [EDIT] Or are you thinking we don't solve that right now, and maintain the status quo where nested rows/tables currently don't stream?

**Author:** Right, it seems like something we could add later as needed. The protocol would support it, but we'd want to have more API surface to deal with it.

- How do we handle non-data frames in between data frames? Do all partial data frames have to stream out until complete?
**Contributor:** Ideally, no, but we'd probably have to relay information back through the stream to allow that. Probably an RFC on its own 🙂

**Contributor:** What do you consider a non-data frame? As far as I can tell, this proposal doesn't define it.

**Author:** I'll use a better term here. I meant "data types that aren't data frames", like strings, numbers, etc.


# Future possibilities

[future-possibilities]: #future-possibilities

We would like to be able to extend Data Frames further to be able to handle sending snapshots of data at the current time. This allows us to stream updates to existing tables, allowing viewers to animate as data is updated.

We may also elect to add type information to the columns, so that we can maintain a more rigorous internal representation.
**Contributor:** In that case, maybe the headers should be more than an optional list of strings, so that further information can be added there later. In JavaScript/JSON the solution is to start with a list of objects, instead of a list of strings, so that more properties can be added to the object later. I guess that can be applied here, too. Maybe in Rust that would be a HashMap, starting with only a name property?

**Author:** Agreed. I was hoping we could evolve in that direction rather than trying to figure it out with this RFC. One thing we could do (which I proposed recently) is to create an experimental implementation for data frames and try to add support to a few commands. See how it works in practice, and if it turns out we almost always have the type information there because the source knows it, we can just add it. For example, `ls` knows all the types of its columns ahead of time, so just do it.


Frames also allow us to store values in an unboxed way if we can ensure all the values in a column are the same type, and that this holds for all columns in the frame.

Commands that collect a stream into a list could potentially have the option to merge all partial data frames into self-contained data frames for further processing.
**Contributor:** typo? optional => option

  merge together all partial data frames into self-contained data frames for further processing.

Merging partial frames (when the end frame is received) into a single data frame makes sense to me. Though I don't understand the distinction with "self-contained data frames" – how are those different from partial frames? Why would it still be multiple frames, not a single one?

**Author:** This is a way of saying "a data frame that isn't partial", so all of its data is in that one frame. It would only be the single one, yeah.