Updated Basic Usage of Manipulation Functions #3360

Open
wants to merge 34 commits into
base: main

Contributor

@nathanrboyer nathanrboyer commented Jul 18, 2023

Replaces PR #2907 since it was too out of sync with the main branch.

@nathanrboyer nathanrboyer changed the title from "Initial commit" to "Updated Basic Usage of Manipulation Functions" Jul 18, 2023
docs/src/man/basics.md: two outdated review threads (resolved)
@nathanrboyer
Contributor Author

Friendly bump 🙂

@bkamins
Member

bkamins commented Sep 15, 2023

Agreed. But first we need to make a decision on #3361. I will have a look at it and comment there.

| ------------ | -------------------------------- | -------------------------------------------- | ------------------------------------------------- |
| `transform` | Creates a new data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `select` | Creates a new data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
Member

"manipulated" or "created"?

| `transform!` | Modifies an existing data frame. | Retains both source and manipulated columns. | Retains same number of rows as source data frame. |
| `select` | Creates a new data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
| `select!` | Modifies an existing data frame. | Retains only manipulated columns. | Retains same number of rows as source data frame. |
| `subset` | Creates a new data frame. | Retains only source columns. | Number of rows is determined by the manipulation. |
Member

Maybe better to say that the number of rows is determined by the condition (to differentiate it from `combine`).
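
For readers following the table rows under discussion, here is a minimal sketch of how the column and row retention rules play out. The data frame `toy` and the column names `a2` and `b10` are made up for illustration and are not part of the PR.

```julia
using DataFrames

# Hypothetical toy data frame, used only for illustration.
toy = DataFrame(a = 1:3, b = [10, 20, 30])

# transform: new data frame, source columns plus the new column, same row count.
t = transform(toy, :a => (x -> 2 .* x) => :a2)

# select: new data frame, only the column produced by the operation, same row count.
s = select(toy, :a => (x -> 2 .* x) => :a2)

# subset: new data frame, source columns only, rows chosen by the condition.
f = subset(toy, :a => ByRow(>(1)))

# transform!: same column rules as transform, but `toy` itself is modified.
transform!(toy, :b => (x -> x ./ 10) => :b10)
```

Here `t` has columns `a`, `b`, `a2`; `s` has only `a2`; `f` keeps `a` and `b` but drops the first row; and `toy` gains `b10` in place.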

docs/src/man/basics.md: outdated review thread (resolved)
### Constructing Operation Pairs
All of the functions above use the same syntax which is commonly
`manipulation_function(dataframe, operation)`.
The `operation` argument is a `Pair` which defines the
Member

Maybe add that a `Pair` is constructed using `=>`?

Member

Also - it does not have to be a pair, so maybe say it is usually a pair?
(your first example below is not a pair)
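
To make the two points above concrete, a small sketch using only Base Julia: `=>` builds a `Pair`, chained `=>` groups from the right, and a bare column selector is not a `Pair`. The names `op`, `selector`, and `:a_sum` are illustrative only.

```julia
# A Pair is built with `=>`; chained `=>` groups from the right.
op = :a => sum => :a_sum

op isa Pair          # true
op.first             # :a
op.second            # sum => :a_sum, which is itself a Pair

# A bare column selector is also accepted as an operation, and it is not a Pair.
selector = :a
selector isa Pair    # false
```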

These rules are typically called transformation mini-language.

Let us move to the examples of application of these rules
## Basic Usage of Manipulation Functions
Member

A larger question - maybe create a separate page for this tutorial?

Contributor Author

I think that would be better. Maybe "Manipulation Functions" under "User Guide" before "Split-apply-combine"?

Member

As commented below - I would put it at the top level, with a name along the lines of "A gentle introduction to manipulation functions" (so that we clearly signal that this material is less formal than the rest of the manual).

Contributor Author

What would you think of making the new Top Level section something like "Beginner's Guide" or "User's Guide for Beginners" and then placing "Manipulation Functions" at a second level under that? I'm not volunteering to rewrite the entire User's Guide, but it could leave room for others to add similar "gentle" content to the documentation. It would also make the sidebar look cleaner by splitting up the current long name.

Member

Makes sense. The other section that could go there is https://dataframes.juliadata.org/stable/man/basics/ as it has the same objective.

Contributor Author

Oh, but Basics contains the existing "Basic Usage of Manipulation Functions". I don't know how to differentiate it from this new section if they live next to each other.

I initially intended to just clarify some topics within that section, but now the scope has grown.

I can maybe overwrite that section if I add these topics that I don't currently cover:

  • "Note that this time we use string column selectors because some of the column names have spaces in them."
  • "The benefit of select or combine over indexing is that it is easier to get the union of several column selectors."
  • "It is important to note that select always returns a data frame, even if a single column selected as opposed to indexing syntax."
  • "By default select copies columns of a passed source data frame. In order to avoid copying, pass the copycols=false keyword argument."

The other sections under Basics use the German dataset, but I think it is easier to understand what is going on with smaller data frames where you know all the data values.
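
As a rough illustration of the last two quoted points (that `select` always returns a data frame, and that `copycols=false` avoids copying), here is a sketch on a small made-up data frame. The name `toy` is hypothetical.

```julia
using DataFrames

# Hypothetical toy data frame, used only for illustration.
toy = DataFrame(a = 1:3, b = 4:6)

# Indexing a single column returns a vector (a copy, because of `:` for rows).
col = toy[:, :a]

# select returns a data frame even when only one column is selected.
sub = select(toy, :a)

# By default select copies the column; copycols=false reuses the source vector.
nocopy = select(toy, :a; copycols = false)

sub.a === toy.a       # false: a fresh copy
nocopy.a === toy.a    # true: the very same vector
```

Mutating `nocopy.a` would therefore also change `toy.a`, which is the trade-off for skipping the copy.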

Member

> I can maybe overwrite that section

Yes - I think it is OK just to expand that section (especially since it is already top-level).

> I think it is easier to understand what is going on with smaller data frames

Agreed. Just please use different variable names from those already used there, so that using different data frames does not lead to confusion.

Thank you! (sorry for so many comments, but - unfortunately - writing documentation is hard)

Contributor Author

Done (sort of). I did not use new variable names though. I had been overwriting the definition of `df` and continued to do so. My data frames are so small and frequent that coming up with a new name each time would be a pain.

!!! Note
The Julia parser sometimes prevents `:` from being used by itself.
`ERROR: syntax: whitespace not allowed after ":" used for quoting`
means your `:` must be wrapped in either `(:)` or `Cols(:)`
Member

I typically recommend `All()`, which is easy to understand, I think.
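
A short sketch of the spellings mentioned in the note and in the comment above, on a made-up data frame: `All()`, `Cols(:)`, and `(:)` should all select every column as the source of an operation. The names `toy` and `:total` are illustrative only.

```julia
using DataFrames

# Hypothetical toy data frame, used only for illustration.
toy = DataFrame(a = 1:3, b = 4:6)

# Writing `: => f` is rejected by the parser, so the "all columns" selector
# needs one of the following spellings.
r1 = select(toy, All() => ByRow(+) => :total)
r2 = select(toy, Cols(:) => ByRow(+) => :total)
r3 = select(toy, (:) => ByRow(+) => :total)
```

All three calls pass columns `a` and `b` to `+` row by row, so each result has a single `total` column holding the row sums.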

These rules are typically called transformation mini-language.

Let us move to the examples of application of these rules
These functions and their methods are explained in more detail in the section
Member

Maybe add that it is not only more detailed, but also takes a more "slow-paced" and informal approach :).

@@ -0,0 +1,1345 @@
# Data Frame Manipulation Functions

The seven functions below can be used to manipulate data frames
Member

Also here, as I have commented - maybe add a sentence at the top explaining the teaching approach of this material. Thank you!

@nathanrboyer
Contributor Author

> documenter fails - have you checked why?

I've never used Documenter.jl before. I just tried to edit the make.jl and index.md files by copying what was already there.

@nathanrboyer
Contributor Author

I don't know why the cross references aren't working. It looks to me like I am doing it right. https://documenter.juliadocs.org/dev/man/syntax/#@ref-link

@nathanrboyer
Contributor Author

Need to see if documenter tests pass this time. There is also a subsection I would like at the end on performance, but I am hoping someone else can write it, even if that is as a separate PR.

@bkamins
Member

bkamins commented Oct 14, 2023

I think that the documenter error is due to the fact that your references are on level three, i.e. ### header. And these headers are not included in TOC.

@mortenpi - how could @nathanrboyer reference e.g. this section
https://github.com/JuliaData/DataFrames.jl/pull/3360/files#diff-80725fcadb5509a5fb0986e534b25ae5c65bab29c2903e8319c67c7a42143ffaR2615
in this
https://github.com/JuliaData/DataFrames.jl/pull/3360/files#diff-80725fcadb5509a5fb0986e534b25ae5c65bab29c2903e8319c67c7a42143ffaR1642
position? Thank you!

docs/src/man/basics.md: two outdated review threads (resolved)
@mortenpi
Contributor

mortenpi commented Dec 6, 2023

> @mortenpi - how could @nathanrboyer reference e.g. this section

Sorry, missed this earlier. This actually looks like a bug. The fact that it's a level 3 heading shouldn't matter. As a workaround, you might be able to use `@id`s in the headers though.

@nathanrboyer
Contributor Author

I think I fixed my errors, but I am still getting some errors below which seem unrelated to me.

I get this error on main with `julia --project make.jl`:

ERROR: LoadError: ArgumentError: makedocs() got passed invalid keyword arguments:
  strict = true

I get this error on this branch with `julia --project make.jl`:

[ Info: Doctest: running doctests.
┌ Error: doctest failure in C:\Users\nboyer.AIP\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\iteration.jl:38-73
│ 
│ ```jldoctest
│ julia> df = DataFrame(x=1:4, y=11:14)
│ 4×2 DataFrame
│  Row │ x      y
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │     1     11
│    2 │     2     12
│    3 │     3     13
│    4 │     4     14
│ 
│ julia> eachrow(df)
│ 4×2 DataFrameRows
│  Row │ x      y
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │     1     11
│    2 │     2     12
│    3 │     3     13
│    4 │     4     14
│ 
│ julia> copy.(eachrow(df))
│ 4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│ julia> eachrow(view(df, [4, 3], [2, 1]))
│ 2×2 DataFrameRows
│  Row │ y      x
│      │ Int64  Int64
│ ─────┼──────────────
│    1 │    14      4
│    2 │    13      3
│ ```
│ 
│ Subexpression:
│ 
│ copy.(eachrow(df))
│ 
│ Evaluated output:
│ 
│ 4-element Vector{@NamedTuple{x::Int64, y::Int64}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│ Expected output:
│ 
│ 4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│  (x = 1, y = 11)
│  (x = 2, y = 12)
│  (x = 3, y = 13)
│  (x = 4, y = 14)
│ 
│   diff =
│    4-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
│     Vector{@NamedTuple{x::Int64, y::Int64}}:
│     (x = 1, y = 11)
│     (x = 2, y = 12)
│     (x = 3, y = 13)
│     (x = 4, y = 14)
└ @ Documenter C:\Users\nboyer.AIP\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\iteration.jl:38
ERROR: LoadError: `makedocs` encountered a doctest error. Terminating build

@mortenpi
Contributor

mortenpi commented May 30, 2024

> strict = true

If this is using Documenter 1.0+, then that keyword was removed (and strict=true is now the default). It looks like it got correctly removed on master though: https://github.com/JuliaData/DataFrames.jl/pull/3416/files#diff-4aae2d1c783cade58bd2cb13748da956e568b7f2aed5fafd9e2a46fb97daf613L45

The doctest failures seem to be due to different printing between Julia versions (i.e. make sure you use the same Julia version locally as is being used for the docs CI).
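
To illustrate the point about printing, a small Base-only sketch: the two type spellings in the doctest diff above denote the same type, and only the printed form differs. The variable names are illustrative, and the sketch assumes Julia 1.6 or later.

```julia
nt = (x = Int64(1), y = Int64(11))

# Two spellings of the same type; which one `show` picks depends on the Julia version.
long_form  = NamedTuple{(:x, :y), Tuple{Int64, Int64}}
short_form = @NamedTuple{x::Int64, y::Int64}

long_form === short_form    # true
typeof(nt) === long_form    # true

# A doctest compares printed text, so this string is what differs between versions.
printed = string(typeof([nt]))
```

Because the types are identical, the failure says nothing about the code under test; it only shows that the expected output was recorded with a different Julia version than the one running the doctests.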

@nathanrboyer
Contributor Author

I thought that I already was in sync from this: 7614fc3
but I found another button to sync on my repository page which fixed the strict line.

I am still getting the second error, and I am using the juliaup release branch which seems to match the ci.yml.

version:
- '1.6'
- '1' # automatically expands to the latest stable 1.x release of Julia
