Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Make explicit in pandas IO doc the imports and options #28089

Merged
merged 10 commits into from
Nov 7, 2019

Conversation

TanyaaCJain
Copy link
Contributor

@TanyaaCJain TanyaaCJain commented Aug 22, 2019

Reference @datapythonista - 149

- Removes code header in IO doc page.
- Uses explicit imports in the IO page of user guide on first occurence of requirement.
- In reference to pandas-dev#28038
@datapythonista
Copy link
Member

Thanks for the work on this @TanyaaCJain, looks great. Did you check if the pandas and numpy options are being used? It'd probably be useful to generate a diff of the current (master) html generated for that page, and the one with this changes, and without the options. So we know for sure what's being affected by this change.

@TomAugspurger @jorisvandenbossche @jreback this is the option that makes more sense to me. I think having a block with the imports and the options at the top of every file doesn't really look great. Thought?

@TomAugspurger
Copy link
Contributor

This (suppressing the imports) seems fine for rendered HTML on the majority of the pages. I think we intentionally don't suppress them on the first few of the getting started.

For the binder version, presumably these will be visible / executed? Or are we not thinking that far ahead yet?

@datapythonista
Copy link
Member

The main motivation of this change is not having hidden code, so the integration with Binder is easier and there is no magic for the users.

For the API pages I guess it will be more controversial. But in my opinion the changes to the user guide are very small and surely worth.

@jorisvandenbossche
Copy link
Member

The main motivation of this change is not having hidden code,

Note that in the current diff of this PR, it is still "hidden" to the user for the online html docs

.. ipython:: python
:suppress:

import pandas as pd
pd.options.display.max_rows = 15
clipdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
index=['x', 'y', 'z'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize this was hidden, thanks @jorisvandenbossche for pointing it out.

@TanyaaCJain this block is not shown to the user, can you move the clipdf data where it's first used, and the import to the first next block (assuming there are no more hidden blocks before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the changes in the commit 18a9b7c.

- imported pandas and added its options in the first visible code block
- moved the suppressed or invisible code block using clipdf just before the code block using the clipdf variable
- imported `BytesIO` in its first occurrence of requirement unlike previously imported along with `StringIO`
- imported os again for adding it on its first occurrence of requirement in a visible code block and is earlier imported in a suppressed code block.
@TanyaaCJain
Copy link
Contributor Author

The new commit does this:

  • imported pandas and added its options in the first visible code block
  • moved the suppressed or invisible code block using clipdf just before the code block using the clipdf variable
  • imported BytesIO in its first occurrence of requirement unlike previously imported along with StringIO
  • imported os again for adding it on its first occurrence of requirement in a visible code block and is earlier imported in a suppressed code block.

I need help with:

  • np.random.seed(123456) is currently (in this PR) being called just before the first occurrence of np.random.randn() which is in a suppressed block. What is to be done?
    • Add it again in a visible code block with first occurrence of np.random.randn() as done with import os?
    • Let it only be in the suppressed code block?
  • I am not able to get any diff while comparing the current (master) html with the one in this PR. Can you help me with the code. Probably, not writing the right one.

Please suggest if there are any other changes as well.

@@ -137,7 +128,9 @@ usecols : list-like or callable, default ``None``

.. ipython:: python

from io import StringIO, BytesIO
import pandas as pd
pd.options.display.max_rows = 15
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can actually remove this line.

I quickly checked and there are only a few longer dataframes shown in this page, and they are all longer. So with the current default of only showing first/last 5 rows when truncated, this option has no effect (it would only have the effect that dataframes longer than 15 but shorter than 60 would still be truncated, but such dataframes are not present in this file)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the changes in commit d3036f3.

@jorisvandenbossche
Copy link
Member

np.random.seed(123456)

In general, do we care much about setting this?

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference regarding the seed. I think for the long term would be nice to avoid examples with random data. But for now I think it's perfectly fine to remove the seed and have different examples every time (I don't think anybody cares, including search engines, and we the production docs are frozen between releases).

It'd also be ok to me if we decide to remove the numpy printooptions. I think data looks nicer with it, but if setting it can cause confusion to users, not a problem for me to get rid of it.

:suppress:

clipdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': ['p', 'q', 'r']},
index=['x', 'y', 'z'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is really needed. Do you mind having a look @TanyaaCJain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes.. I missed mentioning this. Thanks @datapythonista for the reminder! Even I couldn't find this assignment be of use anywhere in the file. So, should we go ahead with getting rid of this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have he feeling that the read_clipboard was an ipython block before, and we had this clipdf and a to_clipboard before reading from the clipboard. But since not it's a python block and the code is written manually, I guess we can simply get rid of this.

@TanyaaCJain
Copy link
Contributor Author

  • I am not able to get any diff while comparing the current (master) html with the one in this PR. Can you help me with the code. Probably, not writing the right one.

Also, @datapythonista can you help me with this?

@datapythonista
Copy link
Member

Also, @datapythonista can you help me with this?

My idea was to go to an updated master branch git checkout master && git fetch upstream && git merge upstream/master, then generate the documentation (just this page if you want to save time) ./doc/make.py --single=user_guide/io.rst and then save somewhere the resulting html file doc/build/html/user_guide/io.html.

The, go to your branch, generate that page again and do a diff -u io_from_master.html io_without_header.html.

That should highlight everything that will be different for the user. Better to do it wihout removing the seed, otherwise all the data will be different.

Does that make sense?

@TanyaaCJain
Copy link
Contributor Author

TanyaaCJain commented Aug 23, 2019

This is the diff.txt generated, can be viewed as html if the extension is changed. The file compared to original did not have the clipdf assignment along, but the numpy print option and seed. I'll commit the new changes after this.

@datapythonista
Copy link
Member

didn't think about that, but the changes make the number of the In [x] and Out [x] of jupyter change, and the diff is huge, not so easy to see what changed. But thank you for it.

@@ -137,7 +128,8 @@ usecols : list-like or callable, default ``None``

.. ipython:: python

from io import StringIO, BytesIO
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I missed the original discussion but just to be clear, is the assumption here that we only show the import the first time it shows up one of the rst files and that the rest of the code blocks use it from there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's an assumption. It is implicitly being proposed, for simplicity and for inertia on how the header worked, but it's surely open to discussion. I thought about that too, and I was also wondering if it'd make sense to split the long pages we have now in the user guide in shorter pages. I guess that could make things easier if we finally make the code runnable.

@datapythonista
Copy link
Member

Answering this comment from @WillAyd on the issue: #28038 (comment)

This is still open to discussion, but personally I much like this approach than the header. There are some advantages:

  • Simpler
  • Runnable code with no magic for users
  • If we make the code runnable with binder, not having headers/hidden code will probably help
  • Sphinx will report the correct line on errors
  • If what we did in this PR can extrapolate to the rest, there is not so much needed in the header anyway

I agree with your point that in a huge documentation page, having the import just at the beginning doesn't necessarily make sense, the document will often not be accessed sequentially. But discussed with @jorisvandenbossche, and seems reasonable to split very long pages. And with binder in mind, probably worth having the imports repeated at the beginning of main sections, rather than hidden.

So, while this PR may not be a perfect solution yet, I think it's a step forward in the right direction.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2019 via email

@TanyaaCJain
Copy link
Contributor Author

Would you all prefer to split up this file into smaller sections too or just want it for way more lengthier ones? Though I think splitting doc into sections would help to understand difference in file much easily, if preferred.

didn't think about that, but the changes make the number of the In [x] and Out [x] of jupyter change, and the diff is huge, not so easy to see what changed. But thank you for it.

I tried for a few other ways that could possibly generate a difference ignoring the ipython lines but none generated the desired output. I’ll update if I do find one.

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2019

If runnable code with binder is the main draw here (that's what I am inferencing at least, perhaps incorrectly) is there any reason why we don't defer moving things around like this until something like binder would actually get implemented?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2019

Sounds good. I'll defer to you all then as I think you have a better big picture view of this and I don't think anything here is a showstopper. Just wanted to offer an alternative view - thanks!

@jorisvandenbossche
Copy link
Member

@WillAyd to come back to your concern (as it was not fully addressed IMO):

I'm not sure I am on board with the concept. Associating each import with its first reference in the rst feels a little arbitrary and I think is just confusing. Seems like it could be somewhat of a nuisance when moving things around; nothing major but in any case a step backwards from just knowing standard imports are done in the header

Currently, all standard imports (pandas, numpy) are at the beginning of each rst file, but hidden for the user. Now we still do them at the top of the file, but visible. So a net improvement, and not arbitrary.

Or at least, that's the case for the pandas import. But you are right that if the from io import BytesIO import is somewhere in the middle of the file, and then re-used without import near the end of the file in a totally different section, this can be confusing (not necessarily more confusing than not showing the import at all as it is now, though):

  • for writing docs, I don't think this is a problem, we have CI checking for this now to prevent errors
  • for really common imports (I think mainly numpy), it's maybe worth to still put this at the top of the file with the pandas import instead at the location of first usage? (in the end, that is also what I do in my notebook workflow)
  • for less common imports that we want to do at the place of usage, I think we can actually repeat this import multiple times if they happen in a single file but in unrelated sections.

And as Marc said, going to smaller files will also help with this.

@jorisvandenbossche
Copy link
Member

Would you all prefer to split up this file into smaller sections too

@TanyaaCJain I think we want to work towards that, but that's certainly for another PR (for the IO page we have #10446 about this)

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

status on this PR @jorisvandenbossche @datapythonista

@datapythonista
Copy link
Member

If nobody objects to remove the header in the docs, the seeds, and make the imports explicit, this is ready to be merged, and we can start working on the rest of the docs.

@jorisvandenbossche
Copy link
Member

I am fully behind removing the header and moving the imports in actual code-blocks.

One thing that might be good to decide on before starting other PRs on the rest of the docs, is where to put those imports.
In the current PR, it is "the first code block where it is used" for everything.

I think that is fine for less common imports, but it could be an option to still put eg numpy and pandas (and matplotlib.pyplot for those rst files that use it) in a visible code block in the beginning of each file.

(but also fine to first try out with the approach as in this PR, and evaluate later about that question)

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@datapythonista @jorisvandenbossche still want to merge?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 7, 2019 via email

@WillAyd WillAyd added this to the 1.0 milestone Nov 7, 2019
@WillAyd WillAyd merged commit 7adc14a into pandas-dev:master Nov 7, 2019
@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

Thanks @TanyaaCJain

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants