How to make good reproducible pandas examples

ghz 1years ago ⋅ 413 views

Question

Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex data

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your data structure?


Answer

Note: Most of the ideas here are pretty generic for Stack Overflow, indeed for questions in general. See Minimal, Reproducible Example or Short, Self Contained, Correct Example.

Disclaimer: Writing a good question is hard.

The Good:

  • Do include a small example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+').

    In [2]: df
    Out[2]:
       A  B
    0  1  2
    1  1  3
    2  4  6

Test it yourself to make sure it works and reproduces the issue.
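One way to test this without touching the clipboard is to push the same text through `pd.read_csv`, which is what `pd.read_clipboard` hands its arguments to. This is only a sketch: the `text` variable is an illustrative stand-in for whatever a reader would copy out of the question.

```python
import io

import pandas as pd

# Stand-in for the text a reader would copy from the question.
text = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# Two or more whitespace characters separate fields; a regex separator
# needs the Python parsing engine.
df = pd.read_csv(io.StringIO(text), sep=r"\s\s+", engine="python")
print(df)
```

If the frame that comes back does not match what you posted (for instance because a column value contains double spaces), it is probably better to post constructor code instead.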

* You can [format the text for Stack Overflow](/editing-help#code) by highlighting and using `Ctrl`+`K` (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.

* I really do mean **small**. The vast majority of example DataFrames could be fewer than 6 rows,[citation needed] and **I bet I can do it in 5**. Can you reproduce the error with `df = df.head()`? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

But every rule has an exception, the obvious one being for performance issues (in which case definitely use [%timeit](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) and possibly [%prun](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-prun) to profile your code), where you should generate:

            df = pd.DataFrame(np.random.randn(100000000, 10))

Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.
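A minimal sketch of the seeding idea, assuming a much smaller frame than the one above so it runs quickly. The seed value and the row count are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # any fixed seed works; 0 is an arbitrary choice
df = pd.DataFrame(np.random.randn(1000, 10))  # row count chosen only for illustration

# Re-seeding and regenerating gives an identical frame,
# so everyone running the snippet sees the same numbers.
np.random.seed(0)
df_again = pd.DataFrame(np.random.randn(1000, 10))
print(df.equals(df_again))
```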

* For getting runnable code, [`df.to_dict`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html) is often useful, with the different `orient` options for different cases. In the example above, I could have grabbed the data and columns from `df.to_dict('split')`.
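As a rough illustration of that round trip, here is one way it might look with the small frame from earlier. The variable names `parts`, `rebuilt` and `as_lists` are just for this sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# 'split' keeps index, columns and data as separate entries.
parts = df.to_dict(orient='split')
print(parts)

# Rebuild the frame from the printed pieces.
rebuilt = pd.DataFrame(parts['data'], index=parts['index'], columns=parts['columns'])

# 'list' gives a compact column -> values mapping that can be pasted
# straight into pd.DataFrame(...).
as_lists = df.to_dict(orient='list')
print(as_lists)
```

For a larger frame, something like `df.head().to_dict(orient='list')` keeps the pasted output short.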
  • Write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]:
       A  B
    0  1  5
    1  4  6

Explain where the numbers come from:

The 5 is the sum of the B column for the rows where A is 1.

  • Do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]:
       B
    A
    1  5
    4  6

But say what's incorrect:

The A column is in the index rather than a column.

The docstring for sum simply states "Compute sum of group values"

The [groupby documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html#cython-optimized-aggregation-functions) doesn't give any examples for this.

Aside: the answer here is to use df.groupby('A', as_index=False).sum().
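Put together as a self-contained snippet, the attempt and the fix from this aside might look like the following. The names `indexed` and `flat` are only labels for this sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# The attempt: 'A' ends up as the index of the result.
indexed = df.groupby('A').sum()
print(indexed)

# The fix: as_index=False keeps 'A' as a regular column.
flat = df.groupby('A', as_index=False).sum()
print(flat)
```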

  • If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
    

Sometimes this is the issue itself: they were strings.
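A small sketch of how to check this before posting, using made-up data (the `date` and `value` columns here are hypothetical):

```python
import pandas as pd

# Hypothetical data: the dates arrive as plain strings.
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
                   'value': [1, 2, 3]})

# Before conversion the column is not a datetime dtype.
was_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])

df['date'] = pd.to_datetime(df['date'])

# After conversion it is, so datetime-only operations such as resample work.
is_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])
print(df.dtypes)
print(df.set_index('date')['value'].resample('2D').sum())
```

Including the `df.dtypes` output in the question lets readers see at a glance whether a column is a string or a real timestamp.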

The Bad:

  • Don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B
    1 2  3
      2  6

The correct way is to include an ordinary DataFrame with a [set_index](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html) call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C'])

    In [13]: df = df.set_index(['A', 'B'])

    In [14]: df
    Out[14]:
         C
    A B
    1 2  3
      2  6
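If you already have a MultiIndex frame in hand and want to produce that kind of constructor code from it, one possible approach (a sketch, not the only way) is to flatten the index with `reset_index` and dump the result as a dict:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])

# Flatten the MultiIndex into columns so the data prints as a plain dict.
flat = df.reset_index().to_dict(orient='list')
print(flat)

# A reader pastes the dict and restores the index with set_index.
rebuilt = pd.DataFrame(flat).set_index(['A', 'B'])
print(rebuilt)
```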

  • Do provide insight into what it is when giving the outcome you want:

           B
        A
        1  1
        5  0

Be specific about how you got the numbers (what are they)... double check they're correct.

  • If your code throws an error, do include the entire stack trace. This can be edited out later if it's too noisy. Show the line number and the corresponding line of your code which it's raising against.

  • Pandas 2.0 introduced a number of changes, and Pandas 1.0 before that, so if you're getting unexpected output, include the version:

    pd.__version__
    

On that note, you might also want to include the version of Python, your OS, and any other libraries. You could use pd.show_versions() or the session_info package (which shows loaded libraries and Jupyter/IPython environment).
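A short sketch of collecting that information in one go, using only the standard library alongside pandas. The exact fields worth reporting will depend on the question.

```python
import platform
import sys

import pandas as pd

pandas_version = pd.__version__
python_version = sys.version.split()[0]
os_description = platform.platform()

print("pandas:", pandas_version)
print("python:", python_version)
print("os:", os_description)

# pd.show_versions() prints a fuller report, including optional dependencies.
```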

The Ugly:

  • Don't link to a CSV file we don't have access to (and ideally don't link to an external source at all).

    df = pd.read_csv('my_secret_file.csv') # ideally with lots of parsing options
    

Most data is proprietary, we get that. Make up similar data and see if you can reproduce the problem (something small).
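As a rough sketch of what "similar data" could mean: keep the column names and dtypes of the real file, but invent the values. Everything below (the column names, the value ranges, the seed) is hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a private file: same column names and dtypes,
# invented values.
rng = np.random.default_rng(0)  # fixed seed so readers get the same frame
fake = pd.DataFrame({
    'user': ['Bob', 'Jane', 'Alice', 'Bob', 'Jane'],
    'date': pd.date_range('2021-01-01', periods=5, freq='D'),
    'income': rng.integers(40000, 50000, size=5),
})

print(fake)
print(fake.dtypes)
```

Then run the problematic step on `fake` and confirm it still misbehaves before posting.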

  • Don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

Essays are bad; it's easier with small examples.

  • Don't include 10+ (100+??) lines of data munging before getting to your actual question.

Please, we see enough of this in our day jobs. We want to help, but not like this.... Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.