A blog about technical stuff

Jan Schulz

BI Engineer @ kfzteile24.de | Python enthusiast (knitpy, pandas, statsmodels, matplotlib, ggplot, pypandoc) | PhD student @ TU Freiberg (bibliometrics, academic research productivity, social networks, social capital) | Private homepage (de)



Python data pipelines similar to R's '%>%'

For a few years now, pipelines (via %>% of the magrittr package) have been quite popular in R, and the growing ecosystem of the “tidyverse” is built around them. Having tried both the pandas syntax (e.g. chaining like df.groupby().mean() or plain function2(function1(input))) and R’s pipeline syntax, I have to admit that I like the pipeline syntax a lot more.

In my opinion the strengths of R’s pipeline syntax are:

  • The same verbs can be used for different inputs (there are SQL backends for dplyr), thanks to R’s single-dispatch mechanism (called S3 objects).
  • Thanks to using functions instead of class methods, it’s also more easily extendable (for a new method on pandas.DataFrame you have to add that to the pandas repository or you need to use monkey patching). Fortunately, both functions and singledispatch are also available in python :-)
  • It uses normal functions as pipeline parts: input %>% function() is equivalent to function(input). Unfortunately, this isn’t easily matched in python, as python’s evaluation rules would first evaluate function() (i.e. call the function without any input). So one has to make function() return a helper object which can then be used as a pipeline part.
  • R’s delayed evaluation rules make it easy to evaluate arguments in the context of the pipeline, e.g. df %>% select(x) would be converted to the equivalent of pandas df[["x"]], e.g. the name of the variable will be used in the selection. In python it would either error (if x is not defined) or (if x was defined, e.g. x = "column"), would take the value of x, e.g. df[["column"]]. For this, some workarounds exist by using helper objects like select(X.x), e.g. pandas-ply and its Symbolic expression.
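
To make the singledispatch point a bit more concrete, here is a minimal sketch using functools.singledispatch from the standard library. The describe function and its two implementations are made up for this illustration and are not part of any of the libraries mentioned here:

```python
from functools import singledispatch

@singledispatch
def describe(data):
    """Generic fallback: called for every type without a registered implementation."""
    raise NotImplementedError("describe is not implemented for %s" % type(data))

@describe.register(list)
def describe_list(data):
    # implementation which is picked when the first argument is a list
    return "list with %d items" % len(data)

@describe.register(dict)
def describe_dict(data):
    # implementation which is picked when the first argument is a dict
    return "dict with keys %s" % sorted(data)

print(describe([1, 2, 3]))
print(describe({"b": 1, "a": 2}))
```

New implementations can be registered from any module, without touching the code which defines describe.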

There exist a few implementations of dplyr-like pipeline verbs for python (e.g. pandas itself, pandas-ply (uses method chaining instead of a pipe operator), dplython, and dfply), but they all focus on implementing dplyr style pipelines for pandas.DataFrames and I wanted to try out a simpler but more general approach to pipelines.

The code

The following shows my take on how to implement the first three things (I left out “Symbolic expressions”). The code is available in https://github.com/janschulz/pydatapipes. The short (removed the docstrings) version is actually only a few lines of code:

from functools import singledispatch, wraps

class PipeVerb():
    """Object which represents a part of a pipeline"""
    def __init__(self, func, *args, **kwargs):
        self.pipe_func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, input):
        return self.pipe_func(input, *self.args, **self.kwargs)

def pipeverb(func):
    """Decorator to convert a function to a pipeline verb (without singledispatch)"""
    @wraps(func)
    def decorated(*args, **kwargs):
        return PipeVerb(func, *args, **kwargs)
    # If it is a singledispatch method, expose the register method here as well
    if hasattr(func, 'register'):
        decorated.register = func.register

    return decorated

def make_pipesource(cls):
    """Enables a class to function as a pipe source"""
    if hasattr(cls, '__rshift__') and (not getattr(cls.__rshift__, 'pipeoperator', False)):
        def __rshift__(self, other):
            """Pipeline operator if the right side is a PipeVerb"""
            if isinstance(other, PipeVerb):
                return other.__rrshift__(self)
            else:
                # anything else is handled by the original operator
                return self.__orig_rshift__(other)

        cls.__orig_rshift__ = cls.__rshift__
        cls.__rshift__ = __rshift__
        setattr(cls.__rshift__, "pipeoperator", True)

def singledispatch_pipeverb(func):
    """Convenience decorator to convert a function to a singledispatch pipeline verb"""
    return pipeverb(singledispatch(func))

Simple pipeline verbs

For end users wanting to build a new pipeline verb or add pipeline functionality to a new data source, there are two functions to build new pipeline parts:

#from pydatapipes.pipes import singledispatch_pipeverb, make_pipesource
import pandas as pd

# generic version which defines the API and should raise NotImplementedError
@singledispatch_pipeverb
def append_col(input, x = 1):
    """Appends x to the data source"""
    raise NotImplementedError("append_col is not implemented for data of type %s" % type(input))

# concrete implementation for pandas.DataFrame
@append_col.register(pd.DataFrame)
def append_col_df(input, x = 1):
    # always ensure that you return new data!
    copy = input.copy()
    copy["X"] = x
    return copy

# ensure that pd.DataFrame is usable as a pipe source
make_pipesource(pd.DataFrame)

This can then be used in a pipeline:

import pandas as pd
print(pd.DataFrame({"a" : [1,2,3]}) >> append_col(x=3))
   a  X
0  1  3
1  2  3
2  3  3

The above example implements a pipeline verb for pandas.DataFrame, but due to the usage of singledispatch, this is generic. By implementing additional append_col_<data_source_type>() functions and registering them with the original append_col function, the append_col function can be used with other data sources, e.g. SQL databases, HDF5, or even builtin data types like list or dict:

@append_col.register(list)
def append_col_list(input, x = 1):
    return input + [x]

[1, 2] >> append_col()
[1, 2, 1]

If a verb has no actual implementation for a data source, it will simply raise a NotImplementedError:

try:
    1 >> append_col()
except NotImplementedError as e:
    print(e)
append_col is not implemented for data of type <class 'int'>

A more complex example: grouped and ungrouped aggregation on DataFrames

singledispatch also makes it easy to work with grouped and ungrouped pd.DataFrames:

@singledispatch_pipeverb
def groupby(input, columns):
    """Group the input by columns"""
    raise NotImplementedError("groupby is not implemented for data of type %s" % type(input))

@groupby.register(pd.DataFrame)
def groupby_DataFrame(input, columns):
    """Group a DataFrame"""
    return input.groupby(columns)

@singledispatch_pipeverb
def summarize_mean(input):
    """Summarize the input via mean aggregation"""
    raise NotImplementedError("summarize_mean is not implemented for data of type %s" % type(input))

@summarize_mean.register(pd.DataFrame)
def summarize_mean_DataFrame(input):
    """Summarize a DataFrame via mean aggregation"""
    return input.mean()

@summarize_mean.register(pd.core.groupby.GroupBy)
def summarize_mean_GroupBy(input):
    """Summarize a grouped DataFrame via mean aggregation"""
    return input.mean()

df = pd.DataFrame({"a" : [1, 2, 3, 4], "b": [1, 1, 2, 2]})
print(df >> summarize_mean())
a    2.5
b    1.5
dtype: float64
print(df >> groupby("b") >> summarize_mean())
     a
b
1  1.5
2  3.5


Compared to R’s implementation in the magrittr package, input >> verb(x) can’t be rewritten as verb(input, x).

The problem here is that verb(x) under the hood constructs a helper object (PipeVerb) which is used in the rshift operation. At the time of calling verb(...), we can’t always be sure whether we want an object which can be used in the pipeline or want to already compute the result. As an example consider a verb merge(*additional_data). You could call that as data >> merge(first, second) to indicate that you want all three (data, first, and second) merged. On the other hand, merge(first, second) is also valid (“merge first and second together”).
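
A small sketch of that ambiguity, using a stripped-down stand-in for the PipeVerb class and a made-up merge verb for lists (both are only for illustration, the real merge verb does not exist in this code):

```python
class PipeVerb:
    """Minimal stand-in for the PipeVerb helper object."""
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, data):
        # called for `data >> PipeVerb(...)`
        return self.func(data, *self.args, **self.kwargs)

def merge_(*lists):
    """Hypothetical 'merge' for lists: concatenates all inputs."""
    result = []
    for part in lists:
        result = result + part
    return result

def merge(*args):
    # At call time there is no way to know whether `args` is the complete
    # input or only the additional data, so this always returns a pipeline part.
    return PipeVerb(merge_, *args)

pending = merge([2], [3])         # no merged list yet, only a PipeVerb
merged = [1] >> merge([2], [3])   # the computation happens here

print(type(pending).__name__)
print(merged)
```

So merge([2], [3]) on its own never returns the merged list, which is why a separate function version (here merge_) is needed for direct calls.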

Usage as function and pipeline verb

To help work around this problem, the convenience decorator singledispatch_pipeverb is actually not the best option if you want to create reusable pipeline verbs. Instead, the singledispatch_pipeverb decorator is also available in two parts, so that one can both expose the original function (with singledispatch enabled) and the final pipeline verb version:

#from pydatapipes.pipes import pipeverb, singledispatch

# first use singledispatch on the original function, but define it with a trailing underscore
@singledispatch
def my_verb_(input, x=1, y=2):
    raise NotImplementedError("my_verb is not implemented for data of type %s" % type(input))

# afterwards convert the original function to the pipeline verb:
my_verb = pipeverb(my_verb_)

# concrete implementations can be registered on both ``my_verb`` and ``my_verb_``
@my_verb_.register(list)
def my_verb_df(input, x=1, y=2):
    return input + [x, y]

A user can now use both versions:

[1] >> my_verb(x=2, y=3)
[1, 2, 3]
my_verb_([9], x=2, y=3)
[9, 2, 3]

Rules and conventions

To work as a pipeline verb, functions must follow these rules:

  • Pipelines assume that the verbs themselves are side-effect free, i.e. they do not change the inputs of the data pipeline. This means that actual implementations of a verb for a specific data source must ensure that the input is not changed in any way, e.g. if you want to pass on a changed value of a pd.DataFrame, make a copy first.
  • The initial function (not the actual implementations for a specific data source) should usually do nothing but simply raise NotImplementedError, as it is called for all other types of data sources.

The strength of the tidyverse is its coherent API design. To ensure a coherent API for pipeline verbs, it would be nice if verbs followed these conventions:

  • Pipeline verbs should actually be named as verbs, e.g. use input >> summarize() instead of input >> Summary()
  • If you expose both the pipeline verb and a normal function (which can be called directly), the pipeline verb should get the “normal” verb name and the function version should get an underscore _ appended: x >> verb() -> verb_(x)
  • The actual implementation function of a verb() for a data source of class Type should be called verb_Type(...), e.g. select_DataFrame()

Missing parts

So what is missing? Quite a lot :-)

  • Symbolic expressions: e.g. select(X.x) instead of select("x")
  • Helper for dplyr style column selection (e.g. select(starts_with("y2016_")) and select(X[X.first_column:X.last_column]))
  • all the dplyr, tidyr, … verbs which make the tidyverse so great
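
To give an idea of what the first missing part could look like, here is a very rough sketch of a symbolic expression helper. The Symbol class, the X object and select_columns are hypothetical names for this illustration (working on a list of dicts instead of a DataFrame) and are not how pandas-ply or the other libraries implement it:

```python
class Symbol:
    """Hypothetical placeholder which remembers an attribute name instead of evaluating it."""
    def __init__(self, name=None):
        self._name = name

    def __getattr__(self, name):
        # only called for unknown attributes: X.x creates a Symbol carrying "x"
        if name.startswith("_"):
            raise AttributeError(name)
        return Symbol(name)

X = Symbol()

def resolve(column):
    """Turn a Symbol or a plain string into a column name."""
    return column._name if isinstance(column, Symbol) else column

def select_columns(rows, *columns):
    """Select columns from a list of dicts (stand-in for a DataFrame)."""
    names = [resolve(c) for c in columns]
    return [{n: row[n] for n in names} for row in rows]

rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
print(select_columns(rows, X.x))   # same result as select_columns(rows, "x")
```

The real libraries go a lot further (e.g. whole expressions like X.x + 1), this only covers the plain column name case.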

Some of this is already implemented in the other dplyr-like python libs (pandas-ply, dplython, and dfply), so I’m not sure how to go on. I really like my version of pipelines, but duplicating their work feels like a waste of time. So my next step is seeing if it’s possible to integrate this with one of these solutions, probably dfply as that looks like the closest implementation.

[This post is also available as a jupyter notebook]

Python development on Windows: making it comfortable

Recently someone was surprised that I use windows as my main dev machine, as other OSes are usually more developer friendly. Out of the box, this is true. But to make yourself at home as a developer, you usually change a lot of things, no matter if you are using OS X, Linux or Win. So here is what I use:

  • proper command line: cmder with git
  • Pycharm + Notepad++ as editor
  • python from miniconda with multiple envs
  • jupyter notebook with a conda env kernel manager

Not all is windows specific… I actually suspect that a lot is windows agnostic and I would use a similar setup on a different OS…

A proper command line: cmder

Windows cmd is pretty limited, both because there are almost no commands available and because the terminal window itself lacks tab completion, history, proper C&P… I use cmder as a replacement. Use the upcoming 1.3 version, as it changes the way the config / startup files are handled -> available as an artifact in the Appveyor builds (e.g. this one). It comes with better tab completion (including for git commands), history, search of previous commands, c&p, git integration in the prompt, and can be customized via a startup profile. It also includes a copy of git for windows 2.x, so for most cases, there is no need to install git by yourself. You can use cmd, bash (comes with the copy of git) and powershell.

I install it in a dropbox subfolder, which means that I have the same environment even at work. Run cmder.exe /REGISTER ALL once as admin to get the cmder here item in the right click menu in windows explorer.

In config\user-profile.cmd, I add a few more path items and also start an ssh agent:

:: needs the private ssh key in %USERPROFILE%\.ssh\
@call start-ssh-agent

:: add my own scripts
@set "PATH=%PATH%;%CMDER_ROOT%\vendor\jasc"

:: add unix commands from existing git -> last to not shadow windows commands...
@set "PATH=%PATH%;%GIT_INSTALL_ROOT%\usr\bin\"

Thanks to the last line, I have ls, grep, find, ssh, … available in the command line.

Aliases are in config\aliases. I add things like

w=where $1
cdp=cd c:\data\external\projects 
ls_envs=ls c:\portabel\miniconda\envs\
note="C:\Program Files (x86)\Notepad++\notepad++.exe" $*

I also customize the prompt (via a config/conda.lua file) so that activating a conda env will show up in the prompt (The need for the reset is IMO a bug):

 -- Find out the basename of a file/directory (last element after \ or /)
 -- @return {basename}
function basename(inputstr)
        sep = "\\/"
        local last = nil
        local t={} ; i=1
        for str in string.gmatch(inputstr, "([^"..sep.."]+)") do
                --t[i] = str
                --i = i + 1
                last = str
        end
        return last
end

 -- Find out if the String starts with Start
 -- @return {boolean}
function string.starts(String,Start)
   return string.sub(String,1,string.len(Start))==Start
end

 -- Find out current conda env
 -- @return {false|conda env name}
function get_conda_env()
    env_path = clink.get_env('CONDA_DEFAULT_ENV')
    if env_path then
        basen = basename(env_path)
        return basen
    end
    return false
end

 -- after conda activate: reset prompt 
function reset_prompt_filter()
    -- reset to original, e.g. after conda activate destroyed it...
    if string.match(clink.prompt.value, "{lamb}") == nil or not string.starts(clink.prompt.value,"\x1b[") then
        -- orig: $E[1;32;40m$P$S{git}{hg}$S$_$E[1;30;40m{lamb}$S$E[0m
        -- color codes: "\x1b[1;37;40m"
        cwd = clink.get_cwd()
        prompt = "\x1b[1;32;40m{cwd} {git}{hg} \n\x1b[1;30;40m{lamb} \x1b[0m"
        new_value = string.gsub(prompt, "{cwd}", cwd)
        clink.prompt.value = new_value
    end
end

 -- add conda env name 
function conda_prompt_filter()
    -- add in conda env name
    local conda_env = get_conda_env()
    if conda_env then
        clink.prompt.value = string.gsub(clink.prompt.value, "{lamb}", "["..conda_env.."] {lamb}")
    end
end

clink.prompt.register_filter(reset_prompt_filter, 10)
clink.prompt.register_filter(conda_prompt_filter, 20)

local function tilde_match (text, f, l)
    if text == '~' then
        return true
    end
end

clink.register_match_generator(tilde_match, 1)

git setup

I usually add two remotes: the upstream repo as origin (using the https URL for git clone) and my fork as mine (using the ssh URL for git remote add mine <ssh-url>). I do that even in cases where I am the upstream.

mine is setup as the default remote push location and git push defaults to the current branch. That way I can do things like git push without specifying a remote or without getting a confirmation message on first push of a branch.

Thanks to the ssh agent started by cmder on startup, I only have to give my password once per session.

I’ve setup notepad as the git commit editor but probably will switch to Sublime Text because of the better spell checking…

The following are the relevant lines of my %USERPROFILE%\.gitconfig:

[core]
	editor = \"C:\\Program Files (x86)\\Notepad++\\notepad++.exe\"  -multiInst -nosession -noPlugin
	excludesfile = ~/.gitignore-global # for things like the .idea dir from pycharm
[push]
	# don't show a setup message on first push of the branch
	default = current
[remote]
	# per default push to "mine"
	pushdefault = mine
[alias]
	unadd = reset HEAD --
	fixup = commit --amend --no-edit
	slog = log --pretty=oneline --abbrev-commit
	dc = diff --cached
	# specially for word files which are shown as text in the latest git for windows 2.x builds
	wd = diff --word-diff

I also install git-extras, mainly for git pr (checkout a github PR directly from origin), git ignore, git changelog

Python development: editors, conda

Editors: Pycharm, Notepad++, Sublime Text 3

I mainly use a combination of Pycharm (IDE for bigger projects/changes), Notepad++ (small patches, build related stuff) and recently Sublime Text 3 (replacement for Notepad++, let’s see…). Notepad++ is setup to replace notepad.exe, so anything which calls notepad will bring up Notepad++. Other than that, I use no special config for the IDE/editors…

conda python

I currently use a miniconda py27 setup (which I should update to a py3.x based one, but am too lazy…), but use envs for most of the work (e.g. the main env has mostly only conda + conda build related stuff in it). The default env is added to the default path (either by the installer or by using setx path C:\portabel\miniconda;C:\portabel\miniconda\Scripts;%PATH% in a cmd, not cmder window). I create additional envs with conda create -n <env-name> python=x.x pandas matplotlib ... as needed. Pycharm can use envs as additional interpreters, so no problem there… On the command line, thanks to the above cmder setup, an ls_envs will show all environments and activate <env-name> works without problems and the conda env name is shown in the command line.

I installed the visual studio compilers for 2.7, 3.4 and 3.5 by religiously following the following blog post on “Compiling Python extensions on Windows” by @ionelmc. It works!

If conda has no package for the package you want, activate the env, conda install pip and then use pip to install the package into that env. conda list shows both conda packages and pip packages.

Jupyter notebook

I have one “jupyter-notebook” env which holds the install for the notebook (e.g. conda create -n jupyter-notebook python=3.5 notebook). I start notebook servers via shortcuts, which point to the jupyter-notebook.exe entry in the jupyter-notebook env (e.g. C:\portabel\miniconda\envs\jupyter-notebook\Scripts\jupyter-notebook.exe) and which are setup to start in the main project directory (e.g. c:\data\external\projects\projectA\). That way I can startup multiple notebook servers in different project dirs by using multiple shortcuts.

Add all conda envs as kernels

I use Cadair/jupyter_environment_kernels (with an additional PR) as a kernel manager, so all my conda environments show up as additional kernel entries. For each project, I setup a new conda environment which is then used in the project notebooks as kernel.

Add-ons for jupyter notebook

I install the jupyter notebook extensions (installed in the jupyter-notebook conda environment), mainly for the Table of Content support.

I also add some ipython magic commands to each python environment which is used as notebook kernel:

proper diffs and commits for notebooks

I usually don’t want to commit the outputs of a notebook to git, so I strip them with a git clean filter.

I also want git diff to show something which I can actually read instead of the raw json file content, so I also setup a special converter which is used by git diff before comparing the files.

There are a lot of scripts around for that, but most use python (e.g. strip output (gist, kynan/nbstripout) and nbflatten), and this is slow for big notebooks. :-( Fortunately, the nbflatten gist also introduced me to jq, something like grep and sed/awk for json data. After sorting out a windows bug, this jq based nbflatten script now works on windows, too. Below is a slightly adjusted variant of that script.

This needs a recent jq.exe (>=1.6, not yet released, go to https://ci.appveyor.com/project/stedolan/jq and click on one of the passing builds -> 64bit -> ARTIFACTS) due to a bug in 1.5. Put jq.exe in your path (e.g. <cmder>\bin) and add the following file somewhere:

# based on https://gist.github.com/jfeist/cd00aa3b681092e1d5dc
def banner: "\(.) " + (28-(.|length))*"-";
# metadata
("Non-cell info" | banner), del(.cells), "",
# content
(.cells[] | (
     ("\(.cell_type) cell" | banner), 
     (.source[] | rtrimstr("\n")), # output source
     if ($show_output == "1") then # the cell output only when it is requested..
       (select(.cell_type=="code" and (.outputs|length)>0) | (
         ("output" | banner),
         (.outputs[] | (
            (select(.text) | "\(.text|add)" | rtrimstr("\n")),
            (select(.traceback) | (.traceback|join("\n"))),
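            (select(.text or .traceback|not) | "(Non-plaintext output)")
         ))
       ))
     else
       # assumption: the end of the script got lost here, so without requested output nothing is emitted
       empty
     end
))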

I put mine as nbflatten.jq into a cmder subdir.

I now have the following in my global .gitconfig:

[filter "stripoutput"]
	# removes output and execution counts from the notebook before committing it
    clean = "jq --indent 1 '(.cells[] | select(has(\"outputs\")) | .outputs) = [] | (.cells[] | select(has(\"execution_count\")) | .execution_count) = null'"
[diff "ipynb"]
	# uses a "flattened" representation of the notebook for diffing
	# note the quotes and the escapes for the quotes around the filename and the backslashes...
    textconv = "jq -r -f \"C:\\Users\\jschulz\\Dropbox\\Programme\\cmder\\vendor\\jasc\\nbflatten.jq\" --arg show_output 0"
    cachetextconv = false
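
To make it clearer what the clean filter does to a notebook, here is roughly the same transformation as a python sketch. This is only for illustration (the function name strip_outputs is made up, and the jq version above is what actually runs in the filter):

```python
import json

def strip_outputs(notebook):
    """Return a copy of a notebook dict with outputs and execution counts removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of plain JSON data
    for cell in cleaned.get("cells", []):
        if "outputs" in cell:
            cell["outputs"] = []              # same as `.outputs = []` in jq
        if "execution_count" in cell:
            cell["execution_count"] = None    # same as `.execution_count = null` in jq
    return cleaned

notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["1 + 1"], "execution_count": 3,
         "outputs": [{"output_type": "execute_result", "data": {"text/plain": ["2"]}}]},
    ],
    "metadata": {},
}
print(json.dumps(strip_outputs(notebook), indent=1))
```

Markdown cells have neither key, so they pass through unchanged.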

If I have notebooks in a repo which I want cleaned up before committing and/or diffing, I add a .gitattributes file with the following content:

*.ipynb filter=stripoutput
*.ipynb diff=ipynb

Please note that both together mean that the ipynb git diff engine never sees the output in a notebook (as the filter is run before the diff), so most of the above nbflatten.jq file is useless in that case (and even without the filter it would still not show up until you change “show_output 0” to “show_output 1”). But you can use it via an alias (in <cmder>\config\aliases) like

nbflat=jq -r -f "C:\Users\jschulz\Dropbox\Programme\cmder\vendor\jasc\nbflatten.jq" --arg show_output 1 $*

and then use it like nbflat whatever.ipynb | less to get a text representation.


I installed nbconvert into the main conda env: deactivate & conda install nbconvert

For pdf output, I installed miktex and pandoc:

  • miktex: latex environment. Installed via the portable installer and added to the path (via setx path c:\path\to\MIKTEX\miktex\bin;%path% in a cmd window, not cmder -> that way you have latex available in all programs and not only in a cmder window).
  • pandoc: converter between text formats (e.g. markdown to word or pdf). Also added to the path like miktex.

It has to go to the main path (not setup via cmder), as the way I startup a notebook server does not get the path additions from cmder…

Other stuff

  • everything: search for filenames (not content). Installed as a service and then put es.exe in a dir in %PATH% (e.g. <cmder>\bin). es whatever.py will now show all files with that name.
  • launchy: search and startup commands fast. Faster than Start-><search box>-><Enter>… I used that much more when I had WinXP installed. Nowadays, I have most programs added as a shortcut to the quickstart area.
  • Chrome with ublock (ad blocking) and The Great Suspender (suspend tabs which you haven’t touched in days so that they don’t waste resources).
  • sysinternals: procexplorer (graphical process explorer, replacement for the task manager). Setup to start as admin during windows startup. I also use autostarts from time to time to clean up the autostart entries.
  • Keepass 2: holds all my passwords, integrated with chrome via chromeIPass. The keepass file is synced via dropbox to my mobile (+ a keyfile which is manually transfered…).

final remarks

So, you can make yourself at home on windows as a (python) developer… Unfortunately, it seems that there are not a lot of people who do dev work on windows (based on the many projects which fail on windows when I check them out). If you want to make your project windows friendly: add Appveyor to your CI tests… :-)

Anyway: anything I missed to make my life on windows any easier?

How to refresh conda patches

Conda recipes can contain patches which are applied on top of the source for the package. When updating the package to a new upstream version, these patches need to be checked to see whether they still apply (or are still needed).

This is the way I do it currently (be aware that I work on windows, so you might need to change some slashes…)…


# makes the "patch" command available...
set "PATH=%path%;C:\Program Files\Git\usr\bin\"
# Update the latest source for matplotlib...
cd matplotlib 
git fetch 
git checkout origin/master
# conda package recipe for matplotlib is in ci\conda_recipe

Apply a patch

patch -i ci\conda_recipe\osx-tk.patch

The next step depends whether the patch applied cleanly or not. There are three possible outcomes:

  • The patch applied cleanly (e.g. no error message): nothing further to do, on to the next patch…
  • The patch is fuzzy (Hunk #1 succeeded at 1659 with fuzz 1 (offset 325 lines).) -> the patch only needs to be refreshed
  • The patch (or one of the hunks) didn’t apply (1 out of 1 hunk FAILED -- saving rejects to file matplotlibrc.template.rej) -> the patch needs to be redone and afterwards the patch needs to be refreshed
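
If you have a lot of patches, the three outcomes can be told apart from the text which patch prints. A small sketch (the function name is made up, and the patterns are only based on the two messages quoted above, so other patch versions might need different ones):

```python
import re

def classify_patch_output(output):
    """Classify the text printed by `patch` as 'failed', 'fuzzy' or 'clean'."""
    if re.search(r"hunks? FAILED", output):
        return "failed"   # redo the patch, afterwards refresh it
    if re.search(r"with fuzz \d+", output):
        return "fuzzy"    # the patch only needs to be refreshed
    return "clean"        # nothing further to do

print(classify_patch_output("Hunk #1 succeeded at 1659 with fuzz 1 (offset 325 lines)."))
print(classify_patch_output("1 out of 1 hunk FAILED -- saving rejects to file matplotlibrc.template.rej"))
print(classify_patch_output(""))
```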

For redoing the patch, look into the <patch>.rej file and apply similar changes to the source. Or check whether this patch is needed anymore…

For refreshing the patch, make sure that only the changes for the patch are currently included in your checked out copy (e.g. make sure that refreshed patches are git added before the next command…).

Then run the following command:

git diff --no-prefix > ci\conda_recipe\osx-tk.patch

[I actually used a different filename to pipe the patch to and then compared the output before overwriting the old patch…]

Demo mode for IPython (works in the notebook)

R has a demo mode, which lets you execute some demo of a function or a package. See e.g. demo(lm.glm) for such a thing.

A PR in IPython-extensions lets you do much the same:

It will get some demo code (which can be a function in a package or the matplotlib examples on github) and lets you execute that code by yourself. Specially formatted comments in the function will get turned into formatted text, if the frontend supports it. It works in the notebook by adding new cells with the demo content or in the qtconsole/ipython directly by presetting it as new input (simply press enter) until the demo is over.

Writing a demo

Writing a demo is as simple as writing a function in a module. Markdown formatting in comments is possible and works in the notebook. In the qtconsole/IPython, they are simply comments. This is the demo example:


def demo_example():
    """An example how to write a demo."""
    # ## Comments
    # Comments are interpreted as markdown syntax, removing the 
    # initial `# `. If a comment starts only with `#`, it is interpreted 
    # as a code comment, which will end up together with the code.
    #change your name:
    name = "Jan"
    print("Hello {0}!".format(name))
    # ## Magics
    # Using magics would result in not compiling code, so magics 
    # have to be commented out. The demo will remove the comment
    # and insert it into the cell as code.
    _sum = 0
    for x in range(10000):
        _sum += x
    # Print the sum:
    print(_sum)

# This lets the `demo(ipyext.demo)` find only the `demo_example`. 
# Only modules with that variable will display an overview of 
# the available demos.
__demos__ = [demo_example]
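
The comment rule from the example (a comment starting with `# ` becomes markdown, one starting with only `#` stays with the code) could be sketched like this. It is only my illustration of the rule, with a made-up function name, and not the actual code from the PR:

```python
def classify_demo_line(line):
    """Classify one line of a demo function body.

    Returns a (kind, text) tuple, where kind is 'markdown' or 'code'.
    """
    stripped = line.strip()
    if stripped.startswith("# "):
        # markdown comment: remove the initial "# "
        return ("markdown", stripped[2:])
    # everything else (including "#comment" without a space) stays code
    return ("code", line)

for line in ["    # ## Comments", "    #change your name:", '    name = "Jan"']:
    print(classify_demo_line(line))
```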

Demo of demo mode :-)

Here are some videos of it in action:

IPython qtconsole

[Video: IPython demo mode in qtconsole]

Jupyter Notebook (with IPython kernel)

[Video: IPython demo mode in jupyter notebook]

If you have any comments, or know of examples for a package (needs to be plain python files available on github -> like for matplotlib), please leave it below or in the PR. Thanks!

Automatic building of python wheels and conda packages

Recently I found the conda-forge project on github which makes it easy to auto build and upload your python project as a (native) conda package. Conda-forge introduces the concept of a “smithy” (a repository on github) which builds the conda packages for the main repository. A smithy connects to three different CI services to get builds for all three major platforms: Travis for Mac OS X, CircleCI for Linux and AppVeyor for Windows.

But not everyone is using conda environments, so I thought it would be nice to also build wheels with it. Turns out this is actually possible, you “only” need to hack the conda build scripts to also build a wheel and upload that to PyPI.

For those who just want the code: you can find it in the JanSchulz/package-builder repo on github. The smithy currently builds the conda packages and wheels for pypandoc. PRs welcome :-)

These were the broad steps to setup the smithy:

  • Setup a smithy repository: I copied the conda-forge/staged-recipes, which is setup to build multiple recipes, but required a little more “handholding” until I got it to run (the “one main repo, one smithy repo” case has helper scripts in conda-forge/conda-smithy, which will make the setup of the smithy a lot easier. So try that first if that fits your case…):
    • Add your own conda recipe: it’s usually three easy files in a subdir: one metadata file (meta.yaml) and one script for each windows (bld.bat) and unix-like platforms (build.sh). Take a look at some examples
    • Connect the various CI services to your github repo.
    • Get your anaconda access TOKEN via anaconda auth --create --name WhateverName --scopes "repos conda api" (I used a different token for each CI service). The examples in the conda-forge/staged-recipes files didn’t quite work, as I needed to add api access…
    • Add your anaconda access TOKEN to each CI service so that it is available in your build environment.
  • Hack your conda recipe to also build a wheel and upload that to PyPI. This is a bit more involved, as conda builds happen in a temporary environment and have their environment variables cleaned up. So:
    • Install twine in the environment, by adding pip install twine to the various CI setup scripts (unfortunately it’s not packaged for conda, so you can’t simply install it via meta.yaml).
    • Add your PyPI username and password as environment variables to each CI service.
    • Let conda build know that you want to have these two variables available during the conda build by adding them to the build -> script_env section of your meta.yaml.
    • Add a line to your build scripts to build a wheel (python setup.py bdist_wheel).
    • Generate a pypirc file so that the PyPI upload can happen. This is a bit tricky, as the build process has no access to the recipe directory and therefore you have to generate this file on the fly during build. On unix-like it’s a cat << EOF > pypirc\n...\nEOF, but on windows you have to use either a lot of echo ... >>pypirc or a trick with parentheses: ( echo ...; echo ... ...) > "pypirc". It seems that twine doesn’t work without such a file :-(.
    • Use twine to upload the package: this currently means that you have to add a username and password (using the added environment variables) to the commandline, so make sure that this line isn’t echo’ed to the CI log: use @twine ... in bld.bat and set +x; twine ...; set -x in build.sh.
    • I also added a test to build.sh to only build wheels on darwin, as it seems that PyPI does not accept linux wheels…
  • Fix all the errors you introduced and repush the repo… this step took a bit… :-/
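
As the build scripts already have python available, generating the pypirc file could also be done in python on all platforms instead of cat / echo. A sketch of that idea, which is not what the package-builder repo does; the function name and the environment variable names in the comment are placeholders:

```python
def render_pypirc(username, password):
    """Return the text of a minimal pypirc file for uploads to PyPI."""
    return (
        "[distutils]\n"
        "index-servers = pypi\n"
        "\n"
        "[pypi]\n"
        "username = %s\n"
        "password = %s\n"
    ) % (username, password)

# In a build script one would pass in the values from the CI environment
# variables (e.g. os.environ["PYPI_USER"], a hypothetical name) and write
# the result to a file. Here only a dummy version is printed.
print(render_pypirc("<username>", "<password>"))
```

Make sure that the real values never get printed like this, or they end up in the CI log.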

Now making a release works like this:

  • Release the package (in my case pypandoc) as usual.
  • Build and upload the sdist to PyPI.
  • Update the conda recipe for the new version.
  • Push the smithy repo with the recipe and let the CI services build the conda packages and the wheels.

The CI scripts will only build packages for which there are no current conda packages yet. If you need to redo the build because of some problems, you need to delete the conda packages for the affected builds or bump the package versions (you can set build versions for the conda packages without bumping the version of the upstream project).

If you have any feedback, please leave it in the comments (or as an issue in one of the above repos… :-) ).

The next step will be adding builds for R packages.

More functions for working with JSON data / nested structures

I updated the functions in my last blog post (renamed the functions and handled a few corner cases) and added a new convert_to_dataframe_input function:

# can be a dict or a list of structures
data = {"ID1":{"result":{"name":"Jan Schulz"}},
        "ID2":{"result": {"name":"Another name", "bday":"1.1.2000"}}}

converter_dict = dict(
    names = "result.name",
    bday = "result.bday"
)
import pandas as pd
print(pd.DataFrame(convert_to_dataframe_input(data, converter_dict)))
##   _index      bday         names
## 0    ID1       NaN    Jan Schulz
## 1    ID2  1.1.2000  Another name

The (updated) code can be found in the old blogpost.

Two functions for working with JSON/dicts

I recently had to explore a JSON API and came up with the following three functions to make working with the returned JSON/dict easier:

[Update 2015-11-10: you might like dripper, which does much of this code snippet…]

[Update 2015-09-26: updates to code and new convert_to_dataframe_input function: see here for a post about it]

_null = object()
def get_from_structure(data, name, default=_null):
    """Return the element with the given name.
    `data` is a structure containing lists, dicts or scalar values.
    A name is a '.' separated string which specifies the path in the data.
    E.g. '0.name.first' would return `data[0]["name"]["first"]`.
    If such a path does not exist and no default is given, a
    KeyError is raised. Otherwise, the default is returned.
    """
    names = name.split(".")
    for n in names:
        try:
            # numeric path parts are tried as an index first
            i = int(n)
            data = data[i]
        except (ValueError, KeyError, IndexError, TypeError):
            # everything else is looked up as a dict key
            data = data.get(n, _null) if isinstance(data, dict) else _null
        if data is _null:
            if default is _null:
                raise KeyError("Key not found: %s (%s)" % (n, name))
            else:
                return default
    return data

def find_in_structure(data, value):
    """Find the value in the data and return a name for that element.
    `value` is either found by "==" (elements are equal) or "in" (part of
    a string or other element in an iterable other than list).
    The name is a '.' separated path (string) suitable for `get_from_structure`.
    Raises a ValueError if the value is not found in data.
    """
    _stack = []
    def _find(data, stack):
        if data is None:
            return False
        if isinstance(data, list):
            for i, val in enumerate(data):
                # remember the path while descending, drop it on a miss
                stack.append(str(i))
                if _find(val, stack):
                    return True
                else:
                    stack.pop()
        elif isinstance(data, dict):
            for key, val in data.items():
                stack.append(str(key))
                if _find(val, stack):
                    return True
                else:
                    stack.pop()
        elif data == value:
            return True
        else:
            try:
                return value in data
            except TypeError:
                # scalars like numbers do not support "in"
                return False
        return False
    if _find(data, _stack):
        return ".".join(_stack)
    else:
        raise ValueError("Not found in data: %s" % (value,))

def convert_to_dataframe_input(data, converter_dict):
    """Convert the input data to a form suitable for pandas.DataFrame.
    Each element in data will be converted to a dict of key: values by using
    the functions in converter_dict. If fed to a pandas.DataFrame, keys
    in converter_dict will become the column names.
    If an element in converter_dict is not callable, it will be used
    as a name for `get_from_structure`. If the function raises an Exception,
    NA will be filled in.
    If data is a dict, the key will be used for a `_index` column,
    otherwise a running index is used.
    This function does not do any type conversions.
    """
    from functools import partial
    NA = float('nan')
    converted = []
    assert '_index' not in converter_dict, "'_index' is not allowed as a key in converter_dict"
    # turn plain path strings into lookup functions
    temp = {}
    for k, val in converter_dict.items():
        if not callable(val):
            temp[k] = partial(get_from_structure, name=val)
        else:
            temp[k] = val
    converter_dict = temp
    if isinstance(data, dict):
        gen = data.items()
    else:
        gen = enumerate(data)
    for index, item in gen:
        d = {"_index": index}
        for name, func in converter_dict.items():
            try:
                d[name] = func(item)
            except Exception:
                d[name] = NA
        converted.append(d)

    return converted


data = {"ID1":{"result":{"name":"Jan Schulz"}},
        "ID2":{"result": {"name":"Another name", "bday":"1.1.2000"}}}
print(find_in_structure(data, "Schulz"))
## ID1.result.name 
print(get_from_structure(data, find_in_structure(data, "Schulz")))
## Jan Schulz

And the DataFrame conversion:

converter_dict = dict(
    names = "result.name",
    bday = "result.bday"
)
import pandas as pd
print(pd.DataFrame(convert_to_dataframe_input(data, converter_dict)))
##   _index      bday         names
## 0    ID1       NaN    Jan Schulz
## 1    ID2  1.1.2000  Another name
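
The docstring of convert_to_dataframe_input mentions that the values in converter_dict can also be callables, which the example above doesn't show. Here is a small sketch of that idea. To keep it runnable on its own, it uses a condensed stand-in for get_from_structure (called get_path here, a made-up name) and a plain loop instead of the real function, so treat it as an illustration of the pattern and not as the code from this post.

```python
from functools import partial

_missing = object()


def get_path(data, name, default=_missing):
    """Condensed stand-in for get_from_structure (hypothetical helper)."""
    for part in name.split("."):
        if isinstance(data, list):
            try:
                data = data[int(part)]
            except (ValueError, IndexError):
                data = _missing
        elif isinstance(data, dict):
            data = data.get(part, _missing)
        else:
            data = _missing
        if data is _missing:
            if default is _missing:
                raise KeyError("Key not found: %s (%s)" % (part, name))
            return default
    return data


data = {"ID1": {"result": {"name": "Jan Schulz"}},
        "ID2": {"result": {"name": "Another name", "bday": "1.1.2000"}}}

# Each converter gets one element of data and returns the cell value.
converters = {
    # plain path lookup, the same thing a string entry is turned into
    "name": partial(get_path, name="result.name"),
    # a callable can post-process the looked-up value
    "first_name": lambda item: get_path(item, "result.name").split()[0],
    # a default avoids the exception for missing paths
    "bday": partial(get_path, name="result.bday", default=None),
}

rows = []
for index, item in data.items():
    row = {"_index": index}
    for column, func in converters.items():
        row[column] = func(item)
    rows.append(row)

print(rows)
```

The list in rows has the same shape as the return value of convert_to_dataframe_input, so it could be passed to pd.DataFrame in the same way.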

Someone might find this useful (and at least I can find it again :-) )