/var/

Various programming stuff

My essential django package list

In this article I’d like to present a list of django packages (add-ons) that I use in most of my projects. I have been using django for more than 5 years as my day-to-day tool for developing applications for the public sector organization I work for. So please keep in mind that these packages are targeted at an “enterprise” audience and some things aimed at public (open access) web apps may be missing.

Some of these packages are included in a django cookiecutter I use to start new projects, and you can find example usage of some of them in my Mailer Server project. I’ll also be happy to answer any questions about these, preferably on stackoverflow if the questions are a good fit for that site.

Finally, before going to the package list, notice that most of the following are django-related and I won’t include any generic python packages (with the exception of xhtml2pdf, which I discuss later). There are a bunch of generic python packages that I usually use (xlrd, xlwt, requests, raven, unidecode, database connectors etc) but these won’t be discussed here.

Packages I use in all my projects

These packages are used more or less in all my projects - I just install them when I create a new project (or create the project through the cookiecutter I mentioned above).

django-tables2

django-tables2, along with django-filter (below), are the two packages I always use in my projects. You create a tables.py module inside your applications where you define a bunch of tables using more or less the same methodology as with django forms: for a Book model, create a BookTable class, relate it with the model and, if needed, override some of the columns it automatically generates. For example you can configure a column that will behave as a link to the detail view of the book, a column that will behave as a date (with the proper formatting) or even a column that will display a custom django template snippet.
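
For example, a minimal tables.py for the Book model discussed above might look like the following sketch (the field names and the book_detail url name are assumptions for illustration):

import django_tables2 as tables
from .models import Book

class BookTable(tables.Table):
    # Render the title as a link to the (assumed) book_detail view
    title = tables.LinkColumn('book_detail', args=[tables.A('pk')])

    class Meta:
        model = Book
        fields = ('title', 'author', 'publication_date')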

You’ll then be able to configure the data (queryset) of this table and pass it to your view’s context. Finally, use {% render_table table %} to actually display your table. What do you get? The actual html table (with a nice style that can be configured if needed), free pagination and free column-header sorting. After you get the hang of it, you’ll be able to add tables for your models in a couple of minutes - this is DRY for me. It also offers some class based views and mixins for adding tables to your views in an even more DRY way.

Finally, please notice that since version 1.8, django-tables2 supports exporting data using tablib, an old requirement of most users of this library.

django-filter

django-filter is used to create filters for your data. It plays well with django-tables2 but you can use it to create filters for any kind of listing you want. It works similarly to django-tables2: create a filters.py module in your application and define a BookFilter class there by relating it with the Book model, specifying the fields of the model you want to filter on and maybe overriding some of the default options. Newer versions have some great built-in configuration functionality, letting you customize which lookup will be used for filtering or even add multiple lookups per field - for example you could define your BookFilter like this:

import django_filters

from .models import Book


class BookFilter(django_filters.FilterSet):
    class Meta:
        model = Book
        fields = {
            'name': ['icontains'],
            'author__last_name': ['icontains', 'startswith'],
            'publication_date': ['year', 'month', ],
        }

which will give you the following filter form fields:

  • an ilike ‘%value%’ for the book name
  • an ilike ‘%value%’ for the author name
  • a like ‘value%’ for the author name
  • a year(publication_date) = value for the publication_date
  • a month(publication_date) = value for the publication_date

and their (AND) combinations!

The BookFilter can be used to create a filter form in your template; in your views you pass it the initial queryset along with request.GET (which contains the filter values) to get back the filtered data (which you usually then pass to the table). I’ve created a sample project that uses both django-tables2 and django-filter for you to use: https://github.com/spapas/django_table_filtering. Also, I’ve written an article which describes a technique for automatically creating a filter-table view.
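
As a rough sketch of how this wiring usually looks in a function based view (the template and module names are assumptions; see the linked sample project for a complete version):

from django.shortcuts import render
from .filters import BookFilter
from .tables import BookTable
from .models import Book

def book_list(request):
    # request.GET carries the submitted filter values (if any)
    filter = BookFilter(request.GET, queryset=Book.objects.all())
    # Feed the filtered queryset to the table
    table = BookTable(filter.qs)
    return render(request, 'books/book_list.html', {'filter': filter, 'table': table})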

django-crispy-forms

The forms that django creates by default are very basic and not styled properly. To overcome this and get better-looking forms, I always use django-crispy-forms. It actually has two modes: using the crispy template filter and using the crispy template tag. Using the crispy template filter is very simple - just take a plain old django form and render it in your template like this: {{ form|crispy }}. If django-crispy-forms has been configured correctly (with the correct template pack) the form you’ll get will be much nicer than the django default one. This is completely automatic, you don’t need to do anything else!

Now, if you have some special requirements for a form, for example multi-column rendering, adding tabs, accordions etc, then you’ll need to use the {% crispy %} template tag. To use this you must create the layout of your form in the form’s constructor using the FormHelper API of django-crispy-forms. This may seem cumbersome at first (why not just create the form’s layout in the django template?) but defining your form’s layout in a class has other advantages: all the layout lives in the actual form (not in the template), you can control the layout programmatically (e.g. display some fields only for administrators), you can use inheritance and virtual methods to override how a form is rendered, etc.

To help you understand what a FormHelper looks like, here’s a form that is used to edit an access Card for visitors and displays all fields horizontally inside a panel (I am using Bootstrap for all styling and layout purposes):

from django.forms import ModelForm
from crispy_forms.helper import FormHelper
from crispy_forms.layout import Layout, Fieldset, HTML, Div, Submit
from crispy_forms.bootstrap import FormActions

from .models import Card


class CardForm(ModelForm):
    class Meta:
        model = Card
        fields = ['card_number', 'card_text', 'enabled']

    def __init__(self, *args, **kwargs):
        self.helper = FormHelper()
        self.helper.form_method = 'post'
        self.helper.layout = Layout(
            Fieldset(
                '',
                HTML(u"""<div class="panel panel-primary">
                   <div class="panel-heading">
                       <h3 class="panel-title">Add card</h3>
                   </div>
                   <div class="panel-body">"""),
                Div(
                   Div('card_number', css_class='col-md-4'),
                   Div('card_text', css_class='col-md-4'),
                   Div('enabled', css_class='col-md-4'),
                   css_class='row'
                ),
            ),

            FormActions(
                Submit('submit', u'Save', css_class='btn btn-success'),
                HTML(u"<a class='btn btn-primary' href='{% url \"card_list\" %}'>Return</a>" ),
            ),
            HTML(u"</div></div>"),
        )

        super(CardForm, self).__init__(*args, **kwargs)

Notice that forms rendered with a FormHelper include their own <form> tag (you don’t need to write it yourself like with plain django forms), so you have to define their method (post in this example) and submit button.

django-extensions

django-extensions is a swiss-army-knife of django tools. I always use it in my projects because of the runserver_plus and shell_plus commands. The first uses the Werkzeug debugger with django, which makes django development an absolute joy (open a python shell wherever you want in your code and start writing commands)! The second opens a better shell (your models and a bunch of django stuff are auto-imported, a better shell will be used if found etc).

runserver_plus and shell_plus alone would be more than enough for me to use this package, however it adds some more useful management commands like admin_generator to quickly create an admin.py for your app, graph_models to generate a graphviz dot file of your models, update_permissions to synchronize the list of permissions if you have added one to an existing model, and many, many others. Take a look at them and you’ll probably find more useful things!

django-autocomplete-light

django-autocomplete-light is the best auto-complete library for django, especially after v3 was released (which greatly reduced the magic and uses normal CBVs for configuring the querysets). You create an autocomplete view for a model (similar to your other class based views) and then use this view through a widget in the admin or in your own forms. It is fully configurable (both the results and the selection templates), supports many to many fields, creating new instances and even autocompleting django-taggit tags! If for some reason it seems that it is not working, keep in mind that you need to include jquery and {{ form.media }} in your templates or else the required client side code won’t be executed.
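
As a rough sketch of what such a view looks like in v3 (the Author model and the field used for filtering are assumptions):

from dal import autocomplete
from .models import Author

class AuthorAutocomplete(autocomplete.Select2QuerySetView):
    def get_queryset(self):
        qs = Author.objects.all()
        # self.q holds whatever the user has typed so far
        if self.q:
            qs = qs.filter(last_name__istartswith=self.q)
        return qs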

I think it is essential in all cases because dropdowns with lots of choices have a very bad user experience - and the same is true for many to many fields (you could use the checkboxes widget to improve their behavior a little, but it will still be bad when there are many choices).

django-reversion

django-reversion is a really important package for my projects. When it is configured properly (by adding a reversion middleware), it offers full auditing of all changes to the model instances you select (and properly groups them when multiple instances change in a single request). It saves a JSON representation of every version of an instance in your database. Keep in mind that this may increase your database size but if you need full auditing then it is probably the best way to do it. I have written an article about django model auditing that discusses this package and django-simple-history (below) in more detail.

django-compressor

django-compressor is a package that combines and minifies your css and javascript (both files and inline snippets) into static files. There are other tools for this but I have never used them since django-compressor covers my needs. Although I’ve written about browserify and friends from the node.js world, I don’t recommend using such tools in django to combine and minify your javascript and css unless you specifically require them.

It has an online and an offline mode. In the online mode, when a request comes in it will check if the compressed file exists and, if not, it will create it. This may lead to permission problems if your application server user cannot write to your static folders, and your users will see exceptions if for some reason you have included a file that cannot be found. In the offline mode, you run a management command that creates the static files while deploying the application - this mode is recommended because any missing-file problems will be caught during deployment.

django-debug-toolbar

django-debug-toolbar: this is a well known package for debugging django apps that is always included in the development configuration of my projects. It has various panels that help you debug your application but, at least for me, the most helpful one displays all SQL queries that are executed for a page load. Because of how the django orm works, it will happily follow all relations, which can lead to hundreds of queries. For example, let’s say that you have a simple Book model with a foreign key to an Author model and N books in your database. If you do a Book.objects.all() and display the author name for each book in a template then you’ll do N+1 queries to the database! This is really easy to miss because in the template you’ll just write {% for book in books %}{{ book.name }}, {{ book.author.name }}{% endfor %} - however the {{ book.author.name }} will trigger an extra SQL query for every book!!! Such cases are easily resolved by using select_related (and prefetch_related), but you must be sure to use select_related for all your queries (and if you add some extra things to your template you must remember to also add them to the select_related clause of the query).
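
A minimal sketch of the fix for the Book/Author example above (assuming the models described in the text):

from .models import Book

# One extra query per book when the template touches book.author:
books = Book.objects.all()

# A single JOIN instead - the author rows come back together with the books:
books = Book.objects.select_related('author')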

So, what I recommend before going to production is to visit all your pages with django-debug-toolbar enabled and take a quick look at the number of SQL queries. If you see something that does not make sense (for example more than 10 queries) then you’ll need to think about the problem I just mentioned. Please notice that this, at least for me, is not premature optimization - it is not actually optimization at all! It is about writing correct code. Suppose you could not use the django orm any more and had to write plain old SQL queries. Would you write SELECT * FROM books and then, for each row, another SELECT * FROM authors WHERE id = ? passing the author of each book, or would you just do SELECT * FROM books b LEFT JOIN authors a ON b.author_id = a.id?

Packages I use when I need their functionality

The packages below are also essential to me, but only when I need their functionality. I don’t use them in all my projects, but when I need the capabilities they offer I will use these packages (and not some others). For example, if I need to render PDFs in my applications I will use xhtml2pdf, if I need to log in through LDAP I will use django-auth-ldap, etc.

xhtml2pdf

xhtml2pdf is the package I use for creating PDFs with django, as I’ve already discussed in the PDFs in Django article (this is not a django-specific package like most others I discuss here, but it plays really well with django). You create a normal django template, add some styling to it, render it to html and convert that to PDF. Notice that there’s a django-xhtml2pdf project but it has not been updated recently, and as you can see in my article it is easy to just call xhtml2pdf directly. The xhtml2pdf library is actually a wrapper around the excellent reportlab library which does the low-level pdf output.
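
A minimal sketch of calling xhtml2pdf directly from a django view (the template name and context are assumptions; see the linked article for a fuller treatment):

from django.http import HttpResponse
from django.template.loader import render_to_string
from xhtml2pdf import pisa
from .models import Book

def book_pdf(request):
    html = render_to_string('books/book_pdf.html', {'books': Book.objects.all()})
    response = HttpResponse(content_type='application/pdf')
    # pisa converts the rendered html and writes the PDF bytes into the response
    pisa.CreatePDF(html, dest=response)
    return response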

Notice that the xhtml2pdf library had some maintenance problems (that’s why some people suggest other PDF solutions like WeasyPrint), however these seem to have been fixed now. Also, I have found that, at least for my needs (using Windows as my development environment), other solutions are much inferior to xhtml2pdf. I urge you to try xhtml2pdf first and only if you find that it does not cover your needs (and have asked me about your problem) try the other solutions.

django-auth-ldap

django-auth-ldap is the package you’ll want to use if your organization uses LDAP (or Active Directory) and you want to use it for logging in. Just configure your LDAP server settings, add the LDAP authentication backend and you’ll be ready to go. Please notice that this package is a django wrapper of the python-ldap package which actually provides the LDAP connection.

django-localflavor

django-localflavor offers useful stuff for various countries, mainly form fields with the correct validation and lists of choices. For example, for my country (Greece) you’ll get a GRPhoneNumberField, a GRPostalCodeField and a GRTaxNumberCodeField. Use it instead of re-implementing the behavior.

django-constance

django-constance is a simple package that enables you to add quickly-configurable settings to your application. To change the settings.py file you need to edit the source and restart the application - for most installations this is a full re-deployment of the application. Fully re-deploying the app just to change a setting is not good practice (depending on the setting of course, but if it is a business setting it usually should be changed by business users and not by administrators).

That’s where django-constance comes to help. You define some extra settings which can be changed through the django admin and their new value is available immediately. You can also configure where these settings are stored. One option is the database, but this is not recommended - instead you can use redis so that the settings values are available much more quickly!
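
A minimal configuration sketch, assuming the redis backend and made-up setting names:

# settings.py
CONSTANCE_REDIS_CONNECTION = {'host': 'localhost', 'port': 6379, 'db': 0}
CONSTANCE_CONFIG = {
    # name: (default value, help text)
    'MAX_CARDS_PER_VISITOR': (3, 'How many access cards a visitor may hold'),
    'HELP_DESK_EMAIL': ('helpdesk@example.com', 'Address displayed on error pages'),
}

# anywhere in your code the current value is then available as:
# from constance import config
# config.MAX_CARDS_PER_VISITOR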

django-rq

django-rq is a django wrapper for the rq library. I use it when I need asynchronous tasks (which is the case in almost all of my projects). More info can be found in the two articles I have written about django-rq (asynchronous tasks in django and django-rq redux).

django-rules-light

One of the least known packages in this list, django-rules-light is one of the most useful when it is needed. This package allows you to define complex rules for doing actions on model instances. Each rule is a function that gets the user that wants to do the action and the object that the user wants to act on. The function returns True or False to allow or deny the action. You can then use these rules both in your code, to programmatically check if the user can do the action, and in your templates, to decide which buttons and options you will display. There are also various helper methods for CBVs that make everything easier.

To properly understand the value of django-rules-light you need to have action rules that are more complex than usual. For example, if the actions on an object are view / edit and all your users can view and edit their own objects, then you don’t really need this package. However, if your administrators can view all objects and an object can be finalized so that no changes are allowed unless it is an administrator making them, then you’ll greatly benefit from using it!

django-generic-scaffold

django-generic-scaffold is a package I have created that can be used to quickly (and DRYly) create CRUD CBVs for your models. I usually don’t want to give non-technical users access to the django admin, however sometimes I want to quickly create the required CBVs for them (list, detail, create, edit, delete). Using django-generic-scaffold you just create a scaffold which is related to a Model and all the views will be automagically created - you only need to link them to your urls.py. The created CBVs are fully configurable by adding extra mixins or even changing the parent class of each CBV.

Notice that this package does not create any source files - instead all CBVs are created on-the-fly using type. For example, to create CRUD CBVs for a Book model you’ll do this in scaffolding.py:

class BookCrudManager(CrudManager):
    model = models.Book
    prefix = 'books'

and then in your urls.py you’ll just append the generated urls to your url list:

book_crud = BookCrudManager()
urlpatterns += book_crud.get_url_patterns()

Now you can visit the corresponding views (for example /books or /bookscreate - depends on the prefix) to add/view/edit etc your books!

django-taggit

django-taggit is the library you’ll want to use if you need tags on your models. A tag is a synonym for keyword, i.e. adding some words/phrases to your instances that are used to categorize and describe them. The relation between your to-be-tagged model and your tags is many to many. To use it, you just add tags = TaggableManager() to your model and you are ready to go! Of course it will need some more configuration to be included in the django admin and django forms but thankfully django-autocomplete-light can be integrated with django-taggit!
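
A minimal sketch for the Book example used earlier:

from django.db import models
from taggit.managers import TaggableManager

class Book(models.Model):
    name = models.CharField(max_length=255)
    tags = TaggableManager(blank=True)

# Usage: book.tags.add('python', 'django'); list(book.tags.names())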

django-sendfile

django-sendfile is a very important - at least to me - library. Sometimes, user uploaded files (media in django) should not be visible to all users, so you’ll need to implement some access control through your django app. However, it is important not to serve these media files through your application server (uwsgi, gunicorn etc) but to use a web server (nginx, apache etc) for serving them. Your application server’s purpose is not serving files from the disk - keep in mind that the application server usually has a fixed number of workers (usually analogous to the number of CPUs of your server, for example 4 workers) - think what will happen if some large media files are served through these workers to users with a slow connection! With 4 such concurrent connections your application won’t be able to serve any other content!

So this package (along with X-Sendfile support in the web server) helps you fulfill the above requirements: it allows you to check permissions for your media through your django application but then offload the serving of the media files to the web server. More info about django-sendfile can be found in this SO answer, but in a few words: with django-sendfile you create a view that checks if a file is allowed to be served and, if yes, instructs the web server to actually serve that file by appending a specific header to the response.
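
A minimal sketch of such a view (the Document model and the ownership check are assumptions; the SENDFILE_BACKEND and related settings still need to be configured for your web server):

from django.core.exceptions import PermissionDenied
from django.shortcuts import get_object_or_404
from sendfile import sendfile
from .models import Document

def serve_document(request, pk):
    doc = get_object_or_404(Document, pk=pk)
    # Your own access control rule goes here
    if doc.owner != request.user:
        raise PermissionDenied
    # Delegate the actual file transfer to the web server (X-Sendfile / X-Accel-Redirect)
    return sendfile(request, doc.file.path)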

django-reversion-compare

django-reversion-compare is an addon to django-reversion. It allows you to compare two different versions of a model instance and highlights their differences using various smart algorithms (so if you have a long text field you won’t only see that these are different but you’ll also see where exactly they differ, with output similar to the one you get when using diff).

django-simple-history

django-simple-history has capabilities similar to django-reversion (auditing and keeping versions of models) with a very important difference: while django-reversion keeps a JSON representation of each version in the database (making querying very difficult), django-simple-history creates an extra history table for each model you want to track and adds each change as a new row to that table. As you can imagine this can make the history table really huge, but it has the advantage that you can easily query for old values. I usually use django-reversion unless I know that I will need to query the history.

django-import-export

django-import-export can be used to enhance your models with mass import and export capabilities (for example from/to CSV). You add a ModelResource class that describes (and configures) how your Model should be imported/exported. The ModelResource class can then be easily used in your views and, more importantly, it is integrated with the django admin. I have to confess that I have not used django-import-export for importing data because I prefer implementing my own views for that (to have better control over the whole process, and because the data I need to import usually does not map 1 to 1 to a model - I need to create more model instances etc). However I am using the export capabilities of django-import-export in various projects with great success, especially the admin integration which easily covers the data exporting needs of most users.
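
A minimal sketch of the export side for the Book example, including the admin integration (the field names are assumptions):

from django.contrib import admin
from import_export import resources
from import_export.admin import ExportMixin
from .models import Book

class BookResource(resources.ModelResource):
    class Meta:
        model = Book
        fields = ('id', 'name', 'author__last_name', 'publication_date')

class BookAdmin(ExportMixin, admin.ModelAdmin):
    resource_class = BookResource

admin.site.register(Book, BookAdmin)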

django-extra-views

Beyond django forms, django supports a feature called Formsets which allows you to add/edit multiple object instances in a single view (by displaying all instances one after the other). The classic request/response cycle of django is preserved in Formsets, so your form instances are submitted all together when you submit the form. The logical extensions of the Formset are the ModelFormset, where each form in a Formset is a ModelForm, and the InlineFormSet, where you have a parent model that has a bunch of children and you edit the parent and its children in a single form. For example, say you have a School and a Student model where each Student has a ForeignKey to School. The usual case would be to edit the Student model and select her school (through a dropdown or, even better if you use django-autocomplete-light, through a proper autocomplete widget). However, you may for some reason want to edit the School and display (and edit) the list of its Students - that’s where you’d use an InlineFormSet!

The above features (Formsets, ModelFormsets and InlineFormsets) are not natively supported by django CBVs - that’s where django-extra-views comes to the foreground. You can use the corresponding CBVs of django-extra-views to support the multiple-form workflows described above. Using these CBVs is more or less similar to using the good old django FormView.

easy-thumbnails

easy-thumbnails is a great library if you want to support thumbnails. Actually, “thumbnails” is not 100% accurate - this package can be used to generate and manage various versions of your original images. For example you may have a small version of the image that will be used as a thumbnail, a larger version that will be used in a gallery-carousel view and an even larger version (but not the original one, which could be huge) that will be used when the user clicks on the gallery to view a larger version of the image. To do that you define the image configurations you support in your settings.py and then you have access to your thumbnails both in your templates and in your views. Notice that a specific thumbnail configuration for an image is generated only once, since the generated images are saved: each thumbnail is generated on the first request that needs it and reused in the following requests.
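
A minimal sketch of such a configuration (the alias names and sizes are assumptions):

# settings.py
THUMBNAIL_ALIASES = {
    '': {  # the empty key applies these aliases to all models
        'small': {'size': (80, 80), 'crop': True},
        'gallery': {'size': (400, 300), 'crop': True},
        'large': {'size': (1024, 768)},
    },
}

# in a template: {% load thumbnail %} and then {{ photo.image|thumbnail_url:'gallery' }}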

django-rest-framework

django-rest-framework is definitely the best package for creating web APIs with django. I don’t recommend using it if you just want a quick JSON search API (take a look at django non-HTML responses), but if you want to go the SPA way or if you want to create multiple APIs for various models then this is the way to go. Integrating it into your project will take some effort (that’s why I don’t recommend it for quick-and-dirty APIs): you’ll need to create a serializers.py which defines the serializers (more or less the fields) for the models you want to expose through your API, then create the views (or the viewsets, which are families of views, for example list, detail, delete, update, create) and add them to your urls.py. You’ll also need to configure authentication, authorization and probably filtering and pagination. This may seem like a lot of work but the results are excellent - you’ll get a full REST API supporting create, list, detail, update and delete for any complex configuration of your models. You can take a look at a sample application in my React tutorial repository (yes, this is a repository with a tutorial for React and friends, but the back-end is django and django-rest-framework).
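
A minimal sketch of the moving parts for the Book example (the field names are assumptions):

from rest_framework import serializers, viewsets, routers
from .models import Book

class BookSerializer(serializers.ModelSerializer):
    class Meta:
        model = Book
        fields = ('id', 'name', 'author', 'publication_date')

class BookViewSet(viewsets.ModelViewSet):
    queryset = Book.objects.all()
    serializer_class = BookSerializer

router = routers.DefaultRouter()
router.register(r'books', BookViewSet)
# in urls.py: urlpatterns += [url(r'^api/', include(router.urls))]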

django-rest-framework integrates nicely with django-filter (mentioned above), so you can re-use the filters you have created for your model listings in your REST APIs - DRY at its best!

django-waffle

django-waffle is described as a feature flipper. What does this mean? Let’s say that you want to control at will when a specific view will be enabled - this is the library you’ll want to use. Or you have developed a new feature and you want to give access to it only to a subset of users for a pilot run - once again you should use django-waffle. It offers a nice admin interface where you can configure the flags that will be used for enabling/disabling the various features (and whether they are active or not) and various template tags and functions that you can use to test whether a feature should be activated or not.
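
A minimal usage sketch (the flag name is an assumption; the flag itself is created and toggled through the admin):

import waffle
from django.http import Http404
from django.shortcuts import render

def new_reports(request):
    # Hide the view completely unless the flag is active for this user
    if not waffle.flag_is_active(request, 'new_reports'):
        raise Http404
    return render(request, 'reports/new_reports.html')

# in a template: {% load waffle_tags %} {% flag "new_reports" %} ... {% endflag %}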

django-allauth

django-allauth should be used in two cases: when you want a better user registration workflow than the default (non-existent) one, or when you want to integrate your application with an external OAuth provider (i.e. allow your users to log in through their facebook, google, twitter, github etc accounts). I have mainly used this package for the first case and I can confirm that it works great and that you can create flows as complex as you want. For example, in one of my projects I have the following user registration-activation flow: a user registers using a custom form with his email as username; he receives an email with a confirmation link; after he has confirmed his email he receives a custom message to wait for his account activation and the administrators of the application are notified; the administrators enable the new user’s account after checking some things, and only then is he able to log in. One thing that must be noted about django-allauth is that (in my opinion) it does not have very good documentation, but there are lots of answers about django-allauth on stackoverflow and the source code is very clear, so you can always use the source as documentation for this package.

django-modeltranslation

django-modeltranslation is the library I recommend when you want translations for your models. To use it you add a translation.py file where you declare the models and the fields that should be translated. Then, depending on which languages you have configured in your settings.py, after you run makemigrations and migrate you’ll see that django-modeltranslation has added extra columns to the database, one per language (for example if you have added a field name to the translations and have english and greek as languages, django-modeltranslation will add the fields name_en and name_el to your table). You can then edit the i18n fields (using forms or the django admin) and, depending on the current language of your site, when you use name you’ll get either name_el or name_en.
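
A minimal sketch of such a translation.py for the Book example:

from modeltranslation.translator import register, TranslationOptions
from .models import Book

@register(Book)
class BookTranslationOptions(TranslationOptions):
    # After makemigrations/migrate this adds name_en, name_el etc columns,
    # and book.name transparently follows the active language.
    fields = ('name',)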

django-widget-tweaks

If for some reason you don’t want to use django-crispy-forms, or you have a form in which you want to make a specific layout change without fully implementing a FormHelper, then you can render the form in HTML and output the fields one by one. One thing that cannot easily be done, though, is passing custom options to the rendered form field. When you use {{ form.field }} in your template, django renders the form field using its default options - yes, this can be overridden using custom widgets, but I don’t recommend that if all you want is, for example, to add a class to the rendered <input>!

Instead, you can use django-widget-tweaks to pass some specific class names or attributes to the rendered form fields - so if you use {{ form.field|add_class:"myclass" }} the rendered <input> will have a myclass css class.

django-simple-captcha

Use django-simple-captcha to add (configurable) captchas to your django forms. This is a very simple package that has no requirements beyond the Pillow library for the captcha image generation. The generated captchas are simple images with some added noise, so it won’t look like the reCAPTCHA you may be more familiar with. I deliberately propose this package for captchas so you won’t need to integrate with Google services.

wagtail

wagtail is a great django CMS. I use it when I need to create a CMS or I need to add CMS like capabilities to a project. It has many capabilities, too many to be listed here. I urge you to try it if you need a CMS!

Conclusion

The above packages should cover most of your django needs. I have only listed packages with good documentation that have been recently updated, work with new django versions and should be fairly easy to integrate into your projects. If you need anything more, or want to take a general look at the packages that are available, I recommend starting with the django packages site.

One important thing to notice here is that some of the above packages are not really complex and their functionality could be re-implemented by you in a couple of hours. For example, you could replicate the functionality of django-constance by adding a config dict and a couple of methods (and template tags) for storing and retrieving the keys of that dict with redis. Or add some custom clean methods to your forms instead of using the form fields from django-localflavor. Also, some of these packages have similar functionality and could be combined (along with a little custom code) to replicate the functionality of other packages; for example, instead of using django-waffle you could use django-constance to configure whether features should be enabled or disabled and django-rules-light to control whether users have access to a feature. Or you could probably use django-waffle for access control, i.e. allow only admins to access specific views.

Please don’t do this. It violates DRY and it violates discipline. Each package has its purpose and being DRY means that you use it for its purpose, not re-implementing it and not re-using it for other purposes. When somebody (or you, after some months) sees that package in requirements or INSTALLED_APPS, he will conclude that you are using it for its intended purpose and thank you because you have saved him some time - please don’t make him waste his time by forcing him to read your source code to understand any smart tricks or reinvented wheels.

Automatically create a category table in Postgresql by extracting unique table values

Recently I was given an SQL table containing some useful data. The problem with that table was that it was not properly normalized but completely flat, i.e. all its columns contained varchar values while some of them should instead have contained foreign keys to other tables. Here’s an example of how this table looked:

Parts
ID Serial Category Manufacturer
1 423431 Monitor LG
2 534552 Monitor LG
3 435634 Printer HP
4 534234 Printer Samsung
5 234212 Monitor Samsung
6 123123 Monitor LG

The normalized version of this table should have been instead like this:

Parts
ID Serial Category_id Manufacturer_id
1 423431 1 1
2 534552 1 1
3 435634 2 2
4 534234 2 3
5 234212 1 3
6 123123 1 1

with the following extra tables that contain the category values with proper foreign keys:

Category
ID Name
1 Monitor
2 Printer
Manufacturer
ID Name
1 LG
2 HP
3 Samsung

The normalized version should be used instead of the flat one for reasons which at this moment must be all too obvious.

Having such non-normalized tables is also a common problem I experience with Django: when creating a model I usually define some CharField with pre-defined choices which “will never change”. Sometime during development (if I am lucky) or when the project has been live for years, I will be informed that the choices not only need to change, but must be changeable by the users of the site without touching the source code! Or that the choices/categories have properties (beyond their name) that need to be defined and used in the project. Both of these cases mean that these categories need to be extracted from simple strings to full Django models (i.e. get normalized into their own table)!

In this post I will present a function written in PL/pgSQL that will automatically normalize a column of a flat table like the previous one. Specifically, using the previous example, if you have a table named part that has a non-normalized column named category, then when you call select export_relation('part', 'category') the following will happen:

  • A new table named part_category will be created. This table will contain two columns, id and name, with id being the primary key and name having a unique constraint. If the table exists it will be dropped and re-created

  • A new column named category_id will be added to part. This column will be a foreign key to the new table part_category.

  • For each unique value v of category:
    • Insert a new record in part_category with v in the name field and save the inserted id to current_id
    • Set category_id to current_id for all rows of part where category has the value v

Before diving into the PL/pgSQL script that performs the above changes, I’d like to note that I am not very experienced with PL/pgSQL since I rarely use it (I actually avoid writing code in the database); however, because the case I described is ideal for a database script, I’ve bitten the bullet and implemented it.

Beyond its actual functionality, this script can be used as a reference/cookbook for common PL/pgSQL tasks:

  • Create/define a PL/pgSQL function (stored procedure)
  • Declare variables
  • Assign values to variables
  • Execute SQL commands with variable defined table / column names
  • Log progress in PL/pgSQL
  • Executing code conditionally
  • Loop through the rows of a query
  • Save the primary key of an inserted row

The script works, however I feel that more experienced PL/pgSQL developers would write things differently - if you have any proposals please comment and I’ll be happy to incorporate them into the script.

Now, let’s now take a look at the actual script:

CREATE OR REPLACE FUNCTION export_relation(table_name varchar, column_name varchar)
RETURNS void AS $$
DECLARE
    val varchar;
    res boolean;
    new_table_name varchar;
    new_column_name varchar;
    current_id int;
BEGIN
    new_table_name := table_name || '_' || column_name;
    new_column_name := column_name || '_id' ;

    execute format ('drop table if exists %s cascade', new_table_name);
    execute format ('CREATE TABLE %s (id serial NOT NULL, name character varying(255) NOT NULL, CONSTRAINT %s_pkey PRIMARY KEY (id), CONSTRAINT %s_unique UNIQUE (name))WITH ( OIDS=FALSE)',
        new_table_name, new_table_name, new_table_name
    );

    execute format('select exists(SELECT column_name  FROM information_schema.columns WHERE table_name=''%s'' and column_name=''%s'') as x', table_name, new_column_name) into res;
    if res is false then
        raise notice 'Creating column';
        execute format ('ALTER TABLE %s ADD COLUMN %s integer null', table_name, new_column_name);
        execute format ('ALTER TABLE %s ADD CONSTRAINT fk_%s FOREIGN KEY (%s) REFERENCES %s(id)', table_name, new_column_name, new_column_name, new_table_name);
    end if ;

    for val IN execute format ('select %s from %s group by(%s)', column_name, table_name, column_name) LOOP
        RAISE NOTICE 'Inserting new value % ...', val;
        execute format ('insert into  %s(name) values (''%s'') returning id', new_table_name, val) into current_id;
        raise notice 'Created ID %', current_id;
        execute format ('update %s set %s = %s where %s = ''%s''', table_name, new_column_name,current_id , column_name, val );
    END LOOP;

    /* Uncomment this if you want to drop the flat column
    raise notice 'Dropping column';
    execute format ('alter table %s drop column %s', table_name, column_name);
    */

END;
$$ LANGUAGE plpgsql;

The first line creates or replaces the function. You can just copy this script into an SQL window and run it as many times as you like (making changes between runs) and the function will always be updated. The function that is created is actually a procedure since it returns void, and it takes two parameters. The DECLARE section that follows contains all the variables that are used in the script:

  • val is the current value of the category when looping through their values
  • res is a boolean variable used for a conditional
  • new_table_name is the name of the table that will be created
  • new_column_name is the name of the column that will be added to the old table
  • current_id is the id of the last inserted value in the new table

After the BEGIN the actual procedure starts: first the values of new_table_name and new_column_name are initialized to be used throughout the code, and then the new table is dropped (if it exists) and re-created. Notice the execute format(parameter) calls that execute the SQL contained in the parameter, which is a string constructed using the variables we’ve defined. The next lines check if the old table already has the new column (i.e. category_id), save the result in the res variable and, if the column does not exist, add it to the old table along with its foreign key constraint.

A loop enumerating all unique values of the category column of the old table is then executed. Notice that val will contain a single value since the SQL that is executed returns a single column (that’s why it is declared as varchar). If we returned more than one column from the select, val could be declared as record and its properties accessed through dot notation (val.prop1 etc). Each value is inserted into the newly created table using the insert into table (name) values (...) returning id syntax (so that the new id is returned - this is an insert/select hybrid command) and saved in the current_id variable. The current_id variable is then used to update the new column that was added to the old table with the proper foreign key value.

Notice that I’ve commented out some code at the end - if you want you can uncomment it and the old (flat) column will be dropped, so in my example the category column would be removed since category_id is enough to find the name of each category. I recommend uncommenting this and actually dropping the column, because if you keep both category and category_id their values are going to get out of sync and, since you’ll have duplicate information, your table will be even less normalized. You can of course keep the column at first to make sure that the script works as you expect, since as long as the column is not dropped you can easily return to the previous state of the database by removing the new table and column.

To call it just run select export_relation('part', 'category') and you should see some debug info in the messages tab. When the script is finished you’ll have the part_category table and a category_id column in the part table.

Creating custom components for ag-grid

Recently I had to integrate an application with a javascript (client-side) excel-like grid. After some research I found out that ag-grid is (at least for me) the best javascript grid library. It has an open source version with an MIT license so it can safely be used by all your projects, and a commercial, enterprise version that includes a bunch of extra features. The open source version should be sufficient for most projects, however I really recommend buying the commercial version to support this very useful library.

The greatest characteristic of ag-grid in my opinion is its openness and an API that enables you to extend it to support all your needs! An example of this API will be presented in this article where I will implement two components that can be included in your grid:

  • An array editor through which you will be able to create cells that can be used to insert a list of values. For example, if you have a grid of employees, you will be able to add a “children” cell that will contain the names of each employee’s children properly separated (one line per child)
  • An object editor through which you will be able to create cells that can be used to insert flat objects. For example, for the grid of employees, you may add an “address” cell that, when edited, will be expanded to separate fields for Address, City, Zip code and Country.

A common ground

First of all, let’s create a simple example that will act as a common ground for our components:

As you can see, I have defined a bunch of columns for the grid and then created a new grid, passing it the myGrid div and the gridOptions (which are kept to a minimum). Finally, there’s a click handler for a button that adds an empty row:

var columns = [
    {
        headerName: 'ID', field: 'id', width: 50, editable: true
    }, {
        headerName: 'Name', field: 'name', width: 100, editable: true
    }, {
        headerName: "Address", field: "address", width: 200, editable: true
    }, {
        headerName: "Children", field: "children", width: 200, editable: true
    }
];

var gridOptions = {
    columnDefs: columns,
    rowData: []
};

new agGrid.Grid(document.querySelector('#myGrid'), gridOptions);

document.querySelector('#addRow').addEventListener("click", function() {
    gridOptions.api.addItems([{}]);
});

You may be able to edit both the address and the children of each employee in the above example, however this is not intuitive. The editors will be able to enter any kind of address they want and add the children separated by commas, spaces, dashes or whatever they want. Of course you could add validators to enforce some formatting for these fields, however I think that using custom components gives a much better user experience.

Please notice that the above example, along with the following components, is implemented in pure (no-framework) javascript. Integrating ag-grid with a javascript framework like angular or react in an SPA should not be difficult, however I find it easier to adjust my SPA so that the grid component is separate and does not need interoperability with other SPA components, since all components like renderers and editors will be integrated into the grid!

Also, using pure javascript for your custom components makes them faster than adding another layer of indirection through react, as can be seen in the ag-grid react integration docs:

If you do use React, be aware that you are adding an extra layer of indirection into ag-Grid. ag-Grid’s internal framework is already highly tuned to work incredibly fast and does not require React or anything else to make it faster. If you are looking for a lightning fast grid, even if you are using React and the ag-grid-react component, consider using plain ag-Grid Components (as explained on the pages for rendering etc) inside ag-Grid instead of creating React counterparts.

Editors and renderers

A custom column in ag-grid actually has two distinct parts: an object that is used for rendering and an object that is used for editing the cell value. You can have columns with the built-in editor and a custom renderer, columns with the built-in renderer and a custom editor, and cells with custom editors and renderers. I urge you to read the documentation on both renderers and editors in order to understand most of the decisions I have made for the implemented components.

A renderer can be:

  • a function that receives the value of the cell and returns either an HTML string or a complete DOM object
  • a class that provides methods for returning the HTML string or DOM object for the cell

The function is much easier to use, however biting the bullet and providing a renderer class is better for non-trivial rendering. This is because the function will be called each time the cell needs to be refreshed (refer to the docs on what refreshing means) while the class provides a specific refresh() method that is called instead. This way, using the class, you can generate the DOM structure for the cell once, when it is first created, and then, when its value changes, only call its refresh method to update the value. We’ll see how this works later.

An editor is a class that should provide methods for returning the DOM structure for the cell editing (for example an <input> field) and the current value of the field.

Both renderer and editor classes can be attached to columns using the cellEditor and cellRenderer column properties. You may also pass per-column parameters to each cell using the cellEditorParams and cellRendererParams properties. For example, you may have a renderer for booleans that displays icons for true/false and you want to use different icons depending on the column type, or you may want to create a validation editor that receives a function and accepts the entered value only if the function returns true - the validation function could be different for different column types.

Creating the object cell editor

The first component we’ll present here is an object renderer/editor. This component receives a list of fields and allows the user to edit them, grouped together in a popup. Here’s a fiddle with the Address of each employee using the object editing component:

To integrate it with ag-grid I’ve added an addressFields list containing the fields of the object, like this:

var addressFields = [
  {'name': 'address', 'label': 'Address' },
  {'name': 'zip', 'label': 'ZIP' },
  {'name': 'city', 'label': 'City' },
]

and then passed this as a parameter to both the renderer and editor for the address field:

{
    headerName: "Address", field: "address", width: 200, editable: true,
    cellRenderer: ObjectCellRenderer,
    cellEditor: ObjectEditor,
    cellEditorParams: {
      fields: addressFields
    },
    cellRendererParams: {
      fields: addressFields
    }
}

The ObjectEditor and ObjectCellRenderer are the actual editor and renderer of the component. I will start by presenting the renderer:

function ObjectCellRenderer() {}

ObjectCellRenderer.prototype.init = function (params) {
    // Create the DOM element to display
    this.span = document.createElement('span');
    this.span.innerHTML='';
    this.refresh(params)
};

ObjectCellRenderer is a javascript object to which we add an init method. This method will be called by ag-grid when the component is first created, passing it a params object with various useful parameters, like the user-defined parameters (from cellRendererParams) and the actual value of the cell. We just create an empty span DOM element that will be used to display the value of the object and call refresh.

ObjectCellRenderer.prototype.refresh = function(params) {
    var res = ''
    if(params.value) {
        // If we have a value build the representation
        for(var i=0;i<params.fields.length;i++) {
            res += params.fields[i].label + ': ';
            res += params.value[params.fields[i].name] + ' ';
        }
    }
    // Put representation to the DOM element
    this.span.innerHTML=res;
}

ObjectCellRenderer.prototype.getGui = function () {
    return this.span;
};

The refresh method generates the text value of the cell (that will be put inside the span we created in init). It first checks if the value attribute of params is defined and, if yes, it appends the label of each object attribute (which we pass through cellRendererParams.fields.label) along with its value (which is retrieved from params.value using cellRendererParams.fields.name). Notice that ag-grid puts the result of the getGui method in the cell - so we just return the span we created. Also, we created the span element in init but fill it in refresh, to avoid creating the same element lots of times (this would of course be more important for more expensive operations).

Now let’s continue with ObjectEditor:

function ObjectEditor() {}

// Suppress some keypresses
var onKeyDown = function(event) {
    var key = event.which || event.keyCode;
    if (key == 37 ||  // left
        key == 39 || // right
        key == 9 ) {  // tab
        event.stopPropagation();
    }
}

ObjectEditor.prototype.init = function (params) {
    // Create the container DOM element
    this.container = document.createElement('div');
    this.container.style = "border-radius: 15px; border: 1px solid grey;background: #e6e6e6;padding: 10px; ";
    this.container.onkeydown = onKeyDown

    // Create the object-editing form
    for(i=0;i<params.fields.length;i++) {
        var field = params.fields[i];
        var label = document.createElement('label');
                label.innerHTML = field.label+': ';
        var input = document.createElement('input');
        input.name = field.name;
        if (params.value) {
            input.value = params.value[field.name];
        }

        this.container.appendChild(label);
        this.container.appendChild(input);
        this.container.appendChild(document.createElement('br'));
    }

    // Create a save button
    var saveButton = document.createElement('button');
    saveButton.appendChild(document.createTextNode('Ok'))
    saveButton.addEventListener('click', function (event) {
        params.stopEditing();
    });
    this.container.appendChild(saveButton);
};

The init function of ObjectEditor is used to create a container div element that will hold the actual input elements. Then, using the fields that were passed as a parameter to the editor, it creates a label, an input and a br element and inserts them one by one into the container div. The input is instantiated with the current value of each attribute while its name is taken from the name of the corresponding field (from the fields parameter). Finally, a saveButton is created that stops the editing when clicked.

ObjectEditor.prototype.getGui = function () {
    return this.container;
};

ObjectEditor.prototype.afterGuiAttached = function () {
    var inputs = this.container.getElementsByTagName('input');
    inputs[0].focus();
};

ObjectEditor.prototype.getValue = function () {
    var res = {};
    // Create the cell value (an object) from the inputs values
    var inputs = this.container.getElementsByTagName('input');
    for(j=0;j<inputs.length;j++) {
          res[inputs[j].name] = inputs[j].value.replace(/^\s+|\s+$/g, "");
    }
    return res;
};

ObjectEditor.prototype.destroy = function () {
};

ObjectEditor.prototype.isPopup = function () {
    return true;
};

The other methods of ObjectEditor are simpler: getGui returns the container we built in init, afterGuiAttached is called when the component is attached to the DOM and focuses on the first input element, getValue enumerates the input elements, takes their values (and names) and returns an object with the name/value pairs, destroy doesn’t do anything here however it must be defined and can be used for cleaning up if needed, and isPopup returns true to display the container as a popup instead of inline.

Creating the array-like cell editor

The multi-line renderer/editor will allow a cell to contain a list of values. Here’s the complete fiddle where the “children” column is using the multi-line component:

To integrate it with ag-grid we just need to use the corresponding editor and renderer:

{
    headerName: "Children", field: "children", width: 200, editable: true,
    cellRenderer: MultiLineCellRenderer,
    cellEditor: MultiLineEditor
}

The MultiLineCellRenderer is similar to the ObjectCellRenderer. A span/container element is created at the init method and the refresh method is called to fill it. The refresh method outputs the number of items in the list (i.e it writes N items) and uses the span’s title to display a tooltip with the values of the items:

function MultiLineCellRenderer() {}

MultiLineCellRenderer.prototype.init = function (params) {
    this.span = document.createElement('span');
    this.span.title=''
    this.span.innerHTML='';
    this.refresh(params);
}

MultiLineCellRenderer.prototype.refresh = function (params) {
    if (params.value === "" || params.value === undefined || params.value === null) {
        this.span.innerHTML = '0 items';
    } else {
        var res = ''
        // Create the tooltip for the cell
        for(var i=0;i<params.value.length;i++) {
            res += (i+1)+': ' + params.value[i]
            res+='\n';
        }
        this.span.title = res;
        this.span.innerHTML = params.value.length + ' item' + (params.value.length==1?"":"s");
    }
};

MultiLineCellRenderer.prototype.getGui = function () {
    return this.span;
};

The logic of the MultiLineEditor is also similar to the ObjectEditor:

function MultiLineEditor() {}

MultiLineEditor.prototype.onKeyDown = function (event) {
    var key = event.which || event.keyCode;
    if (key == 37 ||  // left
        key == 39 || // right
        key == 9 ) {  // tab
        event.stopPropagation();
    }
};

// Used to append list items (their value along with the remove button)
MultiLineEditor.prototype.addLine = function(val) {
    var li = document.createElement('li');
    var span = document.createElement('span');
    var removeButton = document.createElement('button');
    removeButton.style='margin-left: 5px; text-align: right; '
    removeButton.innerHTML = 'Remove'
    span.innerHTML = val;
    li.appendChild(span)
    li.appendChild(removeButton)

    this.ul.appendChild(li);
    var that = this;
    removeButton.addEventListener('click', function(event) {
        that.ul.removeChild(this.parentElement);
    });
}

MultiLineEditor.prototype.init = function (params) {
    var that = this;
    this.container = document.createElement('div');
    this.container.style = "border-radius: 15px; border: 1px solid grey;background: #e6e6e6;padding: 10px;";

    this.ul = document.createElement('ol');
    if (params.value) {
        for(i=0;i<params.value.length;i++) {
            this.addLine(params.value[i]);
        }
    }
    this.container.appendChild(this.ul);
    this.input = document.createElement('input');
    this.container.appendChild(this.input);

    this.addButton = document.createElement('button');
    this.addButton.innerHTML = 'Add';
    this.addButton.addEventListener('click', function (event) {
        var val = that.input.value;
        if(!val || val==undefined || val=='') return;
        that.addLine(val, that.ul);
        that.input.value='';
    });
    this.container.appendChild(this.addButton);

    this.saveButton = document.createElement('button');
    this.saveButton.innerHTML = 'Ok';
    this.saveButton.addEventListener('click', function (event) {
        params.stopEditing();
    });
    this.container.appendChild(this.saveButton);

};

A div element that will display the popup editor is created first. Inside this container we add an ol element and then, if there are values in the list, they are appended to that ol using the addLine method. This method creates an li element containing the value of each list item along with a remove button which removes the corresponding li element when clicked (the var that = this is needed because the click callback of the button has a different this than the addLine method, so that is used instead).

After the list of items there's an input whose value will be inserted into the list when the add button is clicked. The same addLine function we used when initializing the component is used here to append the input's value to the ol. Finally, the save button stops the editing:

MultiLineEditor.prototype.getGui = function () {
    return this.container;
};

MultiLineEditor.prototype.afterGuiAttached = function () {
    var inputs = this.container.getElementsByTagName('input');
    inputs[0].focus();
};

MultiLineEditor.prototype.getValue = function () {
    var res = [];
    // The value is the text of all the span items of the li items
    var items = this.ul.getElementsByTagName('span');
    for(var i=0;i<items.length;i++) {
        res.push(items[i].innerHTML);
    }
    return res;
};

MultiLineEditor.prototype.destroy = function () {
};

MultiLineEditor.prototype.isPopup = function () {
    return true;
};

For the value of this component, we just enumerate the li span items of the list we created through addLine and add them to a plain javascript array.

Some more discussion about the components

The above components can be used as they are; however, I think their greatest value is in showing off some of the capabilities of ag-grid and serving as templates to build your own components. Beyond styling or changing the layout of the multi-line and the object editor, I can think of a great number of extensions for them. Some examples, left as an exercise to the reader:

  • The renderer components are rather simple, you may create any DOM structure you want in them
  • Pass a function as a parameter to the object or multi-line editor that will be used as a validator for the values (i.e not allow nulls or enforce a specific format) - then you can add another “error” span that will be displayed if the validation does not pass
  • Make the object editor smarter by passing the type of each attribute (i.e text, number, boolean, date) and check that each input corresponds to that type
  • Pass a min/max number of list items for the multi-line editor
  • The multi-line editor doesn’t really need to have only text values for each item. You could combine the multi-line editor with the object editor so that each line has a specific, object-like structure. For example, for each child we may have two attributes, name and date of birth

Getting a logical backup of all databases of your Postgresql server

In this post I will present a small bash script that can be used to create logical backups of all the databases of a Postgresql server, along with some other goodies. More specifically, the script will:

  • Create two files at /tmp to output information (one for debugging and one with info for sending it through email at the end)
  • Create a backup directory with the current date
  • Create a list of all databases found on the server
  • For each database, vacuum and analyze it, back it up, gzip it and put it in the backup directory
  • Write info about the backup in the info log file
  • Do the same for global objects
  • Send an email when the backup is finished with the info log

Now, in my system I'm using an external folder at /mnt/backupdb to put my backups in. You may either use the same technique or connect remotely to a postgresql server (in which case you need to change the parameters of vacuumdb, pg_dump and pg_dumpall to define the server and the credentials to connect with) and put the backups on a local disk.

#!/bin/sh
echo "" > /tmp/db_backup.log
echo "" > /tmp/db_backup_info.log
date_str=$(date +"%Y%m%d_%H%M%S")
backup_dir=/mnt/backupdb/pg_backup.$date_str

mkdir $backup_dir
pushd $backup_dir > /dev/null

dbs=`sudo -u postgres psql -Upostgres -lt | grep -v : | cut -d \| -f 1 | grep -v template | grep -v -e '^\s*$' | sed -e 's/  *$//'|  tr '\n' ' '`
echo "Will backup: $dbs to $backup_dir" >> /tmp/db_backup_info.log
for db in $dbs; do
  echo "Starting backup for $db" >> /tmp/db_backup_info.log
  filename=$db.$date_str.sql.gz
  sudo -u postgres vacuumdb --analyze -Upostgres $db >> /tmp/db_backup.log
  sudo -u postgres pg_dump -Upostgres -v $db -F p 2>> /tmp/db_backup.log | gzip > $filename
  size=`stat $filename --printf="%s"`
  kb_size=`echo "scale=2; $size / 1024.0" | bc`
  echo "Finished backup for $db - size is $kb_size KB" >> /tmp/db_backup_info.log
done

echo "Backing up global objects" >> /tmp/db_backup_info.log
filename=global.$date_str.sql.gz
sudo -u postgres pg_dumpall -Upostgres -v -g 2>> /tmp/db_backup.log | gzip > $filename
size=`stat $filename --printf="%s"`
kb_size=`echo "scale=2; $size / 1024.0" | bc`
echo "Finished backup for global - size is $kb_size KB" >> /tmp/db_backup_info.log

echo "Ok!" >> /tmp/db_backup_info.log
mail -s "Backup results" spapas@mymail.foo.bar  < /tmp/db_backup_info.log
popd > /dev/null

Let's explain the above script a bit: The first two lines (echo …) just clear out the two files /tmp/db_backup.log and /tmp/db_backup_info.log. The first will contain debug info from the commands and the second one will contain the info that will be sent through email at the end of the backup. After that, we initialize date_str with the current date in the form 20161102_145011 and backup_dir with the directory to save the backups to. We then create the backup directory and switch to it with pushd.

The following, rather long command assigns the names of the databases to the dbs variable. How does it work? psql -lt lists the names of the databases, but it also outputs non-needed information which we remove with the following commands (grep, cut etc). The sed removes trailing whitespace and the tr concatenates the individual lines into a single line, so dbs will have a value like 'db1 db2 …'. For each one of these databases we then build a filename from its name and the date and, after we execute vacuumdb, we use pg_dump piped through gzip to actually create the backup and output it to that file. The other two lines (size and kb_size) calculate the size of the backup file (to be sure that something was actually created) - you'll need to install bc for that. The same process is followed to backup the global objects (usernames etc) using pg_dumpall -g. Finally, we send a mail with the subject "Backup results" and the contents of /tmp/db_backup_info.log as the body.

I've saved this file to /var/lib/pgsql/db_backup_all.sh. To run it I propose using cron: just edit your crontab (through vi /etc/crontab) and add the line

15 2  *  *  * root       /usr/bin/bash /var/lib/pgsql/db_backup_all.sh

This will run the backup every night at 2:15. It uses the root user in order to have access rights to the backup folder. One thing to be careful about is that on Redhat/Centos distributions the above won't work, because sudo requires a tty and cron doesn't have one. To fix this, comment out the line

Defaults    requiretty

of your /etc/sudoers file.

Update 02/12/2016: Here's a slightly better version of the above script that:

  1. Creates two files for each database: one with an SQL script backup and one with a binary backup. Although with the SQL backup you can inspect the dump (and maybe make changes) before applying it, the binary backup is a more foolproof method of restoring everything to your database! Also, instead of restoring the database through psql (as required by the SQL script backup), the binary backup can be restored through the pg_restore tool.
  2. Adds a function to output the file size (so the script is more DRY)

#!/bin/sh

function output_file_size {
  size=`stat $1 --printf="%s"`
  kb_size=`echo "scale=2; $size / 1024.0" | bc`
  echo "Finished backup for $2 - size is $kb_size KB" >> /tmp/db_backup_info.log
}

echo "" > /tmp/db_backup.log
echo "" > /tmp/db_backup_info.log
date_str=$(date +"%Y%m%d_%H%M%S")
backup_dir=/mnt/backupdb/dbpg/pg_backup.$date_str

mkdir $backup_dir
pushd $backup_dir > /dev/null
dbs=`sudo -u postgres psql -Upostgres -lt | cut -d \| -f 1 | grep -v template | grep -v -e '^\s*$' | sed -e 's/  *$//'|  tr '\n' ' '`
#dbs='dgul  hrms  mailer_server  missions  postgres'
echo "Will backup: $dbs to $backup_dir" >> /tmp/db_backup_info.log
for db in $dbs; do
  echo "Starting backup for $db" >> /tmp/db_backup_info.log
  filename=$db.$date_str.sql.gz
  filename_binary=$db.$date_str.bak.gz
  sudo -u postgres vacuumdb --analyze -Upostgres $db >> /tmp/db_backup.log
  sudo -u postgres pg_dump -Upostgres -v $db -F p 2>> /tmp/db_backup.log | gzip > $filename
  sudo -u postgres pg_dump -Upostgres -v $db -F c 2>> /tmp/db_backup.log | gzip > $filename_binary
  output_file_size $filename "$db sql"
  output_file_size $filename_binary "$db bin"
done
echo "Backing up global objects" >> /tmp/db_backup_info.log
filename=global.$date_str.sql.gz
sudo -u postgres pg_dumpall -Upostgres -v -g 2>> /tmp/db_backup.log | gzip > $filename
output_file_size $filename global
echo "Ok!" >> /tmp/db_backup_info.log
mail -s "Backup results" spapas@hcg.gr  < /tmp/db_backup_info.log
popd > /dev/null

A pandas pivot_table primer

Introduction

Recently, I started using the pandas python library to improve the quality (and quantity) of statistics in my applications. One pandas method that I use frequently and find really powerful is pivot_table. This is a rather complex method with rather poor documentation. Beyond that, it is explained a little more in an article about data reshaping; however, even this leaves much to be desired (when I first tried reading it I was overwhelmed by the amount of information there).

A great introduction to pandas is the three-part series by Greg Reda - it touches on pivot_table, however I was only able to understand it properly after I had played a lot with it. I don't know, maybe playing with pivot_table yourself (or being really experienced in such concepts) is the only way to properly comprehend it! To help with this journey, I'm going to try to explain various basic pandas concepts that lead up to the pivot_table command (and some of its friends). Notice that I'm using pandas 0.18.1.

I have to mention that I am no expert in statistics or numerical analysis, so this post won't contain any advanced information and may even point out some obvious things. However, keep in mind that things which may seem obvious to experts are often really difficult to grasp for a non-expert.

Before continuing, please notice that this article has been written as a jupyter notebook and was integrated with pelican using the pelican-ipynb plugin. I had to do some modifications to my theme to better integrate the notebook styling, however some stuff may not look as nice as the other articles. I have to mention that this integration is really great and I totally recommend it!

The DataFrame

The most important data structure that pandas uses is the DataFrame. This can be thought of as a two-dimensional array, something like an Excel spreadsheet. In the pandas nomenclature, the rows of that two-dimensional array are called indexes (while the columns are still called columns); I'll use either rows or indexes for the rows of the DataFrame. The rows are called indexes because they can be used to … index data (think of each column as a dictionary). However, please notice that pandas has a different data structure named Index that is used to store the names of the headers (axis) of the rows and columns.

If we horizontally/vertically pick the values of a single row (index) / column we'll be left with a different data structure called Series - this is more or less a single-dimensional array (or a dictionary with the names of the columns/indexes as keys). There's also a Panel data structure which is 3-dimensional, more like a complete Excel workbook (the third dimension being the individual sheets of the workbook), but I won't cover that here.

More info on the above can be found on the corresponding article about data structures.

There are various ways to read the data for a Series or DataFrame: initializing through arrays or dicts, reading from csv, xls or a database, combining series to create an array, and various others. I won't go into details about all of them, but I will include some examples on how to create Series and DataFrames. If you are familiar with python you can just convert everything to a dict and read that instead of researching the individual methods.
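
For example, here's a minimal sketch of a couple of these options (the books.csv filename is just a placeholder and not a file used elsewhere in this article):

import pandas as pd

# From a list of dicts - each dict becomes a row, its keys become the columns
df = pd.DataFrame([
    {'name': 'It', 'pages': 1138},
    {'name': 'The Stand', 'pages': 823},
])
print(df)

# From a csv file with the same columns (hypothetical filename - uncomment if such a file exists)
# df = pd.read_csv('books.csv')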

Using Series

The Series data structure is more or less used to store a single dimensional array of data. This array-like structure could either have numbers as indexes (so will be more similar to a normal array) or have textual indexes (so will be more similar to a dictionary). Let’s see some examples:

In [203]:
import pandas as pd

def t(o):
    # Return the class name of the object
    return o.__class__.__name__

# Use an array to create a Series
series1 = pd.Series([10,20,30])
print "series1 (", t(series1), ')\n', series1
# Notice that the index names were automatically generated as 0,1,2
# Use a dict to create a Series
# notice that the keys of the dict will be the index names
series2 = pd.Series({'row1':11,'row2':22,'row3':33})
print "series2 (", t(series2), ')\n', series2
series1 ( Series )
0    10
1    20
2    30
dtype: int64
series2 ( Series )
row1    11
row2    22
row3    33
dtype: int64

There are various ways to select values from a Series. You can use textual or numeric indexes or you can filter the elements using an intuitive syntax:

In [204]:
# Get values from series using named indexes
series2['row1']
# Can also use slicing and interesting operations 
# like array in array [[]] to select specific indexes
print series1[1:]
print series1[[0,2]]
print series2['row2':]
print series2[['row1', 'row3']]
1    20
2    30
dtype: int64
0    10
2    30
dtype: int64
row2    22
row3    33
dtype: int64
row1    11
row3    33
dtype: int64
In [205]:
# Filtering series
# You can use comparison operators with a Series to 
# get an array of booleans with the result of each element
print "Boolean result\n", series2>15
# This boolean array can then be used to filter the Series
# by returning only the elements that are "True"
print "Filtered result\n", series2[series2>15]
Boolean result
row1    False
row2     True
row3     True
dtype: bool
Filtered result
row2    22
row3    33
dtype: int64
In [206]:
# The above means that we'll only get the second and third (index: 0,2)

# So we can create a function that returns Boolean, apply it to 
# all elements of series with map and use the result for indexing!
def is_22(x):
    return x==22
print "Map filtering\n", series2[series2.map(is_22)]
Map filtering
row2    22
dtype: int64

The map method above gets a callback function and applies it to all elements of the Series, returning a new Series with the results. It is similar to the global python map(function, sequence) -> list function. Using map filtering is the most general way to filter the elements of a Series.

Using DataFrames

Let’s start by a quick introduction to see some basic operations on DataFrames:

In [207]:
# Create a DataFrame using a two-dimensional array
# Notice that the indexes and column names were automatically generated
df1 = pd.DataFrame([[10,20,30], [40,50,60]])
print "Dataframe from array: df1(", t(df1), ')'
print df1

# Use a dict to give names to columns
df2 = pd.DataFrame([{'col1':10,'col2':20,'col3':30}, {'col1':40,'col2':50,'col3':60}])
print "Dataframe from dict: df2(", t(df2), ')'
print df2

# Give names to indexes
df3 = pd.DataFrame([
    {'col1':10,'col2':20,'col3':30}, 
    {'col1':40,'col2':50,'col3':60}
], index=['idx1', 'idx2'])
print "Dataframe from dict, named indexes: df3(", t(df3), ')'
print df3

# What happens when columns are missing
df4 = pd.DataFrame([{'col1':10,'col2':20,'col3':30}, {'col2':40,'col3':50,'col4':60}])
print "Dataframe from dict, missing columns: df4(", t(df4), ')'
print df4

# Create a DataFrame by combining series
df5 = pd.DataFrame([pd.Series([1,2]), pd.Series([3,4])], index=['a', 'b'], )
print "Dataframe from series: df5(", t(df5), ')'
print df5

# Output a dataframe as html
print df5.to_html()

# Notice that there are many more interesting DataFrame output methods, like
# to_csv, to_dict, to_excel, to_json, to_latex, to_msgpack, to_string,
Dataframe from array: df1( DataFrame )
    0   1   2
0  10  20  30
1  40  50  60
Dataframe from dict: df2( DataFrame )
   col1  col2  col3
0    10    20    30
1    40    50    60
Dataframe from dict, named indexes: df3( DataFrame )
      col1  col2  col3
idx1    10    20    30
idx2    40    50    60
Dataframe from dict, missing columns: df4( DataFrame )
   col1  col2  col3  col4
0  10.0    20    30   NaN
1   NaN    40    50  60.0
Dataframe from series: df5( DataFrame )
   0  1
a  1  2
b  3  4
0 1
a 1 2
b 3 4

Reading a DataFrame from an array of python dicts is (at least for me) the easiest way to put my data in a DataFrame. Use any normal python method to generate that array of dicts and then just initialize the DataFrame with it. Also, the to_html method is really useful for quickly outputting a DataFrame to your web application - don't forget to add some styling to the .dataframe class!

Selecting values from the DataFrame is very easy if you know how to do it. You index ([]) directly to select columns:

In [208]:
print "df3(", t(df3), ")\n", df3

# We can get a column as a Series
print "Get column as series\n", df3['col3']
# Or multiple columns as a DataFrame
print "Get multiple columns\n", df3[['col3', 'col2']]

# We can also get the column by its idx 
print "Get column by index\n", df3[df3.columns[1]]

# Pick values from a dataframe using array indexing
# df3['col2'] returns a Series so using the ['idx2'] 
# index to it will return the actual value
print "Get value\n", df3['col2']['idx2']
df3( DataFrame )
      col1  col2  col3
idx1    10    20    30
idx2    40    50    60
Get column as series
idx1    30
idx2    60
Name: col3, dtype: int64
Get multiple columns
      col3  col2
idx1    30    20
idx2    60    50
Get column by index
idx1    20
idx2    50
Name: col2, dtype: int64
Get value
50

Also, you can use the loc/iloc properties of the DataFrame to select rows/indexes (loc by label, iloc by number). loc/iloc actually behave as a two-dimensional array - they can get two parameters, the first one being the row/rows and the second one being the column/columns:

In [209]:
# Pick an index (select a horizontal line) as a series
print "Get index as a series\n", df3.loc['idx1']
# Also can pick by index number
print "Get index as a series by index\n", df3.iloc[0]
# iloc can be used to numerically index both rows and columns by passing two indexes:
print "Two dimensional - get by index\n",df3.iloc[0, :] # This is the same as the previous
# so to select the first column we'll use
print "Two dimensional - get by column\n", df3.iloc[:, 0]
# We could do more interesting things, for example select a square
print "Two dimensional - get by index and column\n", df3.iloc[0:2, 1:3]
# Loc which is for label based indexing can also be used as a two dimensional index
print "Two dimensional - use label based indexing\n", df3.loc[['idx1','idx2'], 'col1']
Get index as a series
col1    10
col2    20
col3    30
Name: idx1, dtype: int64
Get index as a series by index
col1    10
col2    20
col3    30
Name: idx1, dtype: int64
Two dimensional - get by index
col1    10
col2    20
col3    30
Name: idx1, dtype: int64
Two dimensional - get by column
idx1    10
idx2    40
Name: col1, dtype: int64
Two dimensional - get by index and column
      col2  col3
idx1    20    30
idx2    50    60
Two dimensional - use label based indexing
idx1    10
idx2    40
Name: col1, dtype: int64

Of course, boolean indexing and filtering can also be used just like in Series:

In [210]:
print "Boolean dataframe\n", df3>30
print "Boolean indexing\n",df3[df3>30]
def is_20_or_50(x):
    return x==20 or x==50
# We need to use applymap (instead of map we used in Series)
print "Boolean indexing\n",df3[df3.applymap(is_20_or_50)]
Boolean dataframe
       col1   col2   col3
idx1  False  False  False
idx2   True   True   True
Boolean indexing
      col1  col2  col3
idx1   NaN   NaN   NaN
idx2  40.0  50.0  60.0
Boolean indexing
      col1  col2  col3
idx1   NaN    20   NaN
idx2   NaN    50   NaN

Notice that for the DataFrame we use the applymap method, which applies the callback function to all individual elements of the DataFrame and returns the result as a new DataFrame (with the same dimensions of course). The boolean indexing is nice but it does not actually drop the unneeded elements; we just get a NaN in the positions that are filtered out. Could we do something better? The answer is yes, but we'll need to do index/column boolean indexing - i.e. select only specific columns or indexes and then use these to filter the dataframe:

In [211]:
# Let's see the indexes that have *10* in their col1 column
print df3['col1']==10
# And then select *only* these indexes (i.e idx1)
print df3[df3['col1']==10]
# Now we can do exactly the opposite (see columns that have 10 in their idx1 index)
print df3.loc['idx1']==10
# And then select *only* these columns (i.e col1)
print df3.loc[:, df3.loc['idx1']==10]
idx1     True
idx2    False
Name: col1, dtype: bool
      col1  col2  col3
idx1    10    20    30
col1     True
col2    False
col3    False
Name: idx1, dtype: bool
      col1
idx1    10
idx2    40
In [212]:
# Let's finally see a general solution to boolean selecting with loc:
# Select specific columns 
print df3.loc[:, [True, False, True] ]
# Select specific rows
print df3.loc[[False, True], : ]
# Select specific rows and cols
print df3.loc[[False, True], [True, False,True] , ]
# So we can pass two boolean arrays to loc, the first for selecting indexes and 
# the second for selecting columns
      col1  col3
idx1    10    30
idx2    40    60
      col1  col2  col3
idx2    40    50    60
      col1  col3
idx2    40    60

Modifying DataFrames

It's easy to modify a DataFrame by changing its values, adding more indexes/columns, dropping rows and columns, or renaming columns and indexes. Notice that some operations are performed in place (so they modify the original DataFrame), while others return a copy of the original DataFrame.

In [213]:
# Let's copy because some of the following operators change the dataframes
df = df3.copy()
print df

print "Change values of a column"
df['col1'] = [11,41]
print df

print "Change values of an index"
df.loc['idx1'] = [11,21, 31]
print df

print "We can change more specific values (a 2x2 array here)"
df.iloc[0:2, 0:2] = [[4,3], [2,1]]
print df

print "Add another column to an existing dataframe (changes DataFrame)"
df['col4'] = [1,2]
print df

print "Add another row (index) to an existing dataframe (changes DataFrame)"
df.loc['idx3']=[100,200,300,400]
print df

print "Drop a row (returns new object)"
print df.drop('idx1')

print "Drop a column (returns new object)"
print df.drop('col1', axis=1)

print "Rename index (returns new object)"
print df.rename(index={'idx1': 'new-idx-1'})

print "Rename column (returns new object)"
print df.rename(columns={'col1': 'new-col-1'})

print "Transpose array- change columns to rows and vice versa"
print df.T

print "Double transpose - returns the initial DataFrame"
print df.T.T
      col1  col2  col3
idx1    10    20    30
idx2    40    50    60
Change values of a column
      col1  col2  col3
idx1    11    20    30
idx2    41    50    60
Change values of an index
      col1  col2  col3
idx1    11    21    31
idx2    41    50    60
We can change more specific values (a 2x2 array here)
      col1  col2  col3
idx1     4     3    31
idx2     2     1    60
Add another column to an existing dataframe (changes DataFrame)
      col1  col2  col3  col4
idx1     4     3    31     1
idx2     2     1    60     2
Add another row (index) to an existing dataframe (changes DataFrame)
      col1  col2  col3  col4
idx1     4     3    31     1
idx2     2     1    60     2
idx3   100   200   300   400
Drop a row (returns new object)
      col1  col2  col3  col4
idx2     2     1    60     2
idx3   100   200   300   400
Drop a column (returns new object)
      col2  col3  col4
idx1     3    31     1
idx2     1    60     2
idx3   200   300   400
Rename index (returns new object)
           col1  col2  col3  col4
new-idx-1     4     3    31     1
idx2          2     1    60     2
idx3        100   200   300   400
Rename column (returns new object)
      new-col-1  col2  col3  col4
idx1          4     3    31     1
idx2          2     1    60     2
idx3        100   200   300   400
Transpose array- change columns to rows and vice versa
      idx1  idx2  idx3
col1     4     2   100
col2     3     1   200
col3    31    60   300
col4     1     2   400
Double transpose - returns the initial DataFrame
      col1  col2  col3  col4
idx1     4     3    31     1
idx2     2     1    60     2
idx3   100   200   300   400

More advanced operations

Beyond the previous, more or less basic, operations, pandas allows you to do some advanced things like SQL-like joins of more than one dataset, or applying a function to each of the rows/columns or even to individual cells of the DataFrame:

In [214]:
authors_df=pd.DataFrame([{'id': 1, 'name':'Stephen King'}, {'id': 2, 'name':'Michael Crichton'}],  )

books_df=pd.DataFrame([
    {'id': 1, 'author_id':1, 'name':'It'}, 
    {'id': 2, 'author_id':1, 'name':'The Stand'}, 
    {'id': 3, 'author_id':2, 'name':'Airframe'},
    {'id': 4, 'author_id':2, 'name':'Jurassic Park'}
])

print authors_df
print books_df
print books_df.merge(authors_df, left_on='author_id', right_on='id')
   id              name
0   1      Stephen King
1   2  Michael Crichton
   author_id  id           name
0          1   1             It
1          1   2      The Stand
2          2   3       Airframe
3          2   4  Jurassic Park
   author_id  id_x         name_x  id_y            name_y
0          1     1             It     1      Stephen King
1          1     2      The Stand     1      Stephen King
2          2     3       Airframe     2  Michael Crichton
3          2     4  Jurassic Park     2  Michael Crichton

As can be seen above, the merge method of DataFrame can be used to do an SQL-like join with another DataFrame, using specific columns as the join keys for each of the two dataframes (left_on and right_on). There are a lot of options for doing various join types (left, right, inner, outer etc) and concatenating DataFrames in other ways - most are discussed in the corresponding post.
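
For example, a left outer join of the same two frames could look like this (the _book / _author suffixes are just names I picked for the overlapping id and name columns):

# Keep all books, even ones whose author_id has no match in authors_df;
# the overlapping 'id' and 'name' columns get the given suffixes
joined = books_df.merge(
    authors_df,
    left_on='author_id', right_on='id',
    how='left',
    suffixes=('_book', '_author'),
)
print(joined)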

Let's see another way of doing the above join that offers more control, using the apply method of DataFrame, which applies a function to each row/column of the DataFrame and returns the result as a Series:

In [215]:
# Let's do the join using a different method
def f(r):
    author_df_partial = authors_df[authors_df['id']==r['author_id']]
    return author_df_partial.iloc[0]['name']
        
books_df['author name'] = books_df.apply(f, axis=1)
print books_df
   author_id  id           name       author name
0          1   1             It      Stephen King
1          1   2      The Stand      Stephen King
2          2   3       Airframe  Michael Crichton
3          2   4  Jurassic Park  Michael Crichton

How does this work? We pass the axis=1 parameter to apply so that the callback function will be called for each row of the DataFrame (by default axis=0, which means it would be called for each column). So, f will be called with each row as its input. From this books_df row we get the author_id it contains and filter authors_df by it. Notice that author_df_partial is actually a DataFrame containing only one row, so we need to get its only line using iloc[0], which will return a Series; finally, we return the author name using the corresponding index name.

When calling the apply method, by default the axis parameter is 0 (i.e. the function will be called for each column). When I first encountered this I found it very strange, because I thought that most users would usually want to apply a function to each of the rows. However, there's a reason for applying the function to all columns; here's an example:

In [216]:
values = pd.DataFrame([
    {'temperature': 31, 'moisture': 68},
    {'temperature': 33, 'moisture': 72},
    {'temperature': 31.5, 'moisture': 58},
    {'temperature': 28.5, 'moisture': 42},
])

import numpy as np
# We can easily create statistics for our data using apply -- that's why
# axis=0 is the default parameter to apply (to operate vertically to each column)
values.loc['avg']=values.apply(np.average )
values.loc['len']=values.apply(len )
values.loc['sum']=values.apply(sum)
print values
     moisture  temperature
0        68.0         31.0
1        72.0         33.0
2        58.0         31.5
3        42.0         28.5
avg      60.0         31.0
len       5.0          5.0
sum     305.0        160.0

Comprehending pivot_table

After this (rather long) introduction to using and manipulating DataFrames, the time has come to see pivot_table. The pivot_table method is applied to a DataFrame and its purpose is to "reshape" and "aggregate" its values. More on reshaping can be found here; it means changing the indexes/columns of the DataFrame to create a new DataFrame that fits our needs. Aggregate, on the other hand, means that for each cell of the new DataFrame we'll create a summary of the data that would have appeared there.

For the pivot_table operations we'll use a nice set of book data; the cell that actually creates it (books_df) can be found a little further below, right before the first pivot_table example.

The recommended type of input (at least by me) to pivot_table is a simple DataFrame like that one: your index will be the id of your database rows (or you could even have an auto-generated index like in the example) and the columns will be the values you want to aggregate and reshape. This is very easy to create, either by reading a file (xls/csv) or by a simple SQL query (substituting all foreign keys with a representative value). In that example we actually have the following columns: author, genre, name, pages, year, decade, size - this is a pool of data that will be very useful to remember later, and it is important to also keep it in mind for your own data. So, use a unique id as the index and remember the names of your columns.

As we can see in the documentation, the pivot_table method uses four basic parameters:

  • index: An array of the data that will be used as indexes to the resulting (i.e the reshaped and aggregated) DataFrame
  • columns: An array of the data that will be used as a columns to the resulting DataFrame
  • values: An array of the data whose values we want to aggregate in each cell
  • aggfunc: Which is the function (or functions) that will be used for aggregating the values

So, how does it actually work? You select a number of the headers from your pool of data and assign them to either index or columns, depending on whether you want to put them horizontally or vertically (there's a minimal sketch right after the following list). Notice that both index and columns:

  • take either a string (to denote a single column) or an array to denote multiple columns
  • are optional (but you must define one of them) — if you skip either columns or index you’ll get a Series instead of a DataFrame
  • are interchangeable (you can put any header from your pool in either index or columns, depending on how you want to display your data)
  • are mutually exclusive (you can’t put the same header in both index and columns)
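
Here's a minimal sketch of where each parameter ends up, using a small throwaway DataFrame (not the book data we'll use later):

import pandas as pd

# a tiny made-up frame, just to show where each parameter goes
toy = pd.DataFrame([
    {'decade': "80's", 'genre': 'Horror',  'pages': 1138},
    {'decade': "80's", 'genre': 'Fantasy', 'pages': 224},
    {'decade': "90's", 'genre': 'Horror',  'pages': 529},
])
# decade values become the row labels, genre values the column labels and
# the pages are averaged (the default aggfunc) inside each cell
print(toy.pivot_table(index='decade', columns='genre', values='pages'))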

Using multiple data headers means that you'll have hierarchical indexes / columns in your pivot (or a MultiIndex as it's called - remember that Index is used to store the axis labels of the DataFrame), i.e. the rows/columns will be grouped by a hierarchy. Let's see an example of multiple indexes:

If we used 'decade' as an index, then the pivot_table index would be like

  • 70s value1 value2 …
  • 80s value1 value2 …
  • 90s value1 value2 …

while, if we used ['decade', 'year'] we'd have something like

  • 70s
    • 1975 value1 value2 …
    • 1978 value1 value2 …
  • 80s
    • 1980 value1 value2 …
    • 1982 value1 value2 …
  • 90s
    • 1990 value1 value2 …

So, each year would automatically be grouped under its corresponding decade. The same would be true if we used ['decade', 'year'] in columns (but we'd now have a vertical grouping, from top to bottom). Notice that pandas doesn't know if the values have any parent/child relationship but just goes from left to right (or top to bottom). For example, if we had used ['year', 'decade'], we'd get something like:

  • 1975 70s value1 value2 …
  • 1978 70s value1 value2 …
  • 1980 80s value1 value2 …
  • 1982 80s value1 value2 …

Also, pandas doesn't care if the values of the hierarchical index are actually related. We could, for example, have selected a multi-index of ['decade', 'size', 'genre'] that would

  • display the list of decades at the left (or at the top if we used it as a column)
  • for each decade display the sizes of the books of that decade at the center (header column or row) and finally
  • at the right header (or bottom correspondingly) will display the available genres for each size.

So, since we have 3 values for decade, 3 values for size and 4 values for genre in our dataset, each decade will appear once (at the left), each size will appear 3 times (one for each decade) in the middle and each genre will appear 3x3 = 9 times (one for each combination of decade and size) at the right. The total number of lines that our MultiIndex will contain is 3x3x4 = 36 (one line for each combination of decade/size/genre).
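
To verify that arithmetic, we can build such a MultiIndex directly from the cartesian product of three made-up label sets:

import pandas as pd

# hypothetical label sets matching the counts mentioned above
decades = ["70's", "80's", "90's"]
sizes = ['small', 'medium', 'big']
genres = ['Crime', 'Fantasy', 'Horror', 'Thriller']
# the full cartesian product of the three levels
mi = pd.MultiIndex.from_product([decades, sizes, genres],
                                names=['decade', 'size', 'genre'])
print(len(mi))  # 3 * 3 * 4 = 36 combinations

(Keep in mind that, as we'll see later, pivot_table will by default drop the combinations for which there is no data, so the actual pivot may display fewer lines.)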

I hope the above clarifies how index and columns are used to create the headers for the rows and columns of the pivot_table. I will show some examples of various index and columns combinations, but first I'd like to talk about the contents of the pivot table (since we've only talked about the headers of rows/columns until now).

The values that the pivot_table will contain are defined through the other two parameters, values and aggfunc: We select one or more columns of the initial DataFrame through the values parameter and these are aggregated in the corresponding cell of the resulting DataFrame using the aggfunc function. So, for each cell as defined by index and columns, pandas will pick the values that correspond to that cell and pass them to a function that will return the result (by combining these values). As can be understood, the values must be different than index and columns (so all three sets - values, index and columns - must not intersect). The values and aggfunc parameters may be omitted; this results in using the average as the function and selecting all numerical columns (that are not in index or columns of course) as the values.

I know that this is difficult to understand so I’ll give a simple example right away. Let’s first create a nice set of data for our samples:

In [218]:
books_df=pd.DataFrame([
    {'author':'Stephen King', 'name':'It', 'pages': 1138, 'year': 1986, 'genre': 'Horror',},  
    {'author':'Stephen King', 'name':'The Stand', 'pages': 823, 'year': 1978, 'genre': 'Horror',}, 
    {'author':'Stephen King', 'name': 'Salem\'s Lot', 'pages': 439, 'year': 1975, 'genre': 'Horror',},
    {'author':'Stephen King', 'name': 'Misery', 'pages': 320, 'year': 1987, 'genre': 'Thriller',},
    {'author':'Stephen King', 'name': 'Pet Sematary', 'pages': 374, 'year': 1983, 'genre': 'Horror',},
    {'author':'Stephen King', 'name': 'Bag of bones', 'pages': 529, 'year': 1998, 'genre': 'Horror',},
    {'author':'Stephen King', 'name': 'Different Seasons', 'pages': 527, 'year': 1982, 'genre': 'Thriller',},
    {'author':'Stephen King', 'name': 'The Dark Tower: The Gunslinger', 'pages': 224, 'year': 1982, 'genre': 'Fantasy',},
    {'author':'Stephen King', 'name': 'The Dark Tower II: The Drawing of the Three', 'pages': 400, 'year': 1987, 'genre': 'Fantasy',},
    {'author':'Stephen King', 'name': 'The Dark Tower III: The Waste Lands', 'pages': 512, 'year': 1991, 'genre': 'Fantasy',},
    {'author':'Stephen King', 'name': 'The Dark Tower IV: Wizard and Glass', 'pages': 787, 'year': 1998, 'genre': 'Fantasy',},
    {'author':'Michael Crichton', 'name':'Airframe', 'pages': 352, 'year': 1996, 'genre': 'Crime',},
    {'author':'Michael Crichton', 'name':'Jurassic Park', 'pages': 448, 'year':1990, 'genre': 'Fantasy',},
    {'author':'Michael Crichton', 'name':'Congo', 'pages': 348, 'year':1980, 'genre': 'Fantasy',},
    {'author':'Michael Crichton', 'name':'Sphere', 'pages': 385, 'year':1987, 'genre': 'Fantasy',},
    {'author':'Michael Crichton', 'name':'Rising Sun', 'pages': 385, 'year':1992, 'genre': 'Crime',},
    {'author':'Michael Crichton', 'name':'Disclosure ', 'pages': 597, 'year':1994, 'genre': 'Crime',},
    {'author':'Michael Crichton', 'name':'The Lost World ', 'pages': 430, 'year':1995, 'genre': 'Fantasy',},
    {'author':'John Grisham', 'name':'A Time to Kill', 'pages': 515, 'year':1989, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Firm', 'pages': 432, 'year':1991, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Pelican Brief', 'pages': 387, 'year':1992, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Chamber', 'pages': 496, 'year':1994, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Rainmaker', 'pages': 434, 'year':1995, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Runaway Jury', 'pages': 414, 'year':1996, 'genre': 'Crime',},
    {'author':'John Grisham', 'name':'The Street Lawyer', 'pages': 347, 'year':1998, 'genre': 'Crime',},
    {'author':'George Pelecanos', 'name':'Nick\'s Trip ', 'pages': 276, 'year':1993, 'genre': 'Crime',},
    {'author':'George Pelecanos', 'name':'A Firing Offense', 'pages': 216, 'year':1992, 'genre': 'Crime',},
    {'author':'George Pelecanos', 'name':'The Big Blowdown', 'pages': 313, 'year':1996, 'genre': 'Crime',},
    {'author':'George R.R Martin', 'name':'A Clash of Kings', 'pages': 768, 'year':1998, 'genre': 'Fantasy',},
    {'author':'George R.R Martin', 'name':'A Game of Thrones', 'pages': 694, 'year':1996, 'genre': 'Fantasy',},
])

# Add a decade column to the books DataFrame
def add_decade(y):
        return str(y['year'])[2] + '0\'s'
    
books_df['decade'] = books_df.apply(add_decade, axis=1)

# Add a size column to the books DataFrame
def add_size(y):
        if y['pages'] > 600:
            return 'big'
        elif y['pages'] < 300:
            return 'small'
        return 'medium'
    
books_df['size'] = books_df.apply(add_size, axis=1)
# Let's display it sorted here
books_df.sort_values(['decade', 'genre', 'year'])
Out[218]:
author genre name pages year decade size
2 Stephen King Horror Salem’s Lot 439 1975 70’s medium
1 Stephen King Horror The Stand 823 1978 70’s big
18 John Grisham Crime A Time to Kill 515 1989 80’s medium
13 Michael Crichton Fantasy Congo 348 1980 80’s medium
7 Stephen King Fantasy The Dark Tower: The Gunslinger 224 1982 80’s small
8 Stephen King Fantasy The Dark Tower II: The Drawing of the Three 400 1987 80’s medium
14 Michael Crichton Fantasy Sphere 385 1987 80’s medium
4 Stephen King Horror Pet Sematary 374 1983 80’s medium
0 Stephen King Horror It 1138 1986 80’s big
6 Stephen King Thriller Different Seasons 527 1982 80’s medium
3 Stephen King Thriller Misery 320 1987 80’s medium
19 John Grisham Crime The Firm 432 1991 90’s medium
15 Michael Crichton Crime Rising Sun 385 1992 90’s medium
20 John Grisham Crime The Pelican Brief 387 1992 90’s medium
26 George Pelecanos Crime A Firing Offense 216 1992 90’s small
25 George Pelecanos Crime Nick’s Trip 276 1993 90’s small
16 Michael Crichton Crime Disclosure 597 1994 90’s medium
21 John Grisham Crime The Chamber 496 1994 90’s medium
22 John Grisham Crime The Rainmaker 434 1995 90’s medium
11 Michael Crichton Crime Airframe 352 1996 90’s medium
23 John Grisham Crime The Runaway Jury 414 1996 90’s medium
27 George Pelecanos Crime The Big Blowdown 313 1996 90’s medium
24 John Grisham Crime The Street Lawyer 347 1998 90’s medium
12 Michael Crichton Fantasy Jurassic Park 448 1990 90’s medium
9 Stephen King Fantasy The Dark Tower III: The Waste Lands 512 1991 90’s medium
17 Michael Crichton Fantasy The Lost World 430 1995 90’s medium
29 George R.R Martin Fantasy A Game of Thrones 694 1996 90’s big
10 Stephen King Fantasy The Dark Tower IV: Wizard and Glass 787 1998 90’s big
28 George R.R Martin Fantasy A Clash of Kings 768 1998 90’s big
5 Stephen King Horror Bag of bones 529 1998 90’s medium
In [219]:
# Here's the first example
books_df.pivot_table(index=['decade', ], columns=['genre'], )
Out[219]:
pages year
genre Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller
decade
70’s NaN NaN 631.0 NaN NaN NaN 1976.5 NaN
80’s 515.000000 339.25 756.0 423.5 1989.000000 1984.000000 1984.5 1984.5
90’s 387.416667 606.50 529.0 NaN 1994.083333 1994.666667 1998.0 NaN

In the above, we aggregated our books by their decade and genre.

As we can see, we just passed decade as the index and genre as the columns. We omitted values and aggfunc, so the default values were used. What happened? Pandas created a new DataFrame that has the values of decade as its index and the values of genre as its columns. Now, for each of the values (remember that since we omitted values, pandas just gets all numerical data, i.e. pages and year) it found the corresponding entries for each cell, got their average and put it in that cell. For example, since there are no Crime genre books in the 70's we got a NaN in both the pages and year values. However, there are two Horror books, with 823 and 439 pages, so their average is 631. Notice that for each value a separate top-level multi-column containing all indexes and columns was created - we can display only pages or year by indexing with ['pages'] or ['year']. We can think of each of the values columns as a separate pivot table, so in the above example we have a pivot table for pages and a pivot table for year.

The year column above also uses the default average aggregate, something that doesn't actually make sense. So we can use values to explicitly define which values to aggregate; here's how we can display only the pages:

In [220]:
books_df.pivot_table(index=['decade', ], columns=['genre'], values='pages')
#The above is more or less the same as with books_df.pivot_table(index=['decade', ], columns=['genre'], )['pages']
Out[220]:
genre Crime Fantasy Horror Thriller
decade
70’s NaN NaN 631.0 NaN
80’s 515.000000 339.25 756.0 423.5
90’s 387.416667 606.50 529.0 NaN
In [221]:
# In the above, we could pass ['pages'] instead of 'pages' as the values.
# This will result in creating a multi-column index with 'pages' as the top level column
books_df.pivot_table(index=['decade', ], columns=['genre'], values=['pages'])
Out[221]:
pages
genre Crime Fantasy Horror Thriller
decade
70’s NaN NaN 631.0 NaN
80’s 515.000000 339.25 756.0 423.5
90’s 387.416667 606.50 529.0 NaN
In [222]:
# Also, please notice that you can skip index or columns (but not both) to get a series
print books_df.pivot_table(index=['decade', ], values='pages')
print books_df.pivot_table(columns=['decade', ], values='pages')
decade
70's    631.000000
80's    470.111111
90's    464.052632
Name: pages, dtype: float64
decade
70's    631.000000
80's    470.111111
90's    464.052632
Name: pages, dtype: float64

Notice that above we have exactly the same result, since in both cases we got a Series (it doesn't matter that we used index in the first and columns in the second). Also, since we used fewer columns from our data pool (we used only decade, while previously we used both decade and genre), the aggregation is more coarse: we got the average number of book pages in each decade. Of course, we could get the same values as before but use a multi-column index:

In [223]:
s1 = books_df.pivot_table(index=['decade', 'genre'], values='pages')
s2 = books_df.pivot_table(columns=['decade', 'genre'], values='pages')
print "s1 equals s2: ", s1.equals(s2)
print s1.index
s1
s1 equals s2:  True
MultiIndex(levels=[[u'70's', u'80's', u'90's'], [u'Crime', u'Fantasy', u'Horror', u'Thriller']],
           labels=[[0, 1, 1, 1, 1, 2, 2, 2], [2, 0, 1, 2, 3, 0, 1, 2]],
           names=[u'decade', u'genre'])
Out[223]:
decade  genre   
70's    Horror      631.000000
80's    Crime       515.000000
        Fantasy     339.250000
        Horror      756.000000
        Thriller    423.500000
90's    Crime       387.416667
        Fantasy     606.500000
        Horror      529.000000
Name: pages, dtype: float64

Both of the above return a Series with a multi-column index (and they are equal). Notice that the data is exactly the same as when we passed decade and genre to index and columns. The only difference is that the NaN entries have been dropped from the Series, while they are there in the DataFrame, for example Crime/70's (a DataFrame will by default drop a row or column only if all its values are NaN). Finally, take a look at how the MultiIndex is represented above - it's quite easy to decipher.

Let’s now say that we actually wanted to have a meaningful value for the year, for example the first year we have a book for that genre/decade:

In [224]:
# I'll intentionally skip values again to see what happens
books_df.pivot_table(index=['decade', ], columns=['genre'], aggfunc=min )
Out[224]:
author name pages year size
genre Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller
decade
70’s None None Stephen King None None None Salem’s Lot None None None 439 None None None 1975 None None None big None
80’s John Grisham Michael Crichton Stephen King Stephen King A Time to Kill Congo It Different Seasons 515 224 374 320 1989 1980 1983 1982 medium medium big medium
90’s George Pelecanos George R.R Martin Stephen King None A Firing Offense A Clash of Kings Bag of bones None 216 430 529 None 1991 1990 1998 None medium big medium None

This is more interesting. Since we didn't use the default aggfunc but passed our own (min), pandas did not use only the numerical values; instead it used all the remaining columns as values. Remember that our pool of data was author, genre, name, pages, year, decade, size, and since genre and decade were used as the column/index, the remaining headers were used as values: author, name, pages, year, size! For pages and year we can understand what happens: for example, for the Horror novels of the 80's, the one with the minimal pages is Pet Sematary with 374 pages, which also has the minimal year (1983). However, the one with the minimal name is It (since I comes before P - it just compares the strings). The author is the same for both (Stephen King) and the minimum size is big (again a string comparison: big (b) < medium (m)). Of course we could pass the values parameter to explicitly define which values we wanted to see.
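
For example, something like the following should display only the first (minimum) year for each genre/decade combination:

# keep only the year column as values and aggregate it with min
books_df.pivot_table(index=['decade'], columns=['genre'], values='year', aggfunc=min)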

Another really interesting thing is to take a peek at the values that are actually passed to the aggregation function. For this, we can just use tuple:

In [225]:
# Please notice that for reasons unknown to me, if I used aggfunc=tuple it would throw an exception
books_df_tuples = books_df.pivot_table(index=['decade', ], columns=['genre'], aggfunc=lambda x: tuple(x))
books_df_tuples
Out[225]:
author name pages year size
genre Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller
decade
70’s None None (Stephen King, Stephen King) None None None (The Stand, Salem’s Lot) None None None (823, 439) None None None (1978, 1975) None None None (big, medium) None
80’s (John Grisham,) (Stephen King, Stephen King, Michael Crichton,… (Stephen King, Stephen King) (Stephen King, Stephen King) (A Time to Kill,) (The Dark Tower: The Gunslinger, The Dark Towe… (It, Pet Sematary) (Misery, Different Seasons) (515,) (224, 400, 348, 385) (1138, 374) (320, 527) (1989,) (1982, 1987, 1980, 1987) (1986, 1983) (1987, 1982) (medium,) (small, medium, medium, medium) (big, medium) (medium, medium)
90’s (Michael Crichton, Michael Crichton, Michael C… (Stephen King, Stephen King, Michael Crichton,… (Stephen King,) None (Airframe, Rising Sun, Disclosure , The Firm, … (The Dark Tower III: The Waste Lands, The Dark… (Bag of bones,) None (352, 385, 597, 432, 387, 496, 434, 414, 347, … (512, 787, 448, 430, 768, 694) (529,) None (1996, 1992, 1994, 1991, 1992, 1994, 1995, 199… (1991, 1998, 1990, 1995, 1998, 1996) (1998,) None (medium, medium, medium, medium, medium, mediu… (medium, big, medium, medium, big, big) (medium,) None
In [226]:
# Dont worry about the ellipsis, the values are all there in each cell, for example
books_df_tuples['author']['Crime']['90\'s']
Out[226]:
('Michael Crichton',
 'Michael Crichton',
 'Michael Crichton',
 'John Grisham',
 'John Grisham',
 'John Grisham',
 'John Grisham',
 'John Grisham',
 'John Grisham',
 'George Pelecanos',
 'George Pelecanos',
 'George Pelecanos')

Notice that for the columns we have a MultiIndex of both the value type (author, name etc) and genre (Crime, Fantasy etc), while for the index we have the decade. So with books_df_tuples['author'] we'll get the author values DataFrame, with books_df_tuples['author']['Crime'] we'll get the Crime column of that DataFrame as a Series and finally with books_df_tuples['author']['Crime']['90\'s'] we'll get the actual value, which is all the author names that have written Crime books in the 90's; authors that have written multiple books will appear multiple times.

What if we wanted to display only the distinct authors for each genre and decade, removing the duplicates:

In [227]:
books_df.pivot_table(
    index=['decade', ], 
    columns=['genre'], 
    values='author',
    aggfunc=lambda x: ', '.join(set(x))
)
Out[227]:
genre Crime Fantasy Horror Thriller
decade
70’s None None Stephen King None
80’s John Grisham Stephen King, Michael Crichton Stephen King Stephen King
90’s John Grisham, Michael Crichton, George Pelecanos Stephen King, George R.R Martin, Michael Crichton Stephen King None

What happens above is that we use the lambda x: ', '.join(set(x)) function to aggregate. This function creates a set (i.e. removes duplicates) from the input (which is the corresponding values for each cell) and then joins the set members using ', '.

Notice that the input parameter that is passed to our aggfunc is actually a Series, so don't be alarmed if some list operations are not working:

In [228]:
books_df.pivot_table(
    index=['decade', ], 
    columns=['genre'], 
    values='author',
    aggfunc=lambda x: type(x)
)
Out[228]:
genre Crime Fantasy Horror Thriller
decade
70’s None None <class 'pandas.core.series.Series'> None
80’s <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
90’s <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> None

Before continuing, I'd like to present another two parameters that can be passed to pivot_table: fill_value defines the value to display when no values are found to be aggregated for a cell, and margins enables extra "All" rows/columns at the right/bottom that aggregate all the values of the corresponding row/column, for example:

In [229]:
books_df.pivot_table(
    index=['decade', ],
    columns=['genre'], 
    values ='author', 
    aggfunc=lambda x: ', '.join(set(x)),
    margins=True, 
    fill_value='-'
)
Out[229]:
genre Crime Fantasy Horror Thriller All
decade
70’s - - Stephen King - Stephen King
80’s John Grisham Stephen King, Michael Crichton Stephen King Stephen King Stephen King, John Grisham, Michael Crichton
90’s John Grisham, Michael Crichton, George Pelecanos Stephen King, George R.R Martin, Michael Crichton Stephen King - Stephen King, George R.R Martin, John Grisham,…
All John Grisham, Michael Crichton, George Pelecanos Stephen King, George R.R Martin, Michael Crichton Stephen King Stephen King Stephen King, George R.R Martin, John Grisham,…

The "All" column and row above aggregate all the values of each row/column (and the All/All cell at the bottom right aggregates all the values).

Using our previous knowledge of multi column indexes, let’s display the average number of pages each author writes for each decade and genre:

In [230]:
books_df.pivot_table(
    index=['decade', ],
    columns=['author', 'genre'], 
    values='pages', 
)
Out[230]:
author George Pelecanos George R.R Martin John Grisham Michael Crichton Stephen King
genre Crime Fantasy Crime Crime Fantasy Fantasy Horror Thriller
decade
70’s NaN NaN NaN NaN NaN NaN 631.0 NaN
80’s NaN NaN 515.000000 NaN 366.5 312.0 756.0 423.5
90’s 268.333333 731.0 418.333333 444.666667 439.0 649.5 529.0 NaN
In [231]:
# One interesting thing is that if we changed the order of the multi-columns we'd get the same data
books_df.pivot_table(
    index=['decade', ],
    columns=['genre', 'author'], 
    values='pages', 
)
Out[231]:
genre Crime Fantasy Horror Thriller
author George Pelecanos John Grisham Michael Crichton George R.R Martin Michael Crichton Stephen King Stephen King Stephen King
decade
70’s NaN NaN NaN NaN NaN NaN 631.0 NaN
80’s NaN 515.000000 NaN NaN 366.5 312.0 756.0 423.5
90’s 268.333333 418.333333 444.666667 731.0 439.0 649.5 529.0 NaN
In [232]:
# Or we can interchange index with columns to get the same data in a horizontal format
books_df.pivot_table(
    columns=['decade', ],
    index=['author', 'genre'], 
    values='pages', 
)
Out[232]:
decade 70’s 80’s 90’s
author genre
George Pelecanos Crime NaN NaN 268.333333
George R.R Martin Fantasy NaN NaN 731.000000
John Grisham Crime NaN 515.0 418.333333
Michael Crichton Crime NaN NaN 444.666667
Fantasy NaN 366.5 439.000000
Stephen King Fantasy NaN 312.0 649.500000
Horror 631.0 756.0 529.000000
Thriller NaN 423.5 NaN

So, Michael Crichton was writing on average 445 pages for Crime novels and 439 pages for Fantasy novels in the 90's (of course this would only be accurate if we had included all the works of Michael Crichton). In the previous table we can see that, for example, for George Pelecanos only the Crime genre is displayed (since he has only Crime genre books in our database). Pandas automatically drops columns/lines where everything is empty (NaN); if for some reason we wanted to display them, we could use the dropna=False parameter:

In [233]:
books_df.pivot_table(
    index=['decade', ], 
    columns=['author', 'genre'], 
    values=['pages'], 
    dropna=False
)
Out[233]:
pages
author George Pelecanos George R.R Martin John Grisham Michael Crichton Stephen King
genre Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller
decade
70’s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 631.0 NaN
80’s NaN NaN NaN NaN NaN NaN NaN NaN 515.000000 NaN NaN NaN NaN 366.5 NaN NaN NaN 312.0 756.0 423.5
90’s 268.333333 NaN NaN NaN NaN 731.0 NaN NaN 418.333333 NaN NaN NaN 444.666667 439.0 NaN NaN NaN 649.5 529.0 NaN
In [234]:
# We can create any combination we want with our multi-index columns, for example let's see where each book belongs
# by decade / year / author and genre / size
books_df.pivot_table(
    index=['decade', 'year', 'author', 'name' ], 
    columns=['size', 'genre'], 
    values='pages', 
    aggfunc=lambda x: 'v',
    fill_value='',
)
Out[234]:
size big medium small
genre Fantasy Horror Crime Fantasy Horror Thriller Crime Fantasy
decade year author name
70’s 1975 Stephen King Salem’s Lot v
1978 Stephen King The Stand v
80’s 1980 Michael Crichton Congo v
1982 Stephen King Different Seasons v
The Dark Tower: The Gunslinger v
1983 Stephen King Pet Sematary v
1986 Stephen King It v
1987 Michael Crichton Sphere v
Stephen King Misery v
The Dark Tower II: The Drawing of the Three v
1989 John Grisham A Time to Kill v
90’s 1990 Michael Crichton Jurassic Park v
1991 John Grisham The Firm v
Stephen King The Dark Tower III: The Waste Lands v
1992 George Pelecanos A Firing Offense v
John Grisham The Pelican Brief v
Michael Crichton Rising Sun v
1993 George Pelecanos Nick’s Trip v
1994 John Grisham The Chamber v
Michael Crichton Disclosure v
1995 John Grisham The Rainmaker v
Michael Crichton The Lost World v
1996 George Pelecanos The Big Blowdown v
George R.R Martin A Game of Thrones v
John Grisham The Runaway Jury v
Michael Crichton Airframe v
1998 George R.R Martin A Clash of Kings v
John Grisham The Street Lawyer v
Stephen King Bag of bones v
The Dark Tower IV: Wizard and Glass v

One more advanced thing I'd like to cover here is that we can define a different aggregate function for each one of our values, by passing a dictionary of value: function pairs to the aggfunc parameter. For example, if we wanted to display

  • the sum of the pages that have been written
  • the range of years for which we have books
  • the names of the authors
  • the name of one book we have

for each genre and each decade, we could do something like this:

In [235]:
def get_range(years):
    return '{0} - {1}'.format(min(years), max(years))

def get_names(authors):
    return ', '.join(set(authors))

def get_book(books):
    # Don't forget that the passed parameter is a Series so we use iloc to index it
    return books.iloc[0]
    
books_df.pivot_table(
    index=['decade',  ], 
    columns=['genre', ], 
    values=['author', 'pages', 'year', 'name'], 
    aggfunc={
        'author': get_names,
        'pages': sum,
        'year': get_range,
        'name': get_book,
    },
    fill_value='-'
)
Out[235]:
year pages name author
genre Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller Crime Fantasy Horror Thriller
decade
70’s - - 1975 - 1978 - - - 1262 - - - The Stand - - - Stephen King -
80’s 1989 - 1989 1980 - 1987 1983 - 1986 1982 - 1987 515 1357 1512 847 A Time to Kill The Dark Tower: The Gunslinger It Misery John Grisham Stephen King, Michael Crichton Stephen King Stephen King
90’s 1991 - 1998 1990 - 1998 1998 - 1998 - 4649 3639 529 - Airframe The Dark Tower III: The Waste Lands Bag of bones - John Grisham, Michael Crichton, George Pelecanos Stephen King, George R.R Martin, Michael Crichton Stephen King -

As we’ve already mentioned, the above is more or less like four different pivot tables — for example we could get a pivot table with only pages if we passed 'pages' as the values and sum as the aggfunc in the above method call.
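
For instance, a minimal sketch of that pages-only variant (same index and columns as in the call above, just with a single value and a single aggregation function) could look like this:

books_df.pivot_table(
    index=['decade', ],
    columns=['genre', ],
    values='pages',
    aggfunc=sum,
)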

Friends of pivot_table

The pivot_table method has some friends — these are functions that operate on a DataFrame and can also do reshaping, but they are not as powerful as pivot_table. Let’s introduce some of them:

In [236]:
# First, I'll create a DataFrame as an example:
df=books_df.pivot_table(index=['decade', ], columns=['genre', 'author'], values='pages', aggfunc=sum, )
# This df has a Multi-index in columns - first level is the genres, second level is the authors
print df.columns
print df.index
df
MultiIndex(levels=[[u'Crime', u'Fantasy', u'Horror', u'Thriller'], [u'George Pelecanos', u'George R.R Martin', u'John Grisham', u'Michael Crichton', u'Stephen King']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 3], [0, 2, 3, 1, 3, 4, 4, 4]],
           names=[u'genre', u'author'])
Index([u'70's', u'80's', u'90's'], dtype='object', name=u'decade')
Out[236]:
genre Crime Fantasy Horror Thriller
author George Pelecanos John Grisham Michael Crichton George R.R Martin Michael Crichton Stephen King Stephen King Stephen King
decade
70’s NaN NaN NaN NaN NaN NaN 1262.0 NaN
80’s NaN 515.0 NaN NaN 733.0 624.0 1512.0 847.0
90’s 805.0 2510.0 1334.0 1462.0 878.0 1299.0 529.0 NaN

Notice above the MultiIndex and Index structs that are used to hold the axis for columns and index.

Stack / unstack

These two operations move columns to indexes and vice-versa. Let’s see what the manual says:

  • stack: Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels.
  • unstack: Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex). The level involved will automatically get sorted.

I must confess that I was not able to comprehend the above! An easier explanation is that:

  • stack will re-arrange the values of the DataFrame so that the innermost column level (the one at the bottom) is converted to the innermost index level (the one at the right)
  • unstack will do exactly the opposite: re-arrange the values of the DataFrame so that the innermost index level (the one at the right) is converted to the innermost column level (the one at the bottom)

Also, the names stack and unstack do not really make sense (at least to me). It would be much easier if stack were named col_to_idx (or col_to_row) and unstack were named idx_to_col (or row_to_col).

Before looking at examples of stack and unstack let’s take a look at the index and columns of our dataframe. Notice again the Index and MultiIndex data structs:

In [237]:
print "Index\n",df.index
print "Column\n",df.columns
Index
Index([u'70's', u'80's', u'90's'], dtype='object', name=u'decade')
Column
MultiIndex(levels=[[u'Crime', u'Fantasy', u'Horror', u'Thriller'], [u'George Pelecanos', u'George R.R Martin', u'John Grisham', u'Michael Crichton', u'Stephen King']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 3], [0, 2, 3, 1, 3, 4, 4, 4]],
           names=[u'genre', u'author'])
In [238]:
stacked = df.stack()
print "Index\n",stacked.index
print "Column\n",stacked.columns
stacked
Index
MultiIndex(levels=[[u'70's', u'80's', u'90's'], [u'George Pelecanos', u'George R.R Martin', u'John Grisham', u'Michael Crichton', u'Stephen King']],
           labels=[[0, 1, 1, 1, 2, 2, 2, 2, 2], [4, 2, 3, 4, 0, 1, 2, 3, 4]],
           names=[u'decade', u'author'])
Column
Index([u'Crime', u'Fantasy', u'Horror', u'Thriller'], dtype='object', name=u'genre')
Out[238]:
genre Crime Fantasy Horror Thriller
decade author
70’s Stephen King NaN NaN 1262.0 NaN
80’s John Grisham 515.0 NaN NaN NaN
Michael Crichton NaN 733.0 NaN NaN
Stephen King NaN 624.0 1512.0 847.0
90’s George Pelecanos 805.0 NaN NaN NaN
George R.R Martin NaN 1462.0 NaN NaN
John Grisham 2510.0 NaN NaN NaN
Michael Crichton 1334.0 878.0 NaN NaN
Stephen King NaN 1299.0 529.0 NaN

We see that the author level (which was the innermost column level) was moved to the right of the index. The row index was converted to a MultiIndex, while the columns are now a simple Index.

In [239]:
# We can of course stack again -- this time we'll get a series (with a three level index) since there are no more columns
stacked2 = stacked.stack()
print stacked2.index
stacked2
MultiIndex(levels=[[u'70's', u'80's', u'90's'], [u'George Pelecanos', u'George R.R Martin', u'John Grisham', u'Michael Crichton', u'Stephen King'], [u'Crime', u'Fantasy', u'Horror', u'Thriller']],
           labels=[[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2], [4, 2, 3, 4, 4, 4, 0, 1, 2, 3, 3, 4, 4], [2, 0, 1, 1, 2, 3, 0, 1, 0, 0, 1, 1, 2]],
           names=[u'decade', u'author', u'genre'])
Out[239]:
decade  author             genre   
70's    Stephen King       Horror      1262.0
80's    John Grisham       Crime        515.0
        Michael Crichton   Fantasy      733.0
        Stephen King       Fantasy      624.0
                           Horror      1512.0
                           Thriller     847.0
90's    George Pelecanos   Crime        805.0
        George R.R Martin  Fantasy     1462.0
        John Grisham       Crime       2510.0
        Michael Crichton   Crime       1334.0
                           Fantasy      878.0
        Stephen King       Fantasy     1299.0
                           Horror       529.0
dtype: float64
In [240]:
# unstack does the opposite operation
unstacked = df.unstack()
print unstacked.index
unstacked
MultiIndex(levels=[[u'Crime', u'Fantasy', u'Horror', u'Thriller'], [u'George Pelecanos', u'George R.R Martin', u'John Grisham', u'Michael Crichton', u'Stephen King'], [u'70's', u'80's', u'90's']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 0, 0, 2, 2, 2, 3, 3, 3, 1, 1, 1, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=[u'genre', u'author', u'decade'])
Out[240]:
genre     author             decade
Crime     George Pelecanos   70's         NaN
                             80's         NaN
                             90's       805.0
          John Grisham       70's         NaN
                             80's       515.0
                             90's      2510.0
          Michael Crichton   70's         NaN
                             80's         NaN
                             90's      1334.0
Fantasy   George R.R Martin  70's         NaN
                             80's         NaN
                             90's      1462.0
          Michael Crichton   70's         NaN
                             80's       733.0
                             90's       878.0
          Stephen King       70's         NaN
                             80's       624.0
                             90's      1299.0
Horror    Stephen King       70's      1262.0
                             80's      1512.0
                             90's       529.0
Thriller  Stephen King       70's         NaN
                             80's       847.0
                             90's         NaN
dtype: float64

We now see that the decade index (which was the only index level) was moved to be the innermost column level — however, since no index levels remain, this also converts the DataFrame to a Series!

One interesting thing to notice is that a Series can only be unstack()ed, since it has no columns (so stack won’t work — remember, stack = col_to_idx).
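
A quick way to confirm this is to check the unstacked Series we created above (a small sketch; in the pandas version used here a Series simply has no stack method at all):

print type(unstacked)               # it really is a Series now
print hasattr(unstacked, 'unstack') # True - we can still pivot index levels to columns
print hasattr(unstacked, 'stack')   # False - there are no columns to move to the index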

In [241]:
# unstack - move the rightmost idx (decade) to columns
unstacked.unstack()
Out[241]:
decade 70’s 80’s 90’s
genre author
Crime George Pelecanos NaN NaN 805.0
John Grisham NaN 515.0 2510.0
Michael Crichton NaN NaN 1334.0
Fantasy George R.R Martin NaN NaN 1462.0
Michael Crichton NaN 733.0 878.0
Stephen King NaN 624.0 1299.0
Horror Stephen King 1262.0 1512.0 529.0
Thriller Stephen King NaN 847.0 NaN
In [243]:
# Also, because `unstack` works on a Series we can use it forever to cycle through different representations
df.unstack()
df.unstack().unstack()
df.unstack().unstack().unstack().unstack().unstack().unstack().unstack().unstack().unstack()
Out[243]:
decade 70’s 80’s 90’s
author George Pelecanos George R.R Martin John Grisham Michael Crichton Stephen King George Pelecanos George R.R Martin John Grisham Michael Crichton Stephen King George Pelecanos George R.R Martin John Grisham Michael Crichton Stephen King
genre
Crime NaN NaN NaN NaN NaN NaN NaN 515.0 NaN NaN 805.0 NaN 2510.0 1334.0 NaN
Fantasy NaN NaN NaN NaN NaN NaN NaN NaN 733.0 624.0 NaN 1462.0 NaN 878.0 1299.0
Horror NaN NaN NaN NaN 1262.0 NaN NaN NaN NaN 1512.0 NaN NaN NaN NaN 529.0
Thriller NaN NaN NaN NaN NaN NaN NaN NaN NaN 847.0 NaN NaN NaN NaN NaN
In [244]:
# One final comment is that stack and unstack can take a level parameter to indicate which
# index/column level we want to pivot
# For example the following will unstack (idx_to_col) the leftmost index level (genre)
unstacked.unstack(level=0)
Out[244]:
genre Crime Fantasy Horror Thriller
author decade
George Pelecanos 70’s NaN NaN NaN NaN
80’s NaN NaN NaN NaN
90’s 805.0 NaN NaN NaN
George R.R Martin 70’s NaN NaN NaN NaN
80’s NaN NaN NaN NaN
90’s NaN 1462.0 NaN NaN
John Grisham 70’s NaN NaN NaN NaN
80’s 515.0 NaN NaN NaN
90’s 2510.0 NaN NaN NaN
Michael Crichton 70’s NaN NaN NaN NaN
80’s NaN 733.0 NaN NaN
90’s 1334.0 878.0 NaN NaN
Stephen King 70’s NaN NaN 1262.0 NaN
80’s NaN 624.0 1512.0 847.0
90’s NaN 1299.0 529.0 NaN

pivot

The pivot command converts the values of a column to an index (or to column labels). It is similar to pivot_table, but it does not aggregate values and does not create hierarchical indexes, so you must be careful that each resulting cell will contain only one value.

In [245]:
# We'll use the initial books_df DataFrame
books_df.pivot(index='name', columns='genre', values='year')
# Notice that we used 'name' as an index (to be sure that each cell will contain a single value)
Out[245]:
genre Crime Fantasy Horror Thriller
name
A Clash of Kings NaN 1998.0 NaN NaN
A Firing Offense 1992.0 NaN NaN NaN
A Game of Thrones NaN 1996.0 NaN NaN
A Time to Kill 1989.0 NaN NaN NaN
Airframe 1996.0 NaN NaN NaN
Bag of bones NaN NaN 1998.0 NaN
Congo NaN 1980.0 NaN NaN
Different Seasons NaN NaN NaN 1982.0
Disclosure 1994.0 NaN NaN NaN
It NaN NaN 1986.0 NaN
Jurassic Park NaN 1990.0 NaN NaN
Misery NaN NaN NaN 1987.0
Nick’s Trip 1993.0 NaN NaN NaN
Pet Sematary NaN NaN 1983.0 NaN
Rising Sun 1992.0 NaN NaN NaN
Salem’s Lot NaN NaN 1975.0 NaN
Sphere NaN 1987.0 NaN NaN
The Big Blowdown 1996.0 NaN NaN NaN
The Chamber 1994.0 NaN NaN NaN
The Dark Tower II: The Drawing of the Three NaN 1987.0 NaN NaN
The Dark Tower III: The Waste Lands NaN 1991.0 NaN NaN
The Dark Tower IV: Wizard and Glass NaN 1998.0 NaN NaN
The Dark Tower: The Gunslinger NaN 1982.0 NaN NaN
The Firm 1991.0 NaN NaN NaN
The Lost World NaN 1995.0 NaN NaN
The Pelican Brief 1992.0 NaN NaN NaN
The Rainmaker 1995.0 NaN NaN NaN
The Runaway Jury 1996.0 NaN NaN NaN
The Stand NaN NaN 1978.0 NaN
The Street Lawyer 1998.0 NaN NaN NaN
In [246]:
# We could pivot by using name as a column
books_df.pivot(index='decade', columns='name', values='pages')
Out[246]:
name A Clash of Kings A Firing Offense A Game of Thrones A Time to Kill Airframe Bag of bones Congo Different Seasons Disclosure It The Dark Tower III: The Waste Lands The Dark Tower IV: Wizard and Glass The Dark Tower: The Gunslinger The Firm The Lost World The Pelican Brief The Rainmaker The Runaway Jury The Stand The Street Lawyer
decade
70’s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 823.0 NaN
80’s NaN NaN NaN 515.0 NaN NaN 348.0 527.0 NaN 1138.0 NaN NaN 224.0 NaN NaN NaN NaN NaN NaN NaN
90’s 768.0 216.0 694.0 NaN 352.0 529.0 NaN NaN 597.0 NaN 512.0 787.0 NaN 432.0 430.0 387.0 434.0 414.0 NaN 347.0

3 rows × 30 columns

What happens is that we take the values of one column, convert them to columns (or to the index) and use another column’s values as the values of the new DataFrame. So, in the first example the values of the genre column were converted to columns and inside each cell we put the publication year. In the second example we instead converted the decade values to the index (and the book names to columns) and put the number of pages inside each cell. In both cases we used the name of the book to be sure that each cell would contain a single value (remember that pivot cannot aggregate).

The pivot command is not very useful (at least to me) since it does not actually modify the data (by aggregating) but just changes its representation - you won’t get any new information from pivot, you’ll only display the existing information differently. Also keep in mind that each cell after the pivoting must contain exactly one value; for example, the above wouldn’t work if we used author as the columns (instead of the book name).
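
To make this concrete, here is a small sketch of what happens if we try exactly that — since, for example, John Grisham has more than one book in the 90’s, a (decade, author) cell cannot hold a single value and pandas refuses to reshape (the exact wording of the error may differ between pandas versions):

try:
    books_df.pivot(index='decade', columns='author', values='pages')
except ValueError as e:
    print e  # something like: Index contains duplicate entries, cannot reshape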

groupby

The final method we’ll talk about that is related to pivot_table is groupby. This of course corresponds to SQL’s GROUP BY and should be easy to comprehend. groupby takes a parameter that defines how to group the entries and returns a GroupBy object that contains the groups. The GroupBy object can be enumerated to get the groups and their data. It’s interesting to take a look at the structure of each such object:

In [247]:
groupby_object = books_df.groupby(['decade', 'author'])
print type(groupby_object)
# Let's see what groupby-object contains
for x in groupby_object:
    print "type: ", type(x), "len: ", len(x) #len(x), type(x[0]), type(x[1]), x[0]
    print "first element of tuple", type(x[0]), x[0]
    print "second element of tuple", type(x[1])
    

<class 'pandas.core.groupby.DataFrameGroupBy'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("70's", 'Stephen King')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("80's", 'John Grisham')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("80's", 'Michael Crichton')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("80's", 'Stephen King')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("90's", 'George Pelecanos')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("90's", 'George R.R Martin')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("90's", 'John Grisham')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("90's", 'Michael Crichton')
second element of tuple <class 'pandas.core.frame.DataFrame'>
type:  <type 'tuple'> len:  2
first element of tuple <type 'tuple'> ("90's", 'Stephen King')
second element of tuple <class 'pandas.core.frame.DataFrame'>

So, from the above we can see that the GroupBy object contains a number of 2-element tuples. Each tuple contains (another tuple with) the values of the columns that were used for grouping, along with the actual data of that group (as a DataFrame). Now, we could either use the enumeration I showed above to operate on each group or, better, use some of the methods that the GroupBy object contains — which is what we’ll do in the next cell.
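
Before that, here is a quick sketch of grabbing a single group without enumerating everything — get_group returns the DataFrame of the group identified by the given key tuple:

# All the 90's Stephen King books as a plain DataFrame
groupby_object.get_group(("90's", 'Stephen King'))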

In [248]:
# get some statistics
print groupby_object.mean()
print groupby_object.sum()

# We can use the aggregate method to do anything we want
# Each aggregate function will get a Series with the values (similar to pivot_table)
def year_aggr(x):
    return '{0}-{1}'.format(max(x), min(x))

def genre_aggr(x):
    return ', '.join(set(x))

groupby_object.aggregate({'year':year_aggr, 'pages': sum, 'genre':genre_aggr})
                               pages         year
decade author                                    
70's   Stephen King       631.000000  1976.500000
80's   John Grisham       515.000000  1989.000000
       Michael Crichton   366.500000  1983.500000
       Stephen King       497.166667  1984.500000
90's   George Pelecanos   268.333333  1993.666667
       George R.R Martin  731.000000  1997.000000
       John Grisham       418.333333  1994.333333
       Michael Crichton   442.400000  1993.400000
       Stephen King       609.333333  1995.666667
                          pages   year
decade author                         
70's   Stephen King        1262   3953
80's   John Grisham         515   1989
       Michael Crichton     733   3967
       Stephen King        2983  11907
90's   George Pelecanos     805   5981
       George R.R Martin   1462   3994
       John Grisham        2510  11966
       Michael Crichton    2212   9967
       Stephen King        1828   5987
Out[248]:
genre pages year
decade author
70’s Stephen King Horror 1262 1978-1975
80’s John Grisham Crime 515 1989-1989
Michael Crichton Fantasy 733 1987-1980
Stephen King Fantasy, Horror, Thriller 2983 1987-1982
90’s George Pelecanos Crime 805 1996-1992
George R.R Martin Fantasy 1462 1998-1996
John Grisham Crime 2510 1998-1991
Michael Crichton Fantasy, Crime 2212 1996-1990
Stephen King Fantasy, Horror 1828 1998-1991
In [249]:
# It's interesting to notice that the previous is exactly
# the same as what we can do with this pivot_table command
books_df.pivot_table(index=['decade', 'author'], values=['genre', 'pages', 'year'], aggfunc={
    'genre': genre_aggr,
    'year': year_aggr,
    'pages': sum,
})
Out[249]:
genre pages year
decade author
70’s Stephen King Horror 1262 1978-1975
80’s John Grisham Crime 515 1989-1989
Michael Crichton Fantasy 733 1987-1980
Stephen King Fantasy, Horror, Thriller 2983 1987-1982
90’s George Pelecanos Crime 805 1996-1992
George R.R Martin Fantasy 1462 1998-1996
John Grisham Crime 2510 1998-1991
Michael Crichton Fantasy, Crime 2212 1996-1990
Stephen King Fantasy, Horror 1828 1998-1991
In [250]:
# The added value of pivot_table over groupby of course is that we could instead do 
books_df.pivot_table(columns=['decade'], index=['author'], values=['genre', 'pages', 'year'], aggfunc={
        'genre': genre_aggr,
        'year': year_aggr,
        'pages': sum,
    })
# or any other combination of columns - index we wanted
Out[250]:
genre pages year
decade 70’s 80’s 90’s 70’s 80’s 90’s 70’s 80’s 90’s
author
George Pelecanos None None Crime None None 805 None None 1996-1992
George R.R Martin None None Fantasy None None 1462 None None 1998-1996
John Grisham None Crime Crime None 515 2510 None 1989-1989 1998-1991
Michael Crichton None Fantasy Fantasy, Crime None 733 2212 None 1987-1980 1996-1990
Stephen King Horror Fantasy, Horror, Thriller Fantasy, Horror 1262 2983 1828 1978-1975 1987-1982 1998-1991

A real-world example

To continue with a real-world example, I will use the MovieLens 100k dataset to demonstrate some pivot_table (and friends) operations. To load the data I’ve used the code provided by the three-part series I mentioned earlier. Notice that this won’t load the genre of the movie (left as an exercise to the reader).

In [251]:
# Useful to display graphs inline
%matplotlib inline
In [252]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

path = 'C:/Users/serafeim/Downloads/ml-100k' # Change this to your own directory
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(os.path.join(path, 'u.user'), sep='|', names=u_cols, encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(os.path.join(path, 'u.data'), sep='\t', names=r_cols, encoding='latin-1')
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv(os.path.join(path, 'u.item'), sep='|', names=m_cols, usecols=range(5), encoding='latin-1')
movie_ratings = movies.merge(ratings)
lens = movie_ratings.merge(users)
lens.head()
Out[252]:
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp age sex occupation zip_code
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2… 308 4 887736532 60 M retired 95076
1 4 Get Shorty (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Get%20Shorty%… 308 5 887737890 60 M retired 95076
2 5 Copycat (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Copycat%20(1995) 308 4 887739608 60 M retired 95076
3 7 Twelve Monkeys (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Twelve%20Monk… 308 4 887738847 60 M retired 95076
4 8 Babe (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Babe%20(1995) 308 5 887736696 60 M retired 95076

As we can see, we are using the merge method of DataFrame to do an SQL-style join, first between movies and ratings and then between movie_ratings and users - this results in a fat DataFrame with all the info of the movie and the user for each review. The head method displays the first 5 rows of the DataFrame.
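
Since merge joins on the commonly-named columns by default, the two joins above are equivalent to spelling the keys out explicitly (a sketch of the same calls with explicit on= parameters):

movie_ratings = movies.merge(ratings, on='movie_id')
lens = movie_ratings.merge(users, on='user_id')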

We can see that there’s a zip_code column - I wanted to convert it to the corresponding US state. There’s a service at ziptasticapi.com that can be used for that, but you need to do the queries one by one! I’ve executed the queries once and created a zip-to-state dict to be used instead:

In [253]:
# The following can be used to find out the state by zip_code for each row
API="http://ziptasticapi.com/{0}"
states = {}
import urllib2, json
def get_state(s):
    global states
    if states.get(s):
        return states.get(s)
    headers = { 'User-Agent' : 'Mozilla/5.0' }
    req = urllib2.Request(API.format(s), None, headers)    
    state = json.loads(urllib2.urlopen(req).read()).get('state')
    states[s] = state
    return state

#using this command we can add the state column
#lens['state']=lens['zip_code'].apply(get_state)
In [254]:
# However, since we shouldn't call the ziptastic API so many times, I'll provide here the
# dict of zip:state (that I actually got through the previous command)
states2={u'73013': u'OK', u'77042': u'TX', u'61455': u'IL', u'55345': u'MN', u'19711': u'DE', 
u'19716': u'DE', u'55343': u'MN', u'15203': u'PA', u'48446': u'MI', u'92093': u'CA', 
u'92653': u'CA', u'61073': u'IL', u'55346': u'MN', u'32303': u'FL', u'32301': u'FL', 
u'32712': u'FL', u'06437': u'CT', u'01581': u'MA', u'85719': u'AZ', u'12065': u'NY', 
u'10960': u'NY', u'32789': u'FL', u'01375': u'MA', u'60135': u'IL', u'98501': u'WA', 
u'95521': u'CA', u'49512': u'MI', u'02215': u'MA', u'80209': u'CO', u'97330': u'OR', 
u'98006': u'WA', u'52302': u'IA', u'60187': u'IL', u'46005': None, u'46260': u'IN', 
u'63021': u'MO', u'17036': u'PA', u'99206': u'WA', u'10707': u'NY', u'75206': u'TX', 
u'21208': u'MD', u'75204': u'TX', u'60007': u'IL', u'60005': u'IL', u'22902': u'VA', 
u'21201': u'MD', u'21206': u'MD', u'22906': u'VA', u'45680': u'OH', u'94025': u'CA', 
u'53144': u'WI', u'05001': u'VT', u'97208': u'OR', u'54494': u'WI', u'90008': u'CA', 
u'45660': u'OH', u'53214': u'WI', u'53210': u'WI', u'53211': u'WI', u'37411': u'TN', 
u'37412': u'TN', u'63119': u'MO', u'10025': u'NY', u'10022': u'NY', u'10021': u'NY', 
u'44648': u'OH', u'60641': u'IL', u'78213': u'TX', u'78212': u'TX', u'22973': u'VA', 
u'96819': u'HI', u'42647': u'KY', u'62901': u'IL', u'62903': u'IL', u'90095': u'CA', 
u'04102': u'ME', u'14627': u'NY', u'20006': u'DC', u'70808': u'LA', u'20003': u'DC', 
u'20001': u'DC', u'70802': u'LA', u'05452': u'VT', u'20009': u'DC', u'20008': u'DC', 
u'08610': u'NJ', u'33775': u'FL', u'30329': u'GA', u'76013': u'TX', u'84408': u'UT', 
u'11758': u'NY', u'95014': u'CA', u'08052': u'NJ', u'37777': u'TN', u'37771': u'TN', 
u'76309': u'TX', u'23509': u'VA', u'50311': u'IA', u'33884': u'FL', u'30803': u'GA', 
u'42459': u'KY', u'95064': u'CA', u'02859': u'RI', u'68504': u'NE', u'40243': u'KY', 
u'68503': u'NE', u'02918': u'RI', u'34656': u'FL', u'L1V3W': None, u'22003': u'VA', 
u'55113': u'MN', u'55117': u'MN', u'55116': u'MN', u'23112': u'VA', u'91201': u'CA', 
u'91206': u'CA', u'06927': u'CT', u'55337': u'MN', u'02136': u'MA', u'11577': u'NY', 
u'47130': u'IN', u'02139': u'MA', u'02138': u'MA', u'N2L5N': None, u'15217': u'PA', 
u'15213': u'PA', u'50670': u'IA', u'04988': u'ME', u'19382': u'PA', u'87501': u'NM', 
u'55454': u'MN', u'19149': u'PA', u'19146': u'PA', u'55021': u'MN', u'V1G4L': None, 
u'06405': u'CT', u'73071': u'OK', u'77459': u'TX', u'92037': u'CA', u'60089': u'IL', 
u'64118': u'MO', u'21114': u'MD', u'98101': u'WA', u'98103': u'WA', u'98102': u'WA', 
u'02341': u'MA', u'94306': u'CA', u'94305': u'CA', u'85233': u'AZ', u'11753': u'NY', 
u'90814': u'CA', u'14534': u'NY', u'98072': u'WA', u'16803': u'PA', u'43512': u'OH', 
u'10309': u'NY', u'95468': u'CA', u'60402': u'IL', u'60152': u'IL', u'75218': u'TX', 
u'98199': u'WA', u'12603': u'NY', u'90254': u'CA', u'84116': u'UT', u'16801': u'PA', 
u'41850': None, u'97214': u'OR', u'97215': u'OR', u'97212': u'OR', u'10019': u'NY', 
u'10018': u'NY', u'49705': u'MI', u'10011': u'NY', u'10010': u'NY', u'10016': u'NY', 
u'13210': u'NY', u'78209': u'TX', u'60659': u'IL', u'01754': u'MA', u'60657': u'IL', 
u'70124': u'LA', u'12345': u'NY', u'95161': u'CA', u'20015': u'DC', u'94708': u'CA', 
u'58202': u'ND', u'29379': u'SC', u'94703': u'CA', u'94702': u'CA', u'68767': u'NE', 
u'24060': u'VA', u'33763': u'FL', u'33765': u'FL', u'54248': None, u'80303': u'CO', 
u'03062': u'NH', u'03060': u'NH', u'18301': u'PA', u'08403': u'NJ', u'94551': u'CA', 
u'48043': u'MI', u'28450': u'NC', u'78264': u'TX', u'63304': u'MO', u'06333': u'CT', 
u'08105': u'NJ', u'07102': u'NJ', u'18015': u'PA', u'11231': u'NY', u'27606': u'NC', 
u'38115': u'TN', u'95076': u'CA', u'77845': u'TX', u'77841': u'TX', u'14476': u'NY', 
u'08360': u'NJ', u'02903': u'RI', u'01945': u'MA', u'40256': u'KY', u'91919': None, 
u'89801': u'NV', u'48825': u'MI', u'48823': u'MI', u'07204': u'NJ', u'92154': u'CA', 
u'55106': u'MN', u'55107': u'MN', u'55104': u'MN', u'55105': u'MN', u'55108': u'MN', 
u'55109': u'MN', u'61755': u'IL', u'91351': u'CA', u'Y1A6B': None, u'91606': u'CA', 
u'28734': u'NC', u'55320': u'MN', u'78205': u'TX', u'11201': u'NY', u'01824': u'MA', 
u'47024': u'IN', u'43212': u'OH', u'43215': u'OH', u'02125': u'MA', u'08816': u'NJ', 
u'15222': u'PA', u'M7A1A': None, u'97520': u'OR', u'76234': u'TX', u'55420': u'MN', 
u'55423': u'MN', u'55422': u'MN', u'55038': u'MN', u'55428': u'MN', u'94560': u'CA', 
u'T8H1N': None, u'16125': u'PA', u'02154': None, u'R3T5K': None, u'35802': u'AL', 
u'97006': u'OR', u'02159': None, u'32250': u'FL', u'50613': u'IA', u'92020': u'CA', u'60804': u'IL', 
u'21044': u'MD', u'98117': u'WA', u'E2A4H': None, u'90804': u'CA', u'74101': u'OK', 
u'22903': u'VA', u'22904': u'VA', u'52245': u'IA', u'52246': u'IA', u'52241': u'IA', 
u'17331': u'PA', u'20723': u'MD', u'63044': u'MO', u'17110': u'PA', u'10314': u'NY', 
u'32605': u'FL', u'60067': u'IL', u'90247': u'CA', u'61820': u'IL', u'84103': u'UT', 
u'84105': u'UT', u'84107': u'UT', u'60090': u'IL', u'99835': u'AK', u'98281': u'WA', 
u'05201': u'VT', u'10003': u'NY', u'20090': u'DC', u'90064': u'CA', u'01040': u'MA', 
u'21250': u'MD', u'20657': u'MD', u'97203': u'OR', u'60466': u'IL', u'42141': u'KY', 
u'44134': u'OH', u'78390': u'TX', u'44133': u'OH', u'83686': u'ID', u'14085': u'NY', 
u'45810': u'OH', u'75006': u'TX', u'63146': u'MO', u'91335': u'CA', u'39762': u'MS', 
u'80302': u'CO', u'44224': u'OH', u'37076': u'TN', u'33755': u'FL', u'54901': u'WI', 
u'03052': u'NH', u'30220': u'GA', u'94403': u'CA', u'91040': u'CA', u'29464': u'SC', 
u'49931': u'MI', u'49938': u'MI', u'71457': u'LA', u'03755': u'NH', u'78739': u'TX', 
u'77048': u'TX', u'30040': u'GA', u'11101': u'NY', u'83702': u'ID', u'31211': u'GA', 
u'83709': u'ID', u'34105': u'FL', u'76201': u'TX', u'91903': u'CA', u'01913': u'MA', 
u'31404': u'GA', u'27705': u'NC', u'27708': u'NC', u'92121': u'CA', u'29631': u'SC', 
u'28480': u'NC', u'94583': u'CA', u'V0R2H': None, u'91344': u'CA', u'28806': u'NC', 
u'95821': u'CA', u'95823': u'CA', u'43202': u'OH', u'11211': u'NY', u'11217': u'NY', 
u'43204': u'OH', u'77504': u'TX', u'01960': u'MA', u'82435': u'WY', u'19047': u'PA', 
u'15235': u'PA', u'15237': u'PA', u'15232': u'PA', u'92629': u'CA', u'55436': u'MN', 
u'50112': u'IA', u'55439': u'MN', u'92626': u'CA', u'22030': u'VA', u'98682': u'WA', 
u'65203': u'MO', u'19341': u'PA', u'77005': u'TX', u'77009': u'TX', u'77008': u'TX', 
u'02146': None, u'02143': u'MA', u'56567': u'MN', u'93055': None, u'27249': u'NC', 
u'06492': u'CT', u'93117': u'CA', u'20064': u'DC', u'64131': u'MO', u'17604': u'PA', u'94086': u'CA', 
u'01915': u'MA', u'02320': None, u'01810': u'MA', u'02324': u'MA', u'06260': u'CT', 
u'32067': u'FL', u'78155': u'TX', u'43537': u'OH', u'94131': u'CA', u'90405': u'CA', 
u'85210': u'AZ', u'17325': u'PA', u'53188': u'WI', u'98225': u'WA', u'53066': u'WI', 
u'95403': u'CA', u'32114': u'FL', u'01602': u'MA', u'02176': u'MA', u'85281': u'AZ', 
u'85282': u'AZ', u'22911': u'VA', u'53115': u'WI', u'20817': u'MD', u'97405': u'OR', 
u'90291': u'CA', u'97403': u'OR', u'54467': u'WI', u'97408': u'OR', u'68106': u'NE', 
u'25652': u'WV', u'60476': u'IL', u'75240': u'TX', u'12205': u'NY', u'14853': u'NY', 
u'V0R2M': None, u'14850': u'NY', u'05464': u'VT', u'20910': u'MD', u'85710': u'AZ', 
u'85711': u'AZ', u'44124': u'OH', u'44691': u'OH', u'48118': u'MI', u'29210': u'SC', 
u'49428': u'MI', u'95628': u'CA', u'75013': u'TX', u'94612': u'CA', u'94619': u'CA', 
u'94618': u'CA', u'84302': u'UT', u'97232': u'OR', u'59801': u'MT', u'36106': u'AL', 
u'15610': u'PA', u'33556': u'FL', u'14211': u'NY', u'V5A2B': None, u'14216': u'NY', 
u'99709': u'AK', u'37212': u'TN', u'40206': u'KY', u'40205': u'KY', u'78741': u'TX', 
u'78628': u'TX', u'78746': u'TX', u'50266': u'IA', u'95050': u'CA', u'66315': None, 
u'77904': u'TX', u'30350': u'GA', u'30606': u'GA', u'83716': u'ID', u'79508': u'TX', 
u'67401': u'KS', u'27713': u'NC', u'23092': u'VA', u'55128': u'MN', u'80227': u'CO', 
u'55122': u'MN', u'92507': u'CA', u'55125': u'MN', u'97124': u'OR', u'43201': u'OH', 
u'94591': u'CA', u'66221': u'KS', u'30078': u'GA', u'66046': u'KS', u'28814': u'NC', 
u'23322': u'VA', u'77081': u'TX', u'89503': u'NV', u'77840': u'TX', u'07310': u'NJ', 
u'01970': u'MA', u'55013': u'MN', u'08832': u'NJ', u'22306': u'VA', u'55409': u'MN', 
u'55408': u'MN', u'55406': u'MN', u'73162': u'OK', u'73439': u'OK', u'06472': u'CT', 
u'93101': u'CA', u'93063': u'CA', u'93109': u'CA', u'55303': u'MN', u'96349': u'AP', 
u'55305': u'MN', u'07733': u'NJ', u'98038': u'WA', u'98034': u'WA', u'17345': u'PA', 
u'43017': u'OH', u'31820': u'GA', u'06371': u'CT', u'85202': u'AZ', u'93402': u'CA', 
u'51157': None, u'38866': u'MS', u'20879': u'MD', u'95453': u'CA', u'53171': u'WI', 
u'98257': u'WA', u'20707': u'MD', u'19807': u'DE', u'93711': u'CA', u'61801': u'IL', 
u'98133': u'WA', u'97365': u'OR', u'05779': None, u'46538': u'IN', u'01701': u'MA', 
u'22932': u'VA', u'75094': u'TX', u'20770': u'MD', u'70116': u'LA', u'60440': u'IL', 
u'94117': u'CA', u'94115': u'CA', u'80538': u'CO', u'48322': u'MI', u'45243': u'OH', 
u'80027': u'CO', u'90210': u'CA', u'85016': u'AZ', u'90034': u'CA', u'48103': u'MI', 
u'20902': u'MD', u'48105': u'MI', u'59717': u'MT', u'29201': u'SC', u'29206': u'SC', 
u'29205': u'SC', u'63129': u'MO', u'60614': u'IL', u'60615': u'IL', u'60613': u'IL', 
u'20685': u'MD', u'94608': u'CA', u'48197': u'MI', u'95110': u'CA', u'36117': u'AL', 
u'08534': u'NJ', u'37901': u'TN', u'91105': u'CA', u'48076': u'MI', u'95129': u'CA', 
u'29440': u'SC', u'20057': u'DC', u'95123': u'CA', u'01080': u'MA', u'33319': u'FL', 
u'07030': u'NJ', u'71701': u'AR', u'07039': u'NJ', u'78750': u'TX', u'78756': u'TX', 
u'17870': u'PA', u'27502': u'NC', u'02140': u'MA', u'99603': u'AK', u'97229': u'OR', 
u'76059': u'TX', u'V3N4P': None, u'01331': u'MA', u'30011': u'GA', u'N4T1A': None, 
u'40515': u'KY', u'48911': u'MI', u'50325': u'IA', u'92103': u'CA', u'50322': u'IA', u'30067': u'GA', 
u'30068': u'GA', u'06811': u'CT', u'77801': u'TX', u'L9G2B': None, u'22202': u'VA', 
u'22207': u'VA', u'22206': u'VA', u'11238': u'NY', u'K7L5J': None, u'89104': u'NV', 
u'87544': u'NM', u'43221': u'OH', u'01940': u'MA', u'23237': u'VA', u'19102': u'PA', 
u'19104': u'PA', u'95938': u'CA', u'55412': u'MN', u'55413': u'MN', u'55414': u'MN', 
u'55417': u'MN', u'16509': u'PA', u'16506': u'PA', u'60302': u'IL', u'55369': u'MN', 
u'90036': u'CA', u'73034': u'OK', u'27105': u'NC', u'61401': u'IL', u'19422': u'PA', 
u'39042': u'MS', u'98121': u'WA', u'98027': u'WA', u'21012': u'MD', u'21010': u'MD', 
u'93612': u'CA', u'96754': u'HI', u'06365': u'CT', u'93550': u'CA', u'93555': u'CA', 
u'93003': u'CA', u'60115': u'IL', u'45439': u'OH', u'80228': u'CO', u'64153': u'MO', 
u'91505': u'CA', u'81648': u'CO', u'49036': u'MI', u'12180': u'NY', u'11701': u'NY', 
u'53715': u'WI', u'53711': u'WI', u'53713': u'WI', u'90630': u'CA', u'94040': u'CA', 
u'94043': u'CA', u'05146': u'VT', u'80521': u'CO', u'80526': u'CO', u'80525': u'CO', 
u'44265': u'OH', u'09645': u'AE', u'33716': u'FL', u'60626': u'IL', u'63132': u'MO', 
u'63130': u'MO', u'01002': u'MA', u'98801': u'WA', u'83814': u'ID', u'90703': u'CA', 
u'42101': u'KY', u'44212': u'OH', u'54302': u'WI', u'44405': u'OH', u'84010': u'UT', 
u'80913': u'CO', u'80919': u'CO', u'37235': u'TN', u'29301': u'SC', u'44106': u'OH', 
u'84604': u'UT', u'84601': u'UT', u'33308': u'FL', u'37725': u'TN', u'18053': u'PA', 
u'03261': u'NH', u'07029': u'NJ', u'78602': u'TX', u'95032': u'CA', u'99687': u'AK', 
u'28018': u'NC', u'27514': u'NC', u'27510': u'NC', u'27511': u'NC', u'17961': u'PA', 
u'08034': u'NJ', u'30002': u'GA', u'12866': u'NY', u'92688': u'CA', u'15017': u'PA', 
u'40504': u'KY', u'40503': u'KY', u'92110': u'CA', u'92113': u'CA', u'92115': u'CA', 
u'29646': u'SC', u'30093': u'GA', u'73132': u'OK', u'33484': u'FL', u'06779': u'CT', 
u'23226': u'VA', u'23227': u'VA', u'10522': u'NY', u'06906': u'CT', u'61462': u'IL', 
u'74078': u'OK', u'74075': u'OK', u'06518': u'CT', u'77073': u'TX', u'06513': u'CT', 
u'06512': u'CT', u'43085': u'OH', u'51250': u'IA', u'92064': u'CA', u'98620': u'WA', 
u'02110': u'MA', u'02113': u'MA', u'92660': u'CA', u'92705': u'CA', u'60201': u'IL', 
u'60202': u'IL', u'32707': u'FL', u'56321': u'MN', u'06355': u'CT', u'11787': u'NY', 
u'90840': u'CA', u'06059': u'CT', u'97007': u'OR', u'11727': u'NY', u'63645': u'MO', 
u'49508': u'MI', u'21911': u'MD', u'60515': u'IL', u'80236': u'CO', u'26241': u'WV', 
u'85258': u'AZ', u'98405': u'WA', u'97302': u'OR', u'85251': u'AZ', u'97301': u'OR', 
u'46032': u'IN', u'62522': u'IL', u'45218': u'OH', u'53706': u'WI', u'53705': u'WI', 
u'76111': u'TX', u'53703': u'WI', u'21218': u'MD', u'60035': u'IL', u'57197': u'SD', 
u'63033': u'MO', u'20755': u'MD', u'01720': u'MA', u'85032': u'AZ', u'75230': u'TX', 
u'38401': u'TN', u'20854': u'MD', u'03869': u'NH', u'20850': u'MD', u'80127': u'CO', 
u'80123': u'CO', u'53202': u'WI', u'90016': u'CA', u'58644': u'ND', u'90019': u'CA', 
u'63108': u'MO', u'12550': u'NY', u'94143': u'CA', u'33066': u'FL', u'68147': u'NE', 
u'60630': u'IL', u'20784': u'MD', u'70403': u'LA', u'33205': None, u'94720': u'CA', 
u'95662': u'CA', u'95660': u'CA', u'18505': u'PA', u'44074': u'OH', u'M4J2K': None, 
u'44092': u'OH', u'94533': u'CA', u'50233': u'IA', u'79070': u'TX', u'94920': u'CA', 
u'08043': u'NJ', u'47401': u'IN', u'30033': u'GA', u'78704': u'TX', u'30030': u'GA', 
u'66215': u'KS', u'21227': u'MD', u'95316': u'CA', u'29678': u'SC', u'60008': u'IL', 
u'47905': u'IN', u'91711': u'CA', u'47906': u'IN', u'55443': u'MN', u'77380': u'TX', 
u'13820': u'NY', u'26506': u'WV', u'31909': u'GA', u'08904': u'NJ', u'92374': u'CA', 
u'00000': None, u'E2E3R': None} 

# And use the dict to initialize lens['state']:
lens['state']=lens['zip_code'].apply(lambda x: states2.get(x))

Beyond the state, I’d like to add some other columns describing the data and drop a bunch of columns we don’t need:

In [255]:
import datetime

# Let's also add the release_year, decade and review_day columns
lens['release_year']=lens['release_date'].apply(lambda x: str(x).split('-')[2] if len(str(x).split('-'))>2 else 0)
lens['decade']=lens['release_year'].apply(lambda x: str(x)[2:3]+"0's" if x else "")
lens['review_day']=lens['unix_timestamp'].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%A'))

# And remove some non-needed stuff
final_lens = lens.drop(['release_date','zip_code', 'unix_timestamp', 'video_release_date', 'imdb_url', 'movie_id', 'user_id'], 1)

# Also add an idx column
final_lens['idx'] = final_lens.index

final_lens.head()
Out[255]:
title rating age sex occupation state release_year decade review_day idx
0 Toy Story (1995) 4 60 M retired CA 1995 90’s Tuesday 0
1 Get Shorty (1995) 5 60 M retired CA 1995 90’s Tuesday 1
2 Copycat (1995) 4 60 M retired CA 1995 90’s Tuesday 2
3 Twelve Monkeys (1995) 4 60 M retired CA 1995 90’s Tuesday 3
4 Babe (1995) 5 60 M retired CA 1995 90’s Tuesday 4

So, after the previous (I hope easy) modifications we have a DataFrame that contains useful info about reviews of movies. Each line of the dataframe contains the following 9 columns of data:

  • Title of movie
  • Rating it got from this review
  • Age of the reviewer
  • Sex of the reviewer
  • Occupation of the reviewer
  • State (US) of the reviewer
  • Release year of the movie
  • Decade the movie was released
  • Day of the review

Let’s take a peek at which decade’s movies were preferred by reviewers, by sex. First of all, we’ll do a pivot table to aggregate the values of the rating and idx columns by sex and decade. For rating we’ll take the average of the reviews for each decade/sex, while for idx we’ll use len (just to count the number of reviews):

In [256]:
prefered_decade_by_sex = final_lens.pivot_table(
    columns=['sex', ],
    index=[ 'decade'],
    values=['rating', 'idx'], 
    aggfunc={'rating': np.average, 'idx': len},
    fill_value=0,
)
print prefered_decade_by_sex
# We see that there are a bunch of movies without a decade; we'll drop them using the drop method
# Notice that I pass '' to drop to remove the row with the empty decade index
prefered_decade_by_sex = prefered_decade_by_sex.drop('')
prefered_decade_by_sex
          rating              idx       
sex            F         M      F      M
decade                                  
        3.500000  3.428571      2      7
20's    2.857143  3.632653      7     49
30's    3.961340  3.912455    388   1108
40's    3.952641  4.029412    549   1700
50's    3.835006  3.972403    897   2609
60's    3.852321  3.891015    948   2927
70's    3.754007  3.899522   1435   4807
80's    3.700214  3.764700   2802   9320
90's    3.437366  3.384938  18712  51733
Out[256]:
rating idx
sex F M F M
decade
20’s 2.857143 3.632653 7 49
30’s 3.961340 3.912455 388 1108
40’s 3.952641 4.029412 549 1700
50’s 3.835006 3.972403 897 2609
60’s 3.852321 3.891015 948 2927
70’s 3.754007 3.899522 1435 4807
80’s 3.700214 3.764700 2802 9320
90’s 3.437366 3.384938 18712 51733

Now we have the rating and review count for both men and women. However, I’d also like to get their combined values (average rating and total number of reviews). There are two ways this can be done: first, we can create another DataFrame that does not separate by sex:

In [257]:
prefered_decade = final_lens.pivot_table(
    index=[ 'decade'],
    values=['rating', 'idx'], 
    aggfunc={'rating': np.average, 'idx': len},
    fill_value=0,
)
# Drop non needed index
prefered_decade = prefered_decade.drop('')
prefered_decade
Out[257]:
idx rating
decade
20’s 56 3.535714
30’s 1496 3.925134
40’s 2249 4.010671
50’s 3506 3.937250
60’s 3875 3.881548
70’s 6242 3.866069
80’s 12122 3.749794
90’s 70445 3.398864

Now, these two DataFrames can be easily combined because they have the same index. I’ll put the values from the total DataFrame under a second-level ‘total’ column of the separated (by sex) DataFrame - notice how the column multi-index is used for the assignment:

In [258]:
prefered_decade_by_sex['rating', 'total'] = prefered_decade['rating']
prefered_decade_by_sex['idx', 'total'] = prefered_decade['idx']
prefered_decade_by_sex
Out[258]:
rating idx rating idx
sex F M F M total total
decade
20’s 2.857143 3.632653 7 49 3.535714 56
30’s 3.961340 3.912455 388 1108 3.925134 1496
40’s 3.952641 4.029412 549 1700 4.010671 2249
50’s 3.835006 3.972403 897 2609 3.937250 3506
60’s 3.852321 3.891015 948 2927 3.881548 3875
70’s 3.754007 3.899522 1435 4807 3.866069 6242
80’s 3.700214 3.764700 2802 9320 3.749794 12122
90’s 3.437366 3.384938 18712 51733 3.398864 70445

Now, the previous would be almost perfect, but I would really prefer all the sub-columns of each first-level column (rating and idx) to be grouped together. This can be done by using sort_index — the axis=1 parameter sorts the columns (otherwise the index would be sorted):

In [259]:
prefered_decade_by_sex = prefered_decade_by_sex.sort_index(axis=1)
prefered_decade_by_sex
Out[259]:
idx rating
sex F M total F M total
decade
20’s 7 49 56 2.857143 3.632653 3.535714
30’s 388 1108 1496 3.961340 3.912455 3.925134
40’s 549 1700 2249 3.952641 4.029412 4.010671
50’s 897 2609 3506 3.835006 3.972403 3.937250
60’s 948 2927 3875 3.852321 3.891015 3.881548
70’s 1435 4807 6242 3.754007 3.899522 3.866069
80’s 2802 9320 12122 3.700214 3.764700 3.749794
90’s 18712 51733 70445 3.437366 3.384938 3.398864

The other way to create the by-sex-and-total DataFrame is to aggregate the values from prefered_decade_by_sex directly (without creating another pivot table). Take a look at the get_average function below: for each row it multiplies the number of reviews for men and for women by the corresponding average rating, then divides the sum of these products by the total number of reviews to get the combined average for that row. We also use the sort_index method to display the columns correctly:

In [260]:
prefered_decade_by_sex = final_lens.pivot_table(
    columns=['sex', ],
    index=[ 'decade'],
    values=['rating', 'idx'], 
    aggfunc={'rating': np.average, 'idx': len},
    fill_value=0,
)

prefered_decade_by_sex = prefered_decade_by_sex.drop('')

def get_average(row):
    total_f = row['idx']['F']
    total_m = row['idx']['M']
    total_rating_f = total_f*row['rating']['F']
    total_rating_m = total_m*row['rating']['M']
    
    return (total_rating_f+total_rating_m)/(total_f+total_m)

prefered_decade_by_sex['rating', 'total'] = prefered_decade_by_sex.apply(get_average, axis=1)
prefered_decade_by_sex['idx', 'total'] = prefered_decade_by_sex.apply(lambda x: x['idx']['F']+x['idx']['M'], axis=1)
prefered_decade_by_sex = prefered_decade_by_sex.sort_index(axis=1)
prefered_decade_by_sex
Out[260]:
idx rating
sex F M total F M total
decade
20’s 7 49 56.0 2.857143 3.632653 3.535714
30’s 388 1108 1496.0 3.961340 3.912455 3.925134
40’s 549 1700 2249.0 3.952641 4.029412 4.010671
50’s 897 2609 3506.0 3.835006 3.972403 3.937250
60’s 948 2927 3875.0 3.852321 3.891015 3.881548
70’s 1435 4807 6242.0 3.754007 3.899522 3.866069
80’s 2802 9320 12122.0 3.700214 3.764700 3.749794
90’s 18712 51733 70445.0 3.437366 3.384938 3.398864

Let’s try to plot this DataFrame to see if we can extract some useful conclusions:

In [261]:
f, a = plt.subplots(1,2)
prefered_decade_by_sex['idx'].plot(ax=a[0], figsize=(15,5), logy=True)
prefered_decade_by_sex['rating'].plot(ax=a[1], figsize=(15,5))
# It seems that women don't like movies from the 20's (but if you take a look at the number 
# of votes, there are too few women that have voted for 20's movies). Also, the best
# ratings seem to be for movies of the 30's and 40's and (as expected) the most reviews
# are for the newest movies
Out[261]:

Let’s try another quick pivot_table. Can we see if there’s a specific day-of-week at which the reviewers prefer to vote?

In [262]:
final_lens.pivot_table(
    index=['review_day'],
    columns=['sex'],
    values='idx',
    aggfunc=len
).plot()

# Probably not
Out[262]:

Continuing our exploration of the movie-reviews dataset, we’d like to get the total number of reviews and the average rating for each movie. To display the releases better, we’ll categorise the movies by decade and release year (using multi-index rows):

In [263]:
rating_count = final_lens.pivot_table(
    index=[ 'decade', 'release_year', 'title',],
    values=['rating', 'idx', ],
    aggfunc={
        'rating': np.average,
        'idx': len,
        
    }
)
# Drop movies without decade
rating_count = rating_count.drop('')
rating_count.head(10)
# Notice the nice hierarchical index on decade and release_year
Out[263]:
idx rating
decade release_year title
20’s 1922 Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922) 54 3.555556
1926 Scarlet Letter, The (1926) 2 3.000000
30’s 1930 Blue Angel, The (Blaue Engel, Der) (1930) 18 3.777778
1931 M (1931) 44 4.000000
1932 Farewell to Arms, A (1932) 12 3.833333
1933 Duck Soup (1933) 93 4.000000
Liebelei (1933) 1 1.000000
1934 Gay Divorcee, The (1934) 15 3.866667
It Happened One Night (1934) 81 4.012346
Of Human Bondage (1934) 5 3.200000

As we can see above, there are movies with very few reviews. I don’t really want to count them since they probably won’t have representative ratings. Also, instead of displaying all the movies, I’d like to create a small list with only the best ones:

In [264]:
# So, let's find the best movies (rating more than 4) with more than 150 reviews
best_movies = rating_count[(rating_count.idx>150) & (rating_count.rating>4)]
best_movies
Out[264]:
idx rating
decade release_year title
30’s 1939 Wizard of Oz, The (1939) 246 4.077236
40’s 1941 Citizen Kane (1941) 198 4.292929
1942 Casablanca (1942) 243 4.456790
1946 It’s a Wonderful Life (1946) 231 4.121212
50’s 1951 African Queen, The (1951) 152 4.184211
1954 Rear Window (1954) 209 4.387560
1957 Bridge on the River Kwai, The (1957) 165 4.175758
1958 Vertigo (1958) 179 4.251397
1959 North by Northwest (1959) 179 4.284916
60’s 1960 Psycho (1960) 239 4.100418
1962 Lawrence of Arabia (1962) 173 4.231214
To Kill a Mockingbird (1962) 219 4.292237
1963 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) 194 4.252577
1967 Graduate, The (1967) 239 4.104603
70’s 1972 Godfather, The (1972) 413 4.283293
1973 Sting, The (1973) 241 4.058091
1974 Godfather: Part II, The (1974) 209 4.186603
Monty Python and the Holy Grail (1974) 316 4.066456
1975 One Flew Over the Cuckoo’s Nest (1975) 264 4.291667
1977 Star Wars (1977) 583 4.358491
1979 Alien (1979) 291 4.034364
Apocalypse Now (1979) 221 4.045249
80’s 1980 Empire Strikes Back, The (1980) 367 4.204360
1981 Raiders of the Lost Ark (1981) 420 4.252381
1982 Blade Runner (1982) 275 4.138182
Gandhi (1982) 195 4.020513
1984 Amadeus (1984) 276 4.163043
1987 Princess Bride, The (1987) 324 4.172840
1989 Glory (1989) 171 4.076023
90’s 1991 Silence of the Lambs, The (1991) 390 4.289744
Terminator 2: Judgment Day (1991) 295 4.006780
1993 Fugitive, The (1993) 336 4.044643
Much Ado About Nothing (1993) 176 4.062500
Schindler’s List (1993) 298 4.466443
1994 Pulp Fiction (1994) 394 4.060914
Shawshank Redemption, The (1994) 283 4.445230
1995 Sense and Sensibility (1995) 268 4.011194
Usual Suspects, The (1995) 267 4.385768
1996 Braveheart (1995) 297 4.151515
Lone Star (1996) 187 4.053476
Secrets & Lies (1996) 162 4.265432
Taxi Driver (1976) 182 4.120879
1997 Boot, Das (1981) 201 4.203980
Fargo (1996) 508 4.155512
Good Will Hunting (1997) 198 4.262626
L.A. Confidential (1997) 297 4.161616
Return of the Jedi (1983) 507 4.007890
Titanic (1997) 350 4.245714
1998 Apt Pupil (1998) 160 4.100000
In [265]:
# Which are the most popular movies (number of votes) ?
best_movies.sort_values(by='idx', ascending=False)['idx'][:10].plot(kind='bar')

# Fargo at the 2nd and The Fugitive at the 10th place of popularity seem a little strange to me
Out[265]:
In [266]:
# Which are the best movies (vote average) ?
best_movies.sort_values(by='rating', ascending=False)['rating'][:10].plot(kind='bar', ylim=(4.2,4.5))

# I tend to agree with most of them, however I feel that the Godfather is missing...
Out[266]:

Let’s try to see how many people of each age are voting; however, instead of displaying the votes for each specific age, we’ll separate the ages into groups (0-10, 10-20, etc.) and display the counts for them:

In [267]:
review_idx_by_age = final_lens.pivot_table(index=['age'], values='idx', aggfunc=len)
print review_idx_by_age.head(10)

# Let's group by age group
def by_age(x):
    return '{0}-{1}'.format((x/10)*10, (x/10 + 1)*10)


grouped_review_idx = review_idx_by_age.groupby(by_age).aggregate(sum)
grouped_review_idx
age
7       43
10      31
11      27
13     497
14     264
15     397
16     335
17     897
18    2219
19    3514
Name: idx, dtype: int64
Out[267]:
0-10        43
10-20     8181
20-30    39535
30-40    25696
40-50    15021
50-60     8704
60-70     2623
70-80      197
Name: idx, dtype: int64
In [268]:
# Let's plot our number of votes - we can see that most people voting are 20-30
grouped_review_idx.plot(kind='pie', figsize=(8, 8), legend=True, autopct='%1.1f%%', fontsize=10 )
Out[268]:
In [269]:
# We can also see the same (more or less) info directly from the
# initial dataset, using a histogram
final_lens['age'].hist(bins=10)
Out[269]:
In [270]:
# Or just by plotting the num of reviews / age (not the grouped one)
review_idx_by_age.plot()
Out[270]:

Let’s see how people of each occupation and sex are voting. We’ll get the averages for age and rating, and the total number of reviews:

In [271]:
aggrs = {'age': np.average, 'idx': len, 'rating' :np.average}
# This creates a dataframe for men and women
d1 = final_lens.pivot_table(index='occupation', columns='sex', aggfunc=aggrs)
# This creates a dataframe for both
d2 = final_lens.pivot_table(index='occupation', aggfunc=aggrs)
# Let's put the values from the "both" dataframe to the men/women dataframe
d1[ 'idx','Total' ] = d2['idx']
d1[ 'age','Total'] = d2['age']
d1[ 'rating','Total'] = d2['rating']
# And now let's sort the column index so that the multiindex columns are grouped correctly
occupations = d1.sort_index(axis=1)
# Finally, let's sort the DataFrame by the total number of votes for each occupation
occupations = occupations.sort_values( ('idx', 'Total'), ascending=False )
occupations
# Students are no1 - not a surprise!
Out[271]:
age idx rating
sex F M Total F M Total F M Total
occupation
student 21.092697 22.510731 22.142870 5696.0 16261.0 21957 3.602879 3.484411 3.515143
other 31.813370 32.964704 32.568977 3665.0 6998.0 10663 3.531241 3.563447 3.552377
educator 37.942058 44.570167 42.789240 2537.0 6905.0 9442 3.698857 3.660246 3.670621
engineer 33.489655 34.371731 34.356086 145.0 8030.0 8175 3.751724 3.537609 3.541407
programmer 32.463007 32.502167 32.500064 419.0 7382.0 7801 3.577566 3.567732 3.568260
administrator 38.096081 39.688083 39.123145 2654.0 4825.0 7479 3.781839 3.555233 3.635646
writer 37.429848 32.219527 34.325867 2238.0 3298.0 5536 3.663986 3.180109 3.375723
librarian 36.707343 38.129300 37.358050 2860.0 2413.0 5273 3.580070 3.537920 3.560781
technician 38.000000 31.512655 31.712493 108.0 3398.0 3506 3.268519 3.540612 3.532230
executive 42.529412 36.203331 36.614164 221.0 3182.0 3403 3.773756 3.319610 3.349104
healthcare 37.644993 44.641851 38.885164 2307.0 497.0 2804 2.736021 3.639839 2.896220
artist 27.155510 33.088257 30.592288 971.0 1337.0 2308 3.347065 3.875841 3.653380
entertainment 27.546667 28.912834 28.766110 225.0 1870.0 2095 3.448889 3.440107 3.441050
scientist 28.273381 35.855133 35.343052 139.0 1919.0 2058 3.251799 3.637311 3.611273
marketing 32.106335 37.759947 36.478462 442.0 1508.0 1950 3.522624 3.474801 3.485641
retired 70.000000 61.375163 61.755749 71.0 1538.0 1609 3.239437 3.477243 3.466750
lawyer 36.000000 34.478056 34.556134 69.0 1276.0 1345 3.623188 3.741379 3.735316
none 33.887671 18.960821 25.007769 365.0 536.0 901 3.632877 3.878731 3.779134
salesman 31.318584 34.882012 33.470794 339.0 517.0 856 3.870206 3.394584 3.582944
doctor NaN 35.592593 35.592593 NaN 540.0 540 NaN 3.688889 3.688889
homemaker 33.416357 23.000000 32.371237 269.0 30.0 299 3.278810 3.500000 3.301003

Let’s manipulate the previous DataFrame to take a look at only the occupations that have voted the most. Actually, we’ll keep only the occupations that have voted more than 6000 times and add all the other occupations to the “other” occupation. For this, we’ll take only the 'idx' column and filter it for the rows that have a Total < 6000. We’ll then take the sum of this filtered DataFrame, which gives us the total votes (female / male / total) that need to be moved.

Next, we’ll add these sums to the “other” row of the DataFrame and keep only the rows with more than 6000 votes. Finally, we’ll plot the resulting DataFrame for female, male and both.

In [272]:
occupations_num = occupations['idx']
# Let's see which are the total numbers we need to add to "other"
add_to_other = occupations_num[occupations_num['Total']<6000].sum()
print add_to_other

occupations_num.loc['other']+=add_to_other
print occupations_num.loc['other']

# Now let's get the rows that have a total of > 6000
most_voters_with_other = occupations_num[occupations_num['Total']>6000]
print most_voters_with_other
most_voters_with_other.plot(kind='pie', subplots=True, figsize=(18,5 ), legend=True, autopct='%1.1f%%', fontsize=12)
sex
F        10624.0
M        23859.0
Total    34483.0
dtype: float64
sex
F        14289.0
M        30857.0
Total    45146.0
Name: other, dtype: float64
sex                  F        M    Total
occupation                              
student         5696.0  16261.0  21957.0
other          14289.0  30857.0  45146.0
educator        2537.0   6905.0   9442.0
engineer         145.0   8030.0   8175.0
programmer       419.0   7382.0   7801.0
administrator   2654.0   4825.0   7479.0
Out[272]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x...>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x...>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x...>], dtype=object)

Conclusion

In the above, I have tried to present a comprehensive introduction to the pivot_table command, along with a small introduction to the pandas data structures (Series and DataFrame) and a bunch of other related methods, using both toy and real-world examples for all of them. If you feel that something is missing, or you want me to add another example pivot_table operation for either the books or the MovieLens dataset, feel free to tell me in the comments.

I hope that after reading (and understanding) the above you’ll be able to use pandas and pivot_table without problems!