dplython for Python

Heaven Zone | 2018-08-23  |  python

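Introduction

I had long wondered whether Python has a library like R's dplyr. A search yesterday turned up one called dplython.

The notes below walk through how to use dplython.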

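Installation

Install the released version from PyPI:

pip install dplython

Or install the development version directly from GitHub:

pip install git+https://github.com/dodger487/dplython.git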

Usage

Sample data

import pandas as pd
data = [["Jordan", "Bulls", 23, 300, 10],
        ["Kobe", "Lakers", 24, 280, 10],
        ["O'Neal", "Lakers", 34, 250, 8],
        ["Pippen", "Bulls", 33, 240, 10],
        ["Iverson", "76ers", 3, 290, 10],
        ["Duncan", "Spurs", 21, 270, 10]]

df = pd.DataFrame(data, columns=["name", "team", "number", "pts", "gs"])
print(df)
      name    team  number  pts  gs
0   Jordan   Bulls      23  300  10
1     Kobe  Lakers      24  280  10
2   O'Neal  Lakers      34  250   8
3   Pippen   Bulls      33  240  10
4  Iverson   76ers       3  290  10
5   Duncan   Spurs      21  270  10

Loading the library

import dplython as dp
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

Converting the data type

df = DplyFrame(df)
type(df)
dplython.dplython.DplyFrame

Selecting columns

select

>> is the equivalent of dplyr's %>%, and X stands for df itself (the DataFrame being piped).

df >> select(X.name, X.number)
      name  number
0   Jordan      23
1     Kobe      24
2   O'Neal      34
3   Pippen      33
4  Iverson       3
5   Duncan      21
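
To make the roles of >> and X more concrete, here is a minimal pure-Python sketch of how a pipe like this can be built. It is not dplython's actual implementation: Column, Placeholder and Verb are hypothetical names, this select is a toy version, and rows are plain dicts rather than a DataFrame.

```python
class Column:
    """Hypothetical deferred column reference, e.g. X.name."""
    def __init__(self, name):
        self.name = name

    def __call__(self, row):
        # Look the column up only when a row is finally supplied
        return row[self.name]


class Placeholder:
    """Hypothetical stand-in for X: any attribute becomes a Column."""
    def __getattr__(self, name):
        return Column(name)


class Verb:
    """Hypothetical pipe step: `rows >> verb` calls verb.func(rows)."""
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, rows):
        # list has no __rshift__, so Python falls back to this method
        return self.func(rows)


def select(*columns):
    # Toy select: keep only the requested keys of each row
    return Verb(lambda rows: [{c.name: c(row) for c in columns} for row in rows])


X = Placeholder()
rows = [
    {"name": "Jordan", "team": "Bulls", "number": 23},
    {"name": "Kobe", "team": "Lakers", "number": 24},
]
picked = rows >> select(X.name, X.number)
print(picked)
# [{'name': 'Jordan', 'number': 23}, {'name': 'Kobe', 'number': 24}]
```

The point of the sketch is that X.name does no work when it is written. It only records which column is wanted, and the lookup happens once >> hands the data to the verb.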

Filtering rows

sift

print( df >> sift(X.number < 30) )
print("\n" )
print( df >> sift( (X.number < 30) & (X.pts > 280)) )
print("\n")
print( df 
      >> sift( (X.number < 30) | (X.pts < 250)) 
      >> select(X.name,X.number,X.pts)
     ) 
      name    team  number  pts  gs
0   Jordan   Bulls      23  300  10
1     Kobe  Lakers      24  280  10
4  Iverson   76ers       3  290  10
5   Duncan   Spurs      21  270  10


      name   team  number  pts  gs
0   Jordan  Bulls      23  300  10
4  Iverson  76ers       3  290  10


      name  number  pts
0   Jordan      23  300
1     Kobe      24  280
3   Pippen      33  240
4  Iverson       3  290
5   Duncan      21  270
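
For comparison, the same filters can be written in plain pandas with boolean masks. This is a sketch of the equivalent pandas idiom, not of what sift does internally. The parentheses around each comparison matter in both styles, because & and | bind more tightly than < and > in Python.

```python
import pandas as pd

data = [["Jordan", "Bulls", 23, 300, 10],
        ["Kobe", "Lakers", 24, 280, 10],
        ["O'Neal", "Lakers", 34, 250, 8],
        ["Pippen", "Bulls", 33, 240, 10],
        ["Iverson", "76ers", 3, 290, 10],
        ["Duncan", "Spurs", 21, 270, 10]]
df = pd.DataFrame(data, columns=["name", "team", "number", "pts", "gs"])

# AND: both conditions must hold for a row to be kept
both = df[(df["number"] < 30) & (df["pts"] > 280)]

# OR, followed by a column selection
either = df[(df["number"] < 30) | (df["pts"] < 250)][["name", "number", "pts"]]

print(both)
print(either)
```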

Random sampling

sample_n or sample_frac

df >> sample_n(3)
     name    team  number  pts  gs
3  Pippen   Bulls      33  240  10
1    Kobe  Lakers      24  280  10
0  Jordan   Bulls      23  300  10

df >> sample_frac(0.3)
     name    team  number  pts  gs
2  O'Neal  Lakers      34  250   8
1    Kobe  Lakers      24  280  10
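
Because sampling is random, repeated runs return different rows. The plain pandas counterpart is DataFrame.sample, whose random_state parameter makes a draw reproducible. The sketch below uses pandas only; whether sample_n and sample_frac accept a seed is not covered here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jordan", "Kobe", "O'Neal", "Pippen", "Iverson", "Duncan"],
    "pts": [300, 280, 250, 240, 290, 270],
})

# Same seed, same rows: useful when results must be repeatable
three = df.sample(n=3, random_state=0)
again = df.sample(n=3, random_state=0)

# 30% of 6 rows is 1.8, which pandas rounds to 2 rows
part = df.sample(frac=0.3, random_state=0)

print(three)
print(part)
```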

Creating new columns

mutate

The following computes points per game:

df = df >> mutate(ppg=X.pts/X.gs)
print(df)
      name    team  number  pts  gs    ppg
0   Jordan   Bulls      23  300  10  30.00
1     Kobe  Lakers      24  280  10  28.00
2   O'Neal  Lakers      34  250   8  31.25
3   Pippen   Bulls      33  240  10  24.00
4  Iverson   76ers       3  290  10  29.00
5   Duncan   Spurs      21  270  10  27.00
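
The closest plain pandas idiom is DataFrame.assign, which also returns a new frame instead of modifying the original. A minimal sketch on the same numbers, using pandas only:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jordan", "Kobe", "O'Neal", "Pippen", "Iverson", "Duncan"],
    "pts": [300, 280, 250, 240, 290, 270],
    "gs": [10, 10, 8, 10, 10, 10],
})

# assign returns a copy with the extra column; df itself is left unchanged
with_ppg = df.assign(ppg=df["pts"] / df["gs"])
print(with_ppg)
```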

Grouped summaries

group_by and summarize

(df >>    
 group_by(X.team) >>
 summarize(sum_pts=X.pts.sum(), mean_pts=X.pts.mean())
)
     team  sum_pts  mean_pts
0   76ers      290       290
1   Bulls      540       270
2  Lakers      530       265
3   Spurs      270       270
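
The same summary can be produced in plain pandas with groupby and named aggregation. This sketch assumes pandas 0.25 or later, where the keyword form of agg is available.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["Bulls", "Lakers", "Lakers", "Bulls", "76ers", "Spurs"],
    "pts": [300, 280, 250, 240, 290, 270],
})

# One output row per team; each keyword names a result column
summary = df.groupby("team", as_index=False).agg(
    sum_pts=("pts", "sum"),
    mean_pts=("pts", "mean"),
)
print(summary)
```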

Sorting

arrange

(df >>    
 group_by(X.team) >>
 summarize(sum_pts=X.pts.sum(), mean_pts=X.pts.mean()) >>
 arrange(-X.mean_pts)
)
     team  sum_pts  mean_pts
0   76ers      290       290
1   Bulls      540       270
3   Spurs      270       270
2  Lakers      530       265
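
Bulls and Spurs tie on mean_pts, so their relative order depends on how ties are handled. A plain pandas sketch that sorts in descending order and breaks ties explicitly on team, using the summary values shown above:

```python
import pandas as pd

summary = pd.DataFrame({
    "team": ["76ers", "Bulls", "Lakers", "Spurs"],
    "sum_pts": [290, 540, 530, 270],
    "mean_pts": [290, 270, 265, 270],
})

# Descending by mean_pts, then ascending by team to make ties deterministic
ranked = summary.sort_values(["mean_pts", "team"], ascending=[False, True])
print(ranked)
```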

When column names contain spaces

When a column name contains spaces or special characters, use X["column name"].

df["W/L pct"] = range(len(df))
print( df >> select(X.name, X["W/L pct"]) >> head(3) )
     name  W/L pct
0  Jordan        0
1    Kobe        1
2  O'Neal        2
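
The underlying reason is that attribute syntax only works for names that are valid Python identifiers. The small helper below, needs_brackets, is a hypothetical name introduced for illustration; it uses the standard library to decide which spelling a column needs. The column names "3pt" and "class" are made-up examples.

```python
import keyword


def needs_brackets(column):
    """Return True when a column name cannot be written as X.<name>."""
    return not column.isidentifier() or keyword.iskeyword(column)


for column in ["name", "W/L pct", "3pt", "class"]:
    if needs_brackets(column):
        print('X["%s"]' % column)
    else:
        print("X.%s" % column)
```

Reserved words such as class are identifiers in form, but Python rejects them after a dot, so they need the bracket form as well.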

The placeholder symbol

X._ is the equivalent of dplyr's . placeholder: it refers to the whole piped DataFrame.

print( df >> select(X.name, X.number) >> X._.T )
             0     1       2       3        4       5
name    Jordan  Kobe  O'Neal  Pippen  Iverson  Duncan
number      23    24      34      33        3      21
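
In plain pandas, the nearest idiom for handing the whole intermediate frame to the next step is DataFrame.pipe. A short sketch of the same transpose, using pandas only and a three-row subset of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jordan", "Kobe", "O'Neal"],
    "team": ["Bulls", "Lakers", "Lakers"],
    "number": [23, 24, 34],
})

# pipe passes the frame produced so far into the given function
wide = df[["name", "number"]].pipe(lambda frame: frame.T)
print(wide)
```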

Passing to ggplot

The ggplot function has to be wrapped with dplython.DelayFunction before it can be used in a pipe.

import ggplot as gg
from pandas import Timestamp
ggplot = DelayFunction(gg.ggplot)  # the function lives in the gg module; the bare name ggplot is undefined here
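
To show what wrapping a function for delayed use means, here is a rough pure-Python sketch. It is not dplython's DelayFunction: delay, Verb, WHOLE and describe are all hypothetical names, and WHOLE merely plays the role that X._ plays in a real pipe.

```python
class Verb:
    """Hypothetical pipe step: `data >> verb` calls verb.func(data)."""
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        return self.func(data)


WHOLE = object()  # hypothetical sentinel standing in for X._


def delay(func):
    """Wrap func so that calling it builds a pipe step instead of running it."""
    def build(*args, **kwargs):
        def run(data):
            # Substitute the piped data wherever the sentinel was used
            real_args = [data if a is WHOLE else a for a in args]
            real_kwargs = {k: (data if v is WHOLE else v) for k, v in kwargs.items()}
            return func(*real_args, **real_kwargs)
        return Verb(run)
    return build


def describe(label, data):
    return "%s: %d rows" % (label, len(data))


describe = delay(describe)
result = [1, 2, 3] >> describe("points", data=WHOLE)
print(result)  # points: 3 rows
```

An ordinary function would try to run as soon as it is called, before any data has arrived through the pipe. The wrapper postpones the call until >> supplies the data.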

If ggplot cannot be imported, see "ImportError: cannot import name 'Timestamp'".

References