SlideShare a Scribd company logo
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
My	
  Data	
  Journey	
  with	
  Python	
  
Wes	
  McKinney	
  @wesmckinn	
  
SciPy	
  2015	
  Keynote,	
  2015-­‐07-­‐09	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Who	
  am	
  I?	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
This	
  talk	
  
•  2007-­‐present,	
  from	
  my	
  perspecOve	
  
•  CelebraOng	
  our	
  successes	
  
•  Challenges	
  and	
  opportuniOes	
  for	
  the	
  future	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  are	
  we	
  all	
  here?	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
My	
  pre-­‐2007	
  existence	
  
•  I	
  was	
  a	
  mathemaOcian!	
  
•  No	
  exposure	
  to	
  Python,	
  SQL,	
  R	
  (or	
  any	
  analyOcs	
  for	
  that	
  maYer)	
  
•  Rude	
  awakening	
  ahead	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
My	
  first	
  job:	
  AQR	
  (quant	
  hedge	
  fund)	
  
•  A	
  quant	
  finance	
  operaOon	
  that	
  lived	
  and	
  breathed	
  SQL	
  and	
  Excel	
  
•  ProducOon	
  systems	
  in	
  C++,	
  Java,	
  Visual	
  BASIC,	
  and	
  C#	
  .NET	
  
•  Some	
  PhD-­‐level	
  researchers	
  used	
  MATLAB	
  for	
  research	
  (as	
  was	
  common	
  in	
  
finance	
  /	
  economics	
  departments)	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
ProducOvity	
  frustraOons	
  
•  First	
  year:	
  several	
  analyOcs	
  and	
  staOsOcal	
  data	
  analysis	
  projects	
  
• A	
  huge	
  amount	
  of	
  SQL	
  
• Some	
  Java	
  
• A	
  liYle	
  bit	
  of	
  R	
  
• …	
  and	
  TONS	
  of	
  Excel	
  
•  Projects	
  felt	
  like	
  5%	
  conceptualizaOon,	
  95%	
  tedium	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Python	
  in	
  early	
  2008:	
  different	
  Omes	
  
•  A	
  bleeding	
  edge	
  stack	
  
• NumPy	
  1.0.4	
  
• SciPy	
  0.6.0	
  
• matplotlib	
  0.91.2	
  
• IPython	
  0.8.4,	
  SVN	
  history	
  begins	
  2/2008	
  
• Cython	
  0.9.8	
  
•  The	
  scienOfic	
  Python	
  community	
  seemed	
  mainly	
  focused	
  on	
  aYracOng	
  MATLAB,	
  
HPC,	
  and	
  scienOfic	
  lab	
  users	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
2008:	
  Things	
  SciPythonistas	
  didn’t	
  care	
  too	
  much	
  about	
  
•  RelaOonal	
  data	
  or	
  SQL	
  
•  Missing	
  data	
  handling	
  (outside	
  numpy.ma)	
  
•  StaOsOcs	
  and	
  econometrics	
  (first	
  statsmodels	
  release:	
  2011)	
  
•  StaOsOcal	
  graphics	
  
•  Machine	
  learning	
  (scikit-­‐learn	
  0.1:	
  2/2010)	
  
•  AnalyOcs	
  and	
  business	
  intelligence	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Taking	
  a	
  gamble	
  
•  Decided	
  to	
  give	
  Python	
  a	
  shot	
  for	
  AQR	
  projects	
  aoer	
  seeing	
  part	
  of	
  MASS	
  R	
  
package	
  ported	
  in	
  scipy.stats.models	
  by	
  Jonathan	
  Taylor	
  at	
  Stanford	
  
•  proto-­‐pandas	
  first	
  version	
  built	
  in	
  April	
  2008	
  
• Focused	
  on	
  porOng	
  an	
  R	
  project	
  to	
  Python	
  
•  May	
  ‘08:	
  Embedded	
  Python	
  interpreter	
  in	
  a	
  legacy	
  C++	
  system	
  
•  5/2008	
  –	
  12/2008:	
  Skunkworks	
  Python	
  ports	
  and	
  evangelism	
  across	
  company	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Why	
  did	
  Python	
  work	
  out?	
  
•  BaYeries	
  included	
  
•  Interoperability	
  with	
  C++	
  
• Embedding	
  Python	
  interpreter	
  
• Wrapping	
  C++	
  in	
  Python	
  C	
  extensions	
  
•  ProducOve	
  user	
  interface	
  
• Python	
  language	
  
• IPython	
  +	
  matplotlib	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  other	
  cool	
  things	
  we	
  built	
  
•  A	
  global	
  macro	
  risk	
  modeling	
  system	
  (using	
  pandas	
  +	
  NumPy	
  +	
  PyTables)	
  
•  A	
  heterogeneous	
  market	
  data	
  loading	
  and	
  cleaning	
  system	
  
•  A	
  task-­‐based	
  cluster	
  compuOng	
  system	
  (similar	
  to	
  Celery)	
  
•  Tick	
  data	
  storage	
  and	
  analyOcs	
  	
  
•  Various	
  GUIs	
  with	
  wxPython	
  +	
  matplotlib	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
End	
  2009:	
  pandas!	
  
•  AQR	
  lets	
  me	
  open	
  source	
  pandas	
  0.1	
  on	
  Christmas,	
  2009.	
  
~/Downloads/pandas-­‐0.1	
  $	
  cloc	
  -­‐-­‐exclude-­‐ext	
  pandas	
  
	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
Language	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  files	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  blank	
  	
  	
  	
  	
  	
  	
  	
  comment	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  code	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
Python	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  41	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  3124	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  2933	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  8225	
  
Cython	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  7	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  418	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  93	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  1247	
  
C/C++	
  Header	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  1	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  0	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  0	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  1	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
SUM:	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  49	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  3542	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  3026	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  9473	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
2010	
  –	
  2011:	
  Python’s	
  data	
  growing	
  pains	
  
•  pandas	
  did	
  not	
  evolve	
  much	
  aoer	
  its	
  iniOal	
  release	
  
•  No	
  consensus	
  or	
  momentum	
  behind	
  any	
  project	
  for	
  analyOcs	
  /	
  data	
  wrangling	
  
•  AQR	
  —>	
  Duke	
  StaOsOcal	
  Science	
  
•  AQR	
  sponsors	
  bug	
  fixes	
  and	
  new	
  features	
  in	
  pandas	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
May	
  2011:	
  Gevng	
  inspired	
  
•  2011-­‐05-­‐13:	
  Enthought	
  Datarray	
  Summit	
  
• Discuss	
  how	
  to	
  enable	
  Python	
  to	
  become	
  more	
  useful	
  staOsOcal	
  compuOng	
  
• Me:	
  “Library	
  fragmentaOon	
  is	
  destrucOve;	
  integraOon	
  is	
  beYer”	
  
• Data	
  structures,	
  missing	
  data,	
  and	
  data	
  wrangling	
  tools	
  
•  2011-­‐05-­‐23	
  –	
  2011-­‐06-­‐03	
  :	
  Python	
  finance	
  consulOng	
  engagement	
  
• Realized	
  that	
  Python	
  data	
  tools	
  sorely	
  needed	
  in	
  industry	
  
• But	
  not	
  nearly	
  mature	
  enough	
  yet	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
2011-­‐05-­‐30	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Tell	
  me	
  about	
  your	
  use	
  cases	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Making	
  pandas	
  a	
  beYer	
  tool	
  
•  ConsulOng	
  at	
  AppNexus	
  (NYC	
  ad	
  tech	
  company)	
  opened	
  eyes	
  to	
  new	
  problems	
  
•  June	
  2011	
  –	
  December	
  2012	
  
• Fix	
  some	
  pandas	
  design	
  issues	
  
• Build	
  out	
  data	
  wrangling	
  capabiliOes	
  (hierarchical	
  indexes,	
  etc.)	
  
• Create	
  “killer	
  apps”	
  (Ome	
  series	
  capabiliOes)	
  
• Evangelize	
  and	
  collaborate	
  with	
  other	
  projects	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Taking	
  advantage	
  of	
  temporary	
  
financial	
  freedom	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Making	
  a	
  book	
  happen	
  
•  A	
  chicken-­‐and-­‐egg	
  problem	
  
•  Fernando	
  Pérez,	
  Brian	
  Granger,	
  and	
  John	
  Hunter	
  
had	
  been	
  toying	
  with	
  the	
  idea	
  of	
  a	
  “SciPy	
  Book”	
  for	
  
a	
  couple	
  years	
  
•  Decided	
  to	
  forge	
  my	
  own	
  path	
  in	
  Nov	
  2011	
  
• WriOng	
  took	
  about	
  9	
  months	
  
• Helped	
  moOvate	
  me	
  to	
  “finish”	
  parts	
  of	
  pandas	
  
•  ~	
  50,000	
  copies	
  in	
  circulaOon	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Clarity	
  and	
  sooware	
  engineering	
  
•  Progress	
  in	
  sooware	
  not	
  just	
  about	
  hard	
  work	
  
•  Solving	
  the	
  right	
  problems	
  
• …	
  in	
  the	
  right	
  order	
  
• …	
  while	
  wasOng	
  liYle	
  Ome/energy	
  on	
  non-­‐impac}ul	
  issues	
  
• …	
  while	
  being	
  faced	
  with	
  real	
  world	
  concerns	
  (80/20	
  rule)	
  
•  Taking	
  the	
  Ome	
  to	
  develop	
  a	
  clear	
  vision	
  and	
  scope	
  for	
  a	
  project	
  is	
  a	
  major	
  factor	
  
in	
  its	
  success	
  or	
  failure	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
It	
  took	
  a	
  village	
  
•  Fernando	
  Perez	
  &	
  Brian	
  Granger	
  (IPython)	
  
•  Skipper	
  Seabold	
  &	
  Josef	
  Perktold	
  (statsmodels)	
  
•  Eric	
  Jones	
  (Enthought)	
  
•  Travis	
  Oliphant	
  &	
  Peter	
  Wang	
  (Enthought	
  &	
  ConOnuum)	
  
•  John	
  Hunter	
  (matplotlib)	
  
•  …	
  and	
  many	
  others	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
An	
  unlikely	
  train	
  ride	
  
	
  
SEA	
  —>	
  PDX	
  
November	
  18,	
  2011	
  
25	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Seatmate:	
  “Are	
  you	
  a	
  
programmer?”	
  	
  
	
  
(he	
  saw	
  my	
  Emacs	
  buffers)	
  	
  
26	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Wes:	
  “	
  Yeah,	
  I	
  do	
  Python”	
  	
  
27	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Seatmate:	
  “Oh,	
  I	
  do	
  a	
  bit	
  of	
  
Python	
  too”	
  
28	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Wes:	
  “Cool,	
  well,	
  there’s	
  
this	
  awesome	
  new	
  thing	
  
called	
  the	
  IPython	
  
notebook”	
  	
  
29	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
My	
  seatmate	
  was	
  computaOonal	
  bio	
  
professor	
  and	
  5-­‐year	
  PSF	
  member	
  	
  
Titus	
  Brown	
  
30	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
And	
  he	
  would	
  later	
  assist	
  the	
  
IPython	
  team	
  in	
  their	
  Sloan	
  
FoundaOon	
  $1mm	
  grant	
  in	
  2012	
  
31	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  words	
  about	
  	
  
John	
  Hunter	
  (1968	
  –	
  2012)	
  
32	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Business	
  ventures	
  2012	
  -­‐	
  2014	
  
•  2012	
  :	
  Lambda	
  Foundry	
  
• Support	
  and	
  develop	
  pandas	
  
• Explored	
  creaOng	
  a	
  commercial	
  Python	
  financial	
  toolkit	
  
•  2013	
  –	
  2014	
  :	
  DataPad	
  
• “Google	
  Drive	
  for	
  AnalyOcs	
  /	
  BI”	
  
• With	
  Chang	
  She	
  (MIT	
  —>	
  AQR	
  —>	
  pandas)	
  
• Silicon	
  Valley	
  VC-­‐backed	
  
• Acquired	
  by	
  Cloudera	
  in	
  September	
  2014	
  
33	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Cloudera	
  
•  Sort	
  of	
  “the	
  Red	
  Hat	
  of	
  Big	
  Data”	
  
•  The	
  leading	
  open	
  source	
  Hadoop	
  pla}orm	
  
•  SupporOng	
  and	
  developing	
  a	
  liYle	
  over	
  20	
  Apache-­‐licensed	
  open	
  source	
  projects	
  
•  A	
  dream	
  job	
  
• Full	
  Ome	
  open	
  source	
  development	
  
• Solving	
  hard	
  data	
  problems	
  faced	
  by	
  the	
  world’s	
  largest	
  companies	
  
•  P.S.	
  we’re	
  hiring	
  engineers	
  in	
  AusOn	
  +	
  Bay	
  Area	
  
34	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What	
  I’m	
  interested	
  in	
  right	
  now	
  
•  Ways	
  to	
  enable	
  collaboraOon	
  on	
  data	
  tools	
  across	
  programming	
  languages	
  
	
  
•  Domain	
  specific	
  language	
  design	
  and	
  compilaOon	
  
•  Improving	
  the	
  Python-­‐on-­‐Hadoop	
  experience	
  
•  LLVM	
  +	
  Code	
  generaOon	
  
35	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Different	
  kinds	
  of	
  Big	
  Data	
  
•  Python	
  programmers	
  have	
  been	
  dealing	
  with	
  big	
  scienOfic	
  data	
  in	
  HPC	
  sevngs	
  
for	
  years	
  
•  Big…	
  
• Text	
  data	
  
• Homogeneous	
  array	
  data	
  
• Tabular	
  (structured)	
  data	
  
• JSON-­‐like	
  (semi-­‐structured)	
  data	
  
36	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  Great	
  Data	
  Tool	
  Decoupling™	
  
•  Thesis:	
  over	
  Ome,	
  user	
  interfaces,	
  data	
  storage,	
  and	
  execuOon	
  engines	
  will	
  
decouple	
  and	
  specialize	
  
•  In	
  fact,	
  you	
  should	
  really	
  want	
  this	
  to	
  happen	
  
• Share	
  systems	
  among	
  languages	
  
• Reduce	
  fragmentaOon	
  and	
  “lock-­‐in”	
  
• Shio	
  developer	
  focus	
  to	
  usability	
  	
  
•  PredicOon:	
  we’ll	
  be	
  there	
  by	
  2025;	
  sooner	
  if	
  we	
  all	
  get	
  our	
  act	
  together	
  
37	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
@wesmckinn	
  
Ad

More Related Content

What's hot (19)

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
Thomas W. Dinsmore
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
P. Taylor Goetz
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
P. Taylor Goetz
 

Viewers also liked (12)

pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
Wes McKinney
 
Hacking
HackingHacking
Hacking
Karan Poshattiwar
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
Wes McKinney
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
 
Ad

Similar to My Data Journey with Python (SciPy 2015 Keynote) (20)

RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
Databricks
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
South West Data Meetup
 
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
ScalrCMP
 
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
NebulaInc
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Neo4j
 
Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...
Varad Meru
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
OpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic CloudOpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic Cloud
Jakub Pavlik
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Precisely
 
Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)
Rogue Wave Software
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use Cases
All Things Open
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
David Danzilio
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Data Con LA
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Steven Totman
 
Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015
Eduardo Pelegri-Llopart
 
Introduction to OpenStack Storage
Introduction to OpenStack StorageIntroduction to OpenStack Storage
Introduction to OpenStack Storage
NetApp
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Cloudera, Inc.
 
How to Enterprise Node
How to Enterprise NodeHow to Enterprise Node
How to Enterprise Node
Julián David Duque
 
Exploring and Using the Python Ecosystem
Exploring and Using the Python EcosystemExploring and Using the Python Ecosystem
Exploring and Using the Python Ecosystem
Adam Cook
 
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
RubiOne: Apache Spark as the Backbone of a Retail Analytics Development Envir...
Databricks
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
South West Data Meetup
 
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
Webinar Nebula&Scalr : Increasing Business Agility with Real-time Processing ...
ScalrCMP
 
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
Webinar: Increasing Business Agility with Real-time Processing with Apache Ha...
NebulaInc
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, Cloudera
Neo4j
 
Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...Big Data, Hadoop, NoSQL and more ...
Big Data, Hadoop, NoSQL and more ...
Varad Meru
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
OpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic CloudOpenStack Journey in Tieto Elastic Cloud
OpenStack Journey in Tieto Elastic Cloud
Jakub Pavlik
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Precisely
 
Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)
Rogue Wave Software
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use Cases
All Things Open
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
David Danzilio
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Data Con LA
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Steven Totman
 
Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015Market trends in IT - exchange cala - October 2015
Market trends in IT - exchange cala - October 2015
Eduardo Pelegri-Llopart
 
Introduction to OpenStack Storage
Introduction to OpenStack StorageIntroduction to OpenStack Storage
Introduction to OpenStack Storage
NetApp
 
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data InsightSyncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Syncsort, Tableau, & Cloudera present: Break the Barriers to Big Data Insight
Cloudera, Inc.
 
Exploring and Using the Python Ecosystem
Exploring and Using the Python EcosystemExploring and Using the Python Ecosystem
Exploring and Using the Python Ecosystem
Adam Cook
 
Ad

More from Wes McKinney (16)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 

Recently uploaded (20)

AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
AI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamsonAI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à GenèveUiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPathCommunity
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Conte...
Ivano Malavolta
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
AI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamsonAI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Optima Cyber - Maritime Cyber Security - MSSP Services - Manolis Sfakianakis ...
Mike Mingos
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptxSmart Investments Leveraging Agentic AI for Real Estate Success.pptx
Smart Investments Leveraging Agentic AI for Real Estate Success.pptx
Seasia Infotech
 
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à GenèveUiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPath Automation Suite – Cas d'usage d'une NGO internationale basée à Genève
UiPathCommunity
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 

My Data Journey with Python (SciPy 2015 Keynote)

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   My  Data  Journey  with  Python   Wes  McKinney  @wesmckinn   SciPy  2015  Keynote,  2015-­‐07-­‐09  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Who  am  I?  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   This  talk   •  2007-­‐present,  from  my  perspecOve   •  CelebraOng  our  successes   •  Challenges  and  opportuniOes  for  the  future  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   Why  are  we  all  here?  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   My  pre-­‐2007  existence   •  I  was  a  mathemaOcian!   •  No  exposure  to  Python,  SQL,  R  (or  any  analyOcs  for  that  maYer)   •  Rude  awakening  ahead  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   My  first  job:  AQR  (quant  hedge  fund)   •  A  quant  finance  operaOon  that  lived  and  breathed  SQL  and  Excel   •  ProducOon  systems  in  C++,  Java,  Visual  BASIC,  and  C#  .NET   •  Some  PhD-­‐level  researchers  used  MATLAB  for  research  (as  was  common  in   finance  /  economics  departments)  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   ProducOvity  frustraOons   •  First  year:  several  analyOcs  and  staOsOcal  data  analysis  projects   • A  huge  amount  of  SQL   • Some  Java   • A  liYle  bit  of  R   • …  and  TONS  of  Excel   •  Projects  felt  like  5%  conceptualizaOon,  95%  tedium  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Python  in  early  2008:  different  Omes   •  A  bleeding  edge  stack   • NumPy  1.0.4   • SciPy  0.6.0   • matplotlib  0.91.2   • IPython  0.8.4,  SVN  history  begins  2/2008   • Cython  0.9.8   •  The  scienOfic  Python  community  seemed  mainly  focused  on  aYracOng  MATLAB,   HPC,  and  scienOfic  lab  users  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   2008:  Things  SciPythonistas  didn’t  care  too  much  about   •  RelaOonal  data  or  SQL   •  Missing  data  handling  (outside  numpy.ma)   •  StaOsOcs  and  econometrics  (first  statsmodels  release:  2011)   •  StaOsOcal  graphics   •  Machine  learning  (scikit-­‐learn  0.1:  2/2010)   •  AnalyOcs  and  business  intelligence  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Taking  a  gamble   •  Decided  to  give  Python  a  shot  for  AQR  projects  aoer  seeing  part  of  MASS  R   package  ported  in  scipy.stats.models  by  Jonathan  Taylor  at  Stanford   •  proto-­‐pandas  first  version  built  in  April  2008   • Focused  on  porOng  an  R  project  to  Python   •  May  ‘08:  Embedded  Python  interpreter  in  a  legacy  C++  system   •  5/2008  –  12/2008:  Skunkworks  Python  ports  and  evangelism  across  company  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Why  did  Python  work  out?   •  BaYeries  included   •  Interoperability  with  C++   • Embedding  Python  interpreter   • Wrapping  C++  in  Python  C  extensions   •  ProducOve  user  interface   • Python  language   • IPython  +  matplotlib  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Some  other  cool  things  we  built   •  A  global  macro  risk  modeling  system  (using  pandas  +  NumPy  +  PyTables)   •  A  heterogeneous  market  data  loading  and  cleaning  system   •  A  task-­‐based  cluster  compuOng  system  (similar  to  Celery)   •  Tick  data  storage  and  analyOcs     •  Various  GUIs  with  wxPython  +  matplotlib  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   End  2009:  pandas!   •  AQR  lets  me  open  source  pandas  0.1  on  Christmas,  2009.   ~/Downloads/pandas-­‐0.1  $  cloc  -­‐-­‐exclude-­‐ext  pandas     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐   Language                                          files                    blank                comment                      code   -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐   Python                                                    41                      3124                      2933                      8225   Cython                                                      7                        418                          93                      1247   C/C++  Header                                          1                            0                            0                            1   -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐   SUM:                                                        49                      3542                      3026                      9473   -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   2010  –  2011:  Python’s  data  growing  pains   •  pandas  did  not  evolve  much  aoer  its  iniOal  release   •  No  consensus  or  momentum  behind  any  project  for  analyOcs  /  data  wrangling   •  AQR  —>  Duke  StaOsOcal  Science   •  AQR  sponsors  bug  fixes  and  new  features  in  pandas  
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   May  2011:  Gevng  inspired   •  2011-­‐05-­‐13:  Enthought  Datarray  Summit   • Discuss  how  to  enable  Python  to  become  more  useful  staOsOcal  compuOng   • Me:  “Library  fragmentaOon  is  destrucOve;  integraOon  is  beYer”   • Data  structures,  missing  data,  and  data  wrangling  tools   •  2011-­‐05-­‐23  –  2011-­‐06-­‐03  :  Python  finance  consulOng  engagement   • Realized  that  Python  data  tools  sorely  needed  in  industry   • But  not  nearly  mature  enough  yet  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   2011-­‐05-­‐30  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Tell  me  about  your  use  cases  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Making  pandas  a  beYer  tool   •  ConsulOng  at  AppNexus  (NYC  ad  tech  company)  opened  eyes  to  new  problems   •  June  2011  –  December  2012   • Fix  some  pandas  design  issues   • Build  out  data  wrangling  capabiliOes  (hierarchical  indexes,  etc.)   • Create  “killer  apps”  (Ome  series  capabiliOes)   • Evangelize  and  collaborate  with  other  projects  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Taking  advantage  of  temporary   financial  freedom  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Making  a  book  happen   •  A  chicken-­‐and-­‐egg  problem   •  Fernando  Pérez,  Brian  Granger,  and  John  Hunter   had  been  toying  with  the  idea  of  a  “SciPy  Book”  for   a  couple  years   •  Decided  to  forge  my  own  path  in  Nov  2011   • WriOng  took  about  9  months   • Helped  moOvate  me  to  “finish”  parts  of  pandas   •  ~  50,000  copies  in  circulaOon  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Clarity  and  sooware  engineering   •  Progress  in  sooware  not  just  about  hard  work   •  Solving  the  right  problems   • …  in  the  right  order   • …  while  wasOng  liYle  Ome/energy  on  non-­‐impac}ul  issues   • …  while  being  faced  with  real  world  concerns  (80/20  rule)   •  Taking  the  Ome  to  develop  a  clear  vision  and  scope  for  a  project  is  a  major  factor   in  its  success  or  failure  
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   It  took  a  village   •  Fernando  Perez  &  Brian  Granger  (IPython)   •  Skipper  Seabold  &  Josef  Perktold  (statsmodels)   •  Eric  Jones  (Enthought)   •  Travis  Oliphant  &  Peter  Wang  (Enthought  &  ConOnuum)   •  John  Hunter  (matplotlib)   •  …  and  many  others  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   An  unlikely  train  ride     SEA  —>  PDX   November  18,  2011  
  • 25. 25  ©  Cloudera,  Inc.  All  rights  reserved.   Seatmate:  “Are  you  a   programmer?”       (he  saw  my  Emacs  buffers)    
  • 26. 26  ©  Cloudera,  Inc.  All  rights  reserved.   Wes:  “  Yeah,  I  do  Python”    
  • 27. 27  ©  Cloudera,  Inc.  All  rights  reserved.   Seatmate:  “Oh,  I  do  a  bit  of   Python  too”  
  • 28. 28  ©  Cloudera,  Inc.  All  rights  reserved.   Wes:  “Cool,  well,  there’s   this  awesome  new  thing   called  the  IPython   notebook”    
  • 29. 29  ©  Cloudera,  Inc.  All  rights  reserved.   My  seatmate  was  computaOonal  bio   professor  and  5-­‐year  PSF  member     Titus  Brown  
  • 30. 30  ©  Cloudera,  Inc.  All  rights  reserved.   And  he  would  later  assist  the   IPython  team  in  their  Sloan   FoundaOon  $1mm  grant  in  2012  
  • 31. 31  ©  Cloudera,  Inc.  All  rights  reserved.   Some  words  about     John  Hunter  (1968  –  2012)  
  • 32. 32  ©  Cloudera,  Inc.  All  rights  reserved.   Business  ventures  2012  -­‐  2014   •  2012  :  Lambda  Foundry   • Support  and  develop  pandas   • Explored  creaOng  a  commercial  Python  financial  toolkit   •  2013  –  2014  :  DataPad   • “Google  Drive  for  AnalyOcs  /  BI”   • With  Chang  She  (MIT  —>  AQR  —>  pandas)   • Silicon  Valley  VC-­‐backed   • Acquired  by  Cloudera  in  September  2014  
  • 33. 33  ©  Cloudera,  Inc.  All  rights  reserved.   Cloudera   •  Sort  of  “the  Red  Hat  of  Big  Data”   •  The  leading  open  source  Hadoop  pla}orm   •  SupporOng  and  developing  a  liYle  over  20  Apache-­‐licensed  open  source  projects   •  A  dream  job   • Full  Ome  open  source  development   • Solving  hard  data  problems  faced  by  the  world’s  largest  companies   •  P.S.  we’re  hiring  engineers  in  AusOn  +  Bay  Area  
  • 34. 34  ©  Cloudera,  Inc.  All  rights  reserved.   What  I’m  interested  in  right  now   •  Ways  to  enable  collaboraOon  on  data  tools  across  programming  languages     •  Domain  specific  language  design  and  compilaOon   •  Improving  the  Python-­‐on-­‐Hadoop  experience   •  LLVM  +  Code  generaOon  
  • 35. 35  ©  Cloudera,  Inc.  All  rights  reserved.   Different  kinds  of  Big  Data   •  Python  programmers  have  been  dealing  with  big  scienOfic  data  in  HPC  sevngs   for  years   •  Big…   • Text  data   • Homogeneous  array  data   • Tabular  (structured)  data   • JSON-­‐like  (semi-­‐structured)  data  
  • 36. 36  ©  Cloudera,  Inc.  All  rights  reserved.   The  Great  Data  Tool  Decoupling™   •  Thesis:  over  Ome,  user  interfaces,  data  storage,  and  execuOon  engines  will   decouple  and  specialize   •  In  fact,  you  should  really  want  this  to  happen   • Share  systems  among  languages   • Reduce  fragmentaOon  and  “lock-­‐in”   • Shio  developer  focus  to  usability     •  PredicOon:  we’ll  be  there  by  2025;  sooner  if  we  all  get  our  act  together  
  • 37. 37  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   @wesmckinn  
  翻译: