GSoC 2017 - Coding Period | Week 3

Posted on 18/06/2017 at 3:30 PM by The Vibe.

There is something special in the air about the 18th of June. But what might possibly be special about this day? Maybe it's International Father's Day? Or maybe it's my birthday? Or maybe it's the Champions Trophy 2017 Final? Or maybe it's just because it's a Sunday? No points for guessing: it's all of the above, packaged into one single day.

Today also marks the end of the 3rd week of the coding period of my GSoC project, daru-io. Time flies, after all. Through this blog post, I'd like to document the progress made this week.


JSON Importer

According to my timeline, the next importer I had planned to add support for was the JSON Importer. For the uninitiated, JSON is the format in which most APIs provide their responses. For example, a typical API JSON response (shown here in Ruby hash notation) looks like this -

  
    #! Simple JSON
    {
      simple_key_1: :simple_value_1,
      simple_key_2: :simple_value_2,
      ...,
      simple_key_n: :simple_value_n
    }


    #! Complexly nested JSON
    {
      simple_key_1:  :simple_value_1,
      complex_key_1: {
        simple_key_2: :simple_value_2,
        complex_key_2: [1,2,3],
        ... 
        # More complex structures constituting of nested arrays,
        # nested hashes or combinations of both.
      }
    }
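
On the wire, JSON uses quoted string keys rather than the Ruby-style symbols shown above. As a minimal sketch of how such a response turns into Ruby objects, the snippet below uses only Ruby's standard `json` library; the payload is made up for illustration.

```ruby
require 'json'

# A made-up JSON payload, shaped like the nested example above.
payload = <<~JSON
  {
    "simple_key_1": "simple_value_1",
    "complex_key_1": {
      "simple_key_2": "simple_value_2",
      "complex_key_2": [1, 2, 3]
    }
  }
JSON

# JSON.parse returns plain Ruby Hashes and Arrays with String keys.
parsed = JSON.parse(payload)

# Passing symbolize_names: true yields Symbol keys instead.
symbolized = JSON.parse(payload, symbolize_names: true)

puts parsed['complex_key_1']['complex_key_2'].inspect
puts symbolized[:simple_key_1]
```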
  

Importing these simple JSON responses into Daru::DataFrame was quite straightforward, as the Daru::DataFrame.new() method already supports various types of inputs like hashes, arrays or arrays of hashes. Here's an example showing an import from a simple JSON response -

  
  require 'daru/io/importers/json'
  => true

  url = 'https://data.nasa.gov/resource/2vr3-k9wn.json'
  df  = Daru::IO::Importers::JSON.new(url).call

  df

  => < Daru::DataFrame(202x10) >
       designation discovery_      h_mag      i_deg    moid_au orbit_clas  period_yr ...
     0 419880 (20 2011-01-07       19.7       9.65      0.035     Apollo       4.06 ...
     1 419624 (20 2010-09-17       20.5      14.52      0.028     Apollo          1 ...
     2 414772 (20 2010-07-28         19      23.11      0.333     Apollo       1.31 ...
     3 414746 (20 2010-03-06         18      23.89      0.268       Amor       4.24 ...
     4 407324 (20 2010-07-18       20.7       9.12      0.111     Apollo       2.06 ...
     5 398188 (20 2010-06-03       19.5      13.25      0.024       Aten        0.8 ...
     6 395207 (20 2010-04-25       19.6      27.85      0.007     Apollo       1.96 ...
     7 386847 (20 2010-06-06         18       5.84      0.029     Apollo        2.2 ...
   ...        ...        ...        ...        ...        ...        ...        ... ...
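
To see why the simple case maps so naturally onto a dataframe, here is a rough sketch of the reshaping involved: an array of JSON records becomes one column per key. The helper `records_to_columns` is hypothetical and only illustrates the idea; it is not Daru's internal implementation, and the records are made up.

```ruby
require 'json'

# Hypothetical helper: turn an array of records (hashes) into a
# hash of columns, filling keys missing from a record with nil.
def records_to_columns(records)
  keys = records.flat_map(&:keys).uniq
  keys.each_with_object({}) do |key, columns|
    columns[key] = records.map { |record| record[key] }
  end
end

# Made-up records, loosely modelled on the asteroid data above.
json = '[{"designation": "A", "h_mag": 19.7},
         {"designation": "B", "h_mag": 20.5, "orbit_class": "Apollo"}]'

columns = records_to_columns(JSON.parse(json))
puts columns.inspect
```

A hash of columns like this is one of the input shapes that Daru::DataFrame.new() accepts.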
  

Cool. But what about the important scenario of complexly nested structures? The JsonPath gem comes to the rescue here, by allowing users to select certain JSON paths from the complex JSON structure; the Daru::DataFrame is then constructed from the values found at these paths. Hence, Daru::IO::Importers::JSON can be used to import a dataframe from complexly nested JSON in the following manner -

  
  require 'daru/io/importers/json'
  => true

  url = 'http://api.tvmaze.com/singlesearch/shows?q=game-of-thrones&embed=episodes'
  df  = Daru::IO::Importers::JSON.new(url,
          "$.._embedded..episodes..name",
          "$.._embedded..episodes..season",
          "$.._embedded..episodes..number",
          index: (10..70).to_a,
          RunTime: "$.._embedded..episodes..runtime"
        ).call

  # Note that the hash json-path selectors like index, RunTime, etc.
  # should be given after normal json-path selectors like that of
  # name, season and number.

  df

  => < Daru::DataFrame(61x4) >
           name     season     number    RunTime
  10 Winter is           1          1         60
  11 The Kingsr          1          2         60
  12  Lord Snow          1          3         60
  13 Cripples,           1          4         60
  14 The Wolf a          1          5         60
  15 A Golden C          1          6         60
  16 You Win or          1          7         60
  17 The Pointy          1          8         60
  18     Baelor          1          9         60
  19 Fire and B          1         10         60
  20 The North           2          1         60
  21 The Night           2          2         60
  22 What is De          2          3         60
  23 Garden of           2          4         60
  24 The Ghost           2          5         60
  ...        ...        ...        ...        ...
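
To give a feel for what the `..` (recursive descent) selectors are doing, here is a minimal, stdlib-only sketch. It does not use the JsonPath gem: `descend` and `select_path` are hypothetical helpers that handle only chains of `..key` steps, and the payload is a trimmed, made-up stand-in for the TVmaze response.

```ruby
require 'json'

# Hypothetical helper: collect every value stored under `key`,
# at any depth below `node`.
def descend(node, key)
  case node
  when Hash
    node.flat_map do |k, v|
      (k == key ? [v] : []) + descend(v, key)
    end
  when Array
    node.flat_map { |element| descend(element, key) }
  else
    []
  end
end

# Hypothetical helper: apply a chain of `..key` steps, e.g.
# "$.._embedded..episodes..name" => ["_embedded", "episodes", "name"].
def select_path(root, path)
  keys = path.sub(/\A\$/, '').split('..').reject(&:empty?)
  keys.reduce([root]) do |nodes, key|
    nodes.flat_map { |node| descend(node, key) }
  end
end

payload = JSON.parse(<<~JSON)
  {
    "name": "Game of Thrones",
    "_embedded": {
      "episodes": [
        {"name": "Winter is Coming", "season": 1, "number": 1, "runtime": 60},
        {"name": "The Kingsroad",    "season": 1, "number": 2, "runtime": 60}
      ]
    }
  }
JSON

# One column per selector, mirroring the importer call above.
columns = {
  name:    select_path(payload, '$.._embedded..episodes..name'),
  season:  select_path(payload, '$.._embedded..episodes..season'),
  RunTime: select_path(payload, '$.._embedded..episodes..runtime')
}
puts columns.inspect
```

Because each path goes through `_embedded` and `episodes` first, the show's own top-level `name` is not picked up; only the episode names are.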
  

Currently, the JSON Importer is able to parse local JSON files, remote JSON files, JSON strings and JSON as Ruby objects (Arrays / Hashes). However, some polishing is still pending in the code and documentation. If you're interested in knowing what makes this work, please feel free to have a look at the lib/io/importers/json.rb file in this Pull Request. Also, have a look at this screenshot from the documentation to learn more about the arguments that this method takes.

JSON Importer params
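
As a rough illustration of how one entry point can accept all four kinds of input, here is a hedged sketch. `load_json` is a hypothetical helper and not the actual code from the Pull Request; it assumes Ruby 2.5+ for `URI.open`, and the remote branch is not exercised in the usage lines.

```ruby
require 'json'
require 'open-uri'
require 'tempfile'

# Hypothetical helper (not Daru::IO's actual code): normalise the
# supported kinds of input into plain Ruby Hashes / Arrays.
def load_json(source)
  case source
  when Hash, Array
    source                                # already Ruby objects
  when %r{\Ahttps?://}
    JSON.parse(URI.open(source).read)     # remote JSON file
  when String
    if File.file?(source)
      JSON.parse(File.read(source))       # local JSON file
    else
      JSON.parse(source)                  # raw JSON string
    end
  else
    raise ArgumentError, "unsupported JSON source: #{source.class}"
  end
end

puts load_json('{"a": [1, 2]}').inspect
puts load_json([{ 'a' => 1 }]).inspect

# A throwaway local file, created only for this demonstration.
Tempfile.create(['sample', '.json']) do |file|
  file.write('{"from": "file"}')
  file.flush
  puts load_json(file.path).inspect
end
```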


Discussion: A faster CSV Importer

From this Issue, I came to know that the existing CSV Importer is rather slow, especially when the CSV file is really huge. Having been asked by my mentor Sameer to look for better alternatives for faster CSV imports, I tested quite a few Ruby gems that parse CSV files. Here are the benchmarks for the various gems -

  
  #! Benchmark results for CSV Importer

                        user     system      total        real
  existing stdlib  99.380000   1.290000 100.670000 (142.255080)
  modified stdlib  17.340000   0.470000  17.810000 ( 40.721766)
  smartercsv       65.400000   1.000000  66.400000 (116.842460)
  fastcsv           7.660000   0.380000   8.040000 ( 13.105506)
  rcsv              5.850000   0.210000   6.060000 (  7.756945)
  
  
  #! Importing a huge CSV file into Daru::DataFrame with rcsv gem
  require 'daru'
  require 'rcsv'
  df  = Daru::DataFrame.rows Rcsv.parse(File.open('path/to/water.csv'))
  
  
  #! Importing a huge CSV file into Daru::DataFrame with fastcsv gem
  require 'daru'
  require 'fastcsv'
  all = []
  File.open('path/to/water.csv') { |f| FastCSV.raw_parse(f) { |row| all.push row } }
  df = Daru::DataFrame.rows all[1..-1], order: all[0]
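
The user / system / total / real columns in the table above are the layout printed by Ruby's standard `Benchmark` module. As a minimal sketch of how such a comparison can be set up, the snippet below times two stdlib `CSV` entry points on a small generated file. It is not the benchmark script behind the numbers above: the labels, the file and its size are all assumptions, and the third-party gems are left out so that it runs with the standard library alone.

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Build a small throwaway CSV so the sketch is self-contained;
# the real benchmark used a much larger file.
file = Tempfile.new(['sample', '.csv'])
file.write("id,name,value\n")
5_000.times { |i| file.write("#{i},name_#{i},#{i * 0.5}\n") }
file.close

results = {}
Benchmark.bm(16) do |bench|
  # Read the whole file into memory in one call.
  bench.report('CSV.read') do
    results[:read] = CSV.read(file.path)
  end

  # Stream the file row by row.
  bench.report('CSV.foreach') do
    results[:foreach] = []
    CSV.foreach(file.path) { |row| results[:foreach] << row }
  end
end

file.unlink
```

Further parsers can be compared by adding one `bench.report` block per candidate, each reading the same file.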
  

Yes, going by wall-clock time, fastcsv and rcsv come out to be roughly 11x and 18x faster than the existing method, respectively. Depending on the outcome of further discussions on this issue's thread, Daru::IO may well feature a faster CSV Importer in the near future, with support for fastcsv and/or rcsv.
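
The speedups implied by the table can be checked with a few lines of Ruby. The timings below are the wall-clock ("real") values copied from the benchmark results; `speedups_over` is a hypothetical helper written just for this check.

```ruby
# Wall-clock ("real") seconds copied from the benchmark table above.
REAL_SECONDS = {
  'existing stdlib' => 142.255080,
  'modified stdlib' => 40.721766,
  'smartercsv'      => 116.842460,
  'fastcsv'         => 13.105506,
  'rcsv'            => 7.756945
}.freeze

# Hypothetical helper: how many times faster each entry is than
# the chosen baseline, rounded to one decimal place.
def speedups_over(baseline_name, timings)
  baseline = timings.fetch(baseline_name)
  timings.transform_values { |seconds| (baseline / seconds).round(1) }
end

speedups_over('existing stdlib', REAL_SECONDS).each do |name, factor|
  puts format('%-16s %5.1fx', name, factor)
end
```

By this measure, fastcsv is about 10.9x and rcsv about 18.3x faster than the existing stdlib-based importer.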