In my first article on getting started with the Groovy programming language, I left off with an example of reading a CSV file in Groovy. In this article, I’m going to move to a more idiomatic Groovy style (make it groovier, as some would say), cover the use of Groovy maps as lookup tables, and finish up by using maps to calculate some results.
First things first—here is the final example from the last article, in more idiomatic Groovy:
    import com.opencsv.CSVReader

    def mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"
    new File(mdCountryCSV).withReader { reader ->
        def csvReader = new CSVReader(reader)
        def fNam = csvReader.readNext()
        csvReader.each { fVal ->
            def valByNam = [fNam,fVal].transpose().collectEntries()
            println valByNam
        }
    }

Save this code snippet as ex04.groovy in the same directory as the first three examples and the data you downloaded from the World Bank, and run it. What do you see?
Some explanations are in order.
First, just to make the examples more readable, I’ve shortened my long variable names.
Second, Groovy takes a different view toward type checking than Java. As Burt Beckwith describes in his excellent book on Grails (and Groovy), “Programming Grails,” Groovy supports optional typing, allowing the user to control typing or leave it up to the language to figure out at run-time by using the def keyword to declare variables. I will add to Burt’s excellent exposition that old adage, “with great power comes great responsibility.” Leaving type decisions up to the language can introduce difficult-to-find bugs, and in any case, puts off until run-time the detection of any typing errors. But this kind of language design decision also means Groovy can provide not only more concise code (less to read can mean less to debug), but also some interesting dynamic capabilities related to materializing behavior at run-time.
Third, the already-concise code used to copy the array of column names and corresponding array of column values provided by csvReader for each line of the file has been replaced with some of Groovy’s nice methods for handling collections and maps:
- [fNam, fVal] creates a new list composed of two elements: the array of field names and the array of field values on the current line.
- transpose() converts this into a new list where each element is a pair of [name,value] for each field.
- collectEntries() turns the list of pairs of [name,value] into a map where the keys are the field names and the values are the field values.
Finally, the line println valByNam just dumps the map generated by the above.
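The three steps can be tried in isolation. The sketch below is only an illustration: it uses plain lists with made-up values in place of the arrays that csvReader returns, and the asserts show what each intermediate result looks like.

```groovy
// Illustrative header and row, standing in for what CSVReader returns
def fNam = ["Country Code", "Region", "IncomeGroup"]
def fVal = ["ABW", "Latin America & Caribbean", "High income"]

// transpose() pairs up the elements found at the same position in each list
def pairs = [fNam, fVal].transpose()
assert pairs == [["Country Code", "ABW"],
                 ["Region", "Latin America & Caribbean"],
                 ["IncomeGroup", "High income"]]

// collectEntries() treats each two-element list as a key and its value
def valByNam = pairs.collectEntries()
assert valByNam == ["Country Code": "ABW",
                    "Region": "Latin America & Caribbean",
                    "IncomeGroup": "High income"]
```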
Ok, enough explaining. The above code is only marginally interesting because, really, who needs to re-format the contents of the country metadata as maps / key-value pairs? Let’s do something with that data!
Creating lookup tables
One thing we see in the country metadata is that the World Bank has a system for classifying countries according to their income. This column is called IncomeGroup (no space) in the file. As an example of linking two separate files together, let’s:
- Create a lookup table of country code versus income group from the country metadata, which we’ll call iGLU;
- Use that lookup table to calculate population growth rates by income group from the population data file.
To do this, we first need to declare a lookup table and populate it. We’ll repurpose the above code:
    import com.opencsv.CSVReader

    def iGLU = [:] // income group lookup table
    def mdCountryCSV = "Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2.csv"
    new File(mdCountryCSV).withReader { reader ->
        def csvReader = new CSVReader(reader)
        def fNam = csvReader.readNext()
        csvReader.each { fVal ->
            def valByNam = [fNam,fVal].transpose().collectEntries()
            iGLU[valByNam."Country Code"] = valByNam.IncomeGroup
        }
    }
    println iGLU

Save this code as ex05.groovy and run it.
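The map syntax this script relies on can also be tried on its own. The following is a minimal sketch with illustrative entries rather than data read from the file; the variable names row, key and typed are introduced here only for the example.

```groovy
// Illustrative entries; the real table is filled from the metadata file
def iGLU = [:]                      // empty map (a LinkedHashMap by default)
iGLU["CAN"] = "High income"         // square-bracket notation
iGLU.MEX = "Upper middle income"    // dot notation, no quotes needed

def row = ["Country Code": "CAN", IncomeGroup: "High income"]
assert row.IncomeGroup == "High income"
assert row."Country Code" == "CAN"  // quotes needed because of the blank

// When the key is held in a variable, square brackets are the simplest choice
def key = "Country Code"
assert row[key] == "CAN"
assert row.key == null              // dot notation looks for the literal key "key"

// A more specific, typed alternative to [:]
Map<String, String> typed = new HashMap<String, String>()
typed.put("CAN", "High income")
assert typed["CAN"] == "High income"
```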
The notation [:] creates an empty map. We could be more specific, for example, by defining a hashtable-based map with String keys and String values. If you’re not too familiar with maps and hash tables, the Groovy documentation, section 2.2, provides more details.
Note also the use of dot-notation to access the map; in particular, quotes around IncomeGroup are not necessary, and they are needed around Country Code only because of its embedded blank. However, if what is to the right of the dot is a variable rather than a constant, we must either surround it with parentheses or go back to the square brackets and leave out the dot.
Creating and initializing accumulators
In order to calculate the population growth by income group, we’re going to have to create accumulators. Since we’re using the classifications for income group supplied in the country metadata, it makes sense if our accumulators are defined as maps. Also, we’ll need two: one for population in the start year, one for population in the end year. Finally, we need to decide whether to initialize these accumulators at the start or every time a new income group is encountered while reading the population data. For purposes of this exercise, we’re going to initialize first:
    def iGSet = iGLU.values() as Set
    def pop1 = [:] // population by income group in start year
    def pop2 = [:] // population by income group in end year
    iGSet.each { ig ->
        pop1[ig] = 0l
        pop2[ig] = 0l
    }

Append this to the end of ex05.groovy (you can delete the println).
The first line of code gets all the values from the iGLU income group lookup map and converts them to a Set, which is a kind of collection where each element occurs at most once. We’ll use this iGSet to iterate over the unique values of income group as defined in the country metadata.
Then we define our two accumulators: pop1, which we use to accumulate totals by income group in the start year, and pop2, which we use for the end year.
The next four lines initialize pop1 and pop2 to zero (long; that’s what the 0l means). This is a good moment to note that Groovy’s designers decided that an unqualified integer in Groovy source code is an Integer (widened to Long or BigInteger only when the value does not fit), while an unqualified decimal number is of type BigDecimal. I tend to avoid using the (software-implemented) Big types, which is why the literals here carry an explicit suffix.
Processing another file using lookups and accumulating
At this point, we can read the population data and accumulate it:
    def populationCSV = "API_SP.POP.TOTL_DS2_en_csv_v2.csv"
    new File(populationCSV).withReader { reader ->
        def csvReader = new CSVReader(reader)
        def fNam
        (1..5).each { fNam = csvReader.readNext() }
        csvReader.each { fVal ->
            def valByNam = [fNam,fVal].transpose().collectEntries()
            def country = valByNam."Country Code"
            if (country && iGLU.containsKey(country)) {
                pop1[iGLU[country]] += Long.parseLong(valByNam."2014" ?: "0")
                pop2[iGLU[country]] += Long.parseLong(valByNam."2015" ?: "0")
            }
        }
    }

Append this to the end of ex05.groovy.
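The (1..5).each call in this snippet reads five lines and keeps only the last one. Here is a rough sketch of that idiom, assuming an in-memory list of made-up lines in place of the CSV reader; the names lines, header and range are hypothetical.

```groovy
// Made-up lines standing in for the file: four title lines, then the headers
def lines = ["title 1", "title 2", "title 3", "title 4",
             "Country Name,Country Code,2014,2015",
             "Aruba,ABW,1,2"].iterator()

def header
(1..5).each { header = lines.next() }   // the closure runs five times

// After five reads, header holds the fifth line
assert header == "Country Name,Country Code,2014,2015"
// The next read is the first data row
assert lines.next() == "Aruba,ABW,1,2"

// The range itself behaves like the list 1, 2, 3, 4, 5
def range = 1..5
assert range.toList() == [1, 2, 3, 4, 5]
```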
This is similar to the handling of the metadata file (it’s all just CSV after all), but it’s worth noting that the population CSV is not well formed: it has four title lines prior to the column heading line. Therefore, defining the list of column names is more complex. The code (1..5) uses the Groovy range to generate a list of 5 elements 1, 2, 3, 4, 5, and the each executes the closure once for each element. This is equivalent in effect to a C or Java (or Groovy!) for (int i = 0; i < 5; i++) {...} but doesn’t require us to define a spurious variable. Through this code, fNam is eventually set to the fifth line, that is, the column headers.
The if statement first checks that country is non-null and non-empty and then makes sure it’s found among the keys in the income group lookup table. This would be a good moment to take a break and study some Groovy semantics, especially the meaning of truth in Groovy (§5 on that page). Basically, non-null, non-empty, non-zero values are true.
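A few asserts make the truth rules concrete. This is a small sketch with an illustrative one-entry lookup table; the closure name known is hypothetical and simply wraps the same test the if statement performs. It also shows the ?: "0" default used when parsing the population columns, which rests on the same rules.

```groovy
// Illustrative lookup table
def iGLU = [CAN: "High income"]

assert !null          // null is false
assert !""            // an empty string is false
assert !0             // zero is false
assert ![]            // an empty list is false
assert "CAN"          // a non-empty string is true

// The same guard as in the script: skip empty or unknown country codes
def known = { country -> country && iGLU.containsKey(country) }
assert known("CAN")
assert !known("")
assert !known(null)
assert !known("XXX")

// The ?: operator relies on the same rule to supply a default value
def blank = ""
assert Long.parseLong(blank ?: "0") == 0L
assert Long.parseLong("35" ?: "0") == 35L
```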
Finally, we’re using Long values at this point because there are too many people to be enumerated in a 32-bit integer.
Summarizing results
Now that we have the data accumulated, the one remaining step is to write it out:
    iGSet.each { ig ->
        printf "group %s growth rate %.2f %%\n", (ig ?: "Unspecified"), 100d * (pop2[ig] - pop1[ig]) / pop1[ig]
    }

Append this to the end of ex05.groovy. At this point, you have a full program and can run it, producing the following output:

    group High income growth rate 0.56 %
    group Low income growth rate 2.73 %
    group Upper middle income growth rate 0.78 %
    group Unspecified growth rate 1.27 %
    group Lower middle income growth rate 1.46 %

Note that (ig ?: "Unspecified") above uses the Groovy Elvis operator, which is a short form of (ig ? ig : "Unspecified"), and a great example of the DRY principle (Don’t Repeat Yourself). We’re using this construct so that every income group has a text string printed, even if one wasn’t specified in the original data.
As a data analyst, my purpose is not to interpret these results; however, the overall process in which I take some data (in this case, publicly available) and mine it for relationships is precisely what a lot of my work is about. With this example, you can see why Groovy’s concise and powerful collection and map abstractions, coupled with all the Java libraries out there, make it an indispensable tool for me.
Where to Next?
In the next installment, we will take a look at other structured data sources besides CSV files.