Crunching Data with Apple’s TabularData Framework

this image is the first one of the article Crunching Data with Apple's TabularData Framework

Hallo allemaal, Leo hier.

Today we will explore a very cool and interesting Apple framework called TabularData. This framework is all about importing, organizing, and preparing a table of data to train a machine learning model. You can do a lot of neat data transformations with it, from the import moment until writing a new CSV file.

Before beginning, this will be one of the big posts so get a cup of tea or coffee and relax while discovering new APIs. Also, this API is only available for iOS 15 and up, so maybe will take some time until we all get our hands on that.

That said, let’s begin to explore this marvelous shiny new API from Apple.

Let’s code! But first…

 

The Painting of The Day

The painting I chose was a 1739 piece called Le Dejeuner from Francois Boucher.

Possibly the most popular 18th-century artist, Francois Boucher was a French painter in the Rococo style. When he was 17 years old, Boucher was apprenticed for a short time to the French painter Francois Lemoyne, and then to the engraver Jean-Francois Cars. After three years of work and artistic study, Boucher won the Grand Prix de Rome, a scholarship for artistic study, allowing him to travel to Italy and further his study in art.

I chose this painting because the family is eating breakfast around the table, and we are going to talk about a lot of tables today.

 

The Problem – Crunching Data with Apple’s TabularData Framework

You were asked to handle various CSV files, and crunch all the data before starting a machine learning process.

Let’s prepare everything that we need.

In theory, you can follow up on this article using Playgrounds. For some reason, the TabularData framework didn’t work in my Playgrounds and I had not enough patience to make it work locally so I just started a new app and did everything inside the viewDidLoad function.

That said, just start a new iOS project using storyboards.

The data we will work on will be a very interesting CSV file containing all the Oscar Male winnings from the beginning of Oscar nominations until now. You can access that CSV file here.

Before starting crunching Oscar’s winner data we will start with a much simpler one, a list of friends and their ages. After we heat the coding engines we can go to much more complex use cases with the Oscar data.

 

TabularData Framework Initialisation

What TabularData is used for? So the straightforward explanation is that we can pick up JSON or CSV and transform it into crunchable data. Ok, I’ve said Crunch data a lot now, but what is it?

Crunch data is the key initial step required to prepare big volumes of raw data for analysis. Including removing unwanted information and formatting, mapping data into the required format, and making it structured for analysis by other applications.

To start crunching data we need to import the TabularData framework:

import TabularData

Let’s initialize the DataFrame. It is a collection that arranges data in rows and columns. Our first example is very simple, just a list of friends with their id, name, and age.

Also, remember I’m doing this in a recently created iOS app storyboard project, so just put the code below in your ViewController class and let’s start coding.

override func viewDidLoad() {
    super.viewDidLoad()
    // init column by column
    var dataFrame = DataFrame() // Mark 1
    let idColumn = Column(name: "id", contents: [1, 2, 3]) // Mark 2
    let nameColumn = Column(name: "name", contents: ["Mike", "Alan", "Pepjin"]) // Mark 2
    let ageColumn = Column(name: "Age", contents: [34, 30, 23]) // Mark 2
    
    dataFrame.append(column: idColumn) // Mark 3
    dataFrame.append(column: nameColumn) // Mark 3
    dataFrame.append(column: ageColumn) // Mark 3
    print(dataFrame)
}

In Mark 1 we are creating an empty DataFrame. This is useful if you want to create your in-memory DataFrame. You can use it when you download some sort of data and want to crunch into other things before storing it.

The Mark 2 you are creating the columns. The columns are the same as a relational database. You have a title and the values. In this case, we are creating three rows for each column. One for each friend.

And finally, we are printing it. Getting the result below:

tabularData framework image 1

One of the best advantages of using the TabularData framework is that we have this beautiful printing in the console. It has the Column names followed by their <types> and also has the number of rows and columns in the last line.

This was the easiest way to initialize the DataFrame, now we are going to analyze a more complex case, Oscar’s data.

 

Initializing from a CSV file with options

I copy/pasted the info from the source into my project, but you can download it and just transform it into DataFrame directly. It’s your call. I also removed empty spaces from the file, so they could be automatically translated into integers what could be automatically translated.

tabularData framework image 2 tutorial

Now instead we create the DataFrame manually we will use the initializer to create it for us. It’s time to read from the CSV file.

let options = CSVReadingOptions(hasHeaderRow: true, delimiter: ",")
guard let fileUrl = Bundle.main.url(forResource: "male_oscar_age", withExtension: "csv") else { return }

var dataFrame2 = try! DataFrame(contentsOfCSVFile: fileUrl, options: options)
print("\(dataFrame2)")

To read something you need to set the file URL and you can also have options. The `CSVReadingOptions` are the way to set expectations about reading the CSV file. I recommend you explore the CSVReadingOptions because there is a lot to see there.

See below the results:

tabularData framework image 3 example

You can notice that we have 89 rows but are chopped to only 20, but what if we want to show more? We have the FormattingOptions to do that! You can set maximum line width, cell width, row count, and even if your print will contain the column type or not.

let formatingOptions = FormattingOptions(maximumLineWidth: 250, maximumCellWidth: 250, maximumRowCount: 30, includesColumnTypes: true)
print(dataFrame2.description(options: formatingOptions))

Resulting in:

tabularData image 5

 

Initialize TabularData Framework with Partial Loading

Now let’s imagine that you have a really big CSV file with thousands of rows and hundreds of columns, but you don’t need all of that data. Good to know that the DataFrame framework has some options to optimize your importing process.

You can partial loading only the first 10 rows:

let dataFrame3 = try! DataFrame(contentsOfCSVFile: fileUrl, rows: 0..<10, options: options)
print("\(dataFrame3)")

Resulting in:

tabularData framework image 6 example

Great isn’t it? But there’s more, you can also partial load the columns:

let dataFrame4 = try! DataFrame(contentsOfCSVFile: fileUrl, columns: ["Name","Age"], rows: 0..<10, options: options)
print("\(dataFrame4)")

Resulting in:

tabularData framework image 7 example

And to finish partial loading to speed up the importing process, you can define types for each column.

This is a good optimization tip because by default the framework every time tries to import from Date if can’t do that it tries to map to Int and if that also fails it imports as String.

The only thing to keep in mind is that the raw data maybe can’t be parsed into the selected data type and this will throw an Error.

See the code below:

let dataFrame5 = try! DataFrame(contentsOfCSVFile: fileUrl, columns: ["Name","Year"], rows: 0..<10, types: ["Name":.string, "Year":.integer], options: options)
print("\(dataFrame5)")

tabularData framework image 8 example

 

Writing CSV Results

You can of course write any tabular data to a CSV file easily. You just need to pass an URL to the DataFrame write CSV function.

See below:

guard let myCsvUrl = Bundle.main.url(forResource: "MyCsv", withExtension: "csv") else { return }

try! dataFrame.writeCSV(to: myCsvUrl)

var dataFrame6 = try! DataFrame(contentsOfCSVFile: myCsvUrl)
print(dataFrame6)

I had a blank file in the bundle

And you can also read what you write:

tabularData framework image 9 example

This finish this section about how to initialize a DataFrame, in other words, import your data to the framework. We learned that is very flexible how you can import data and also how to optimize that process already telling the types to parse.

 

Select Data from DataFrame in TabularData

Now the crunching time begins and we will start by selecting rows. We will use the first created DataFrame as an example, the friends DataFrame.

 

Selecting Rows

To select a row or a set of rows you can do various ways. Let’s start using the subscript function:

print(friendsDataFrame[row: 2])
print(friendsDataFrame[row: 3])

The interesting part of it is that if the row doesn’t exist it just returns an empty DataFrame as the image below shows:

tabularData framework image 10 example

You can also use the half-open range operator to get rows from an index until the end:

print(friendsDataFrame[1...])

Resulting in:

tabularData framework image 11 example

You can filter rows based on closure and a column:

let filtered = friendsDataFrame.filter(on: "Age", Int.self, { $0 == 34 || $0 == 23 })
print(filtered)

Observe the image below with the filtered result:

tabularData framework image 12 example

Weirdly, the row number is skipping one number. This is because the returned DataFrame is a discontiguous Data Frame Slice. And to traverse it we need the use indexes and a few algorithms:

var index = filtered.rows.startIndex
while index != filtered.rows.endIndex {
    print(filtered.rows.first(where: {$0.index == index})!)
    index = filtered.rows.index(after: index)
}

Printing:

tabularData framework image 13 example

 

Selecting columns

You can select a column by its number (remember this is zero-based) or by its name:

print(friendsDataFrame[column: 1])
print(friendsDataFrame.selecting(columnNames: "Age"))

Resulting in:

tabularData framework image 14 example

You can also remove columns that you don’t need anymore, but remember this is a mutating operation. This means that when you remove the column it will remove forever from the DataFrame.

print(friendsDataFrame.removeColumn("id"))
print(friendsDataFrame)

Resulting in:

final example of removing from tabularData column one less column

 

Various DataFrame Operations in TabularData Framework

Now let’s explore how can you transform a column into another data type, how can you print statistics about your columns, how can you sort your rows by a column, how can you group and join data by a column, and finally how to combine columns into a new column.

 

Mapping Data

You can decode a column that is a JSON to a data type with a JSON decoder. This way the column type will be transformed into the new data type you decoded. The example below the column was the type String, and after that became a type Human:

try? dataFrame6.decode(Human.self, inColumn: "name", using: JSONDecoder())

You can also apply a closure to a column to transform its data. In the example below we will transform the column age that is an Int into a String:

friendsDataFrame.transformColumn("Age") { (age: Int) -> String in
    if age <= 30 {
        return "Jong"
    } else {
        return "Oud"
    }
}

Now the new friends DataFrame is:

 

Creating Column Summaries

With the `DataFrame` object you can get statistics from a column. There are two return possibilities: the numeric summary and the standard summary.

The numeric summary is when you try to apply summary to a numeric column, this will return a `DataFrame` containing a lot of statistic data such as: mean, standard deviation, minimum value, maximum value, median with Q1 and Q3, mode, unique items count, nil count, and item count.

You can easily do this with Oscar’s actor age data:

dataFrame2.summary(ofColumns: 2).columns.forEach({ print($0) })

Resulting in:

summary tabularData image 1 example

summary tabularData image 2 example

summary tabularData image 3 guide

And you can also summarise data that is not numeric, like actors’ names from Oscar’s raw data. It will return all the columns from the numeric summary but the statistics-related ones are with nil values.

Test yourself:

dataFrame2.summary(ofColumns: 3).columns.forEach({ print($0) })

And you can summarize all the columns at once:

 

print(dataFrame2.summary())

 

Sorting

You can sort your DataFrame by one or more columns. You can also set if is in ascending or descending order. Or can use a closure to make that comparison.

print(dataFrame2.sorted(on: "Age",order: .descending))

Resulting in:

more operations on tabularData in Swift image example

Above we can check that the oldest man that won an Oscar prize had 76 years old.

 

Grouping Data

You can also group data by column. With this, you can answer the question: which ages are more likely for a man to win an Oscar?

print(dataFrame2.grouped(by: "Age").counts(order: .descending).description(options: formatingOptions))

Resulting in:

this is the count by age in tabularData in Swift tutorial

The answer to our question is that are more likely to win an Oscar if you have between 36 and 49 years old. You can also check that in the summary section of this article. The median is 42 years with Q1 at 38 years and Q3 at 49 years.

Join Data from different DataFrames in TabularData Framework

Let’s pick up our friends’ data frame and the Oscar’s DataFrame and see if any of my friends have the same age as any Oscar winner. You just need to say in which column you want to have them join. The default join is the inner join.

print(friendsDataFrame.joined(oscarsDataFrame, on: "Age"))

Check the result below:

image tabularData in Swift 9 names oscars

To avoid column name conflicts the joined function generates a new DataFrame with left and right before each column from the source DataFrames. We can analyze that Mike has the same age as three other Oscar winners and Alan same age as two Oscar winners.

If you want to do for example a left join, this means we will have all columns from the first DataFrame necessarily, but only the matched ones from the second DataFrame you can do this:

print(dataFrame.joined(dataFrame2, on: "Age", kind: .left))

Now Pepijn appears with nil values for the right column:

tabular data example tutorial in Swift iOS

 

Combining columns from the same data source

You can create new columns with pre-existing ones. Imagine that you want to create a header title for your view, so you need to get the name of the actor in the movie he played. You can use a closure and the combinColumns function to do that:

dataFrame2.combineColumns("Name", "Movie", into: "HeaderTitle", transform: { (name: String?, movie: String?) -> String? in
    guard let name = name,
          let movie = movie else {
        return nil
    }
    
    return "\(name) got an oscar for \(movie)"
})
print(dataFrame2[..<5])

Resulting in:

combining columns operation on TabularData in Swift

Also in the example above if you pay attention, we are just printing the first five rows with the subscript using the half-open range operator.

And that’s all exploration for today! Of course, there’s a lot more to see and I hope you could learn something new as I could learn.

 

Best Practices

Always declare types when loading data from user-provided CSV files. Don’t use this:

let dataframe = try DataFrame(contentsOfCSVFile: url)

prefer this:

DataFrame(contentsOfCSVFile: fileUrl, columns: ["Name","Year"], rows: 0..<10, types: ["Name":.string, "Year":.integer], options: options)

Using the second way the filter and sorting are natively optimized.

Also setting a type for each column is especially important when parsing dates, because the Date column may fail silently and parse into a String type. Forcing it to be Date type with a specified date parser will fail fast (and you can handle it) if the file isn’t in the right format and you won’t have further problems.

The most common loading errors are: Failed to parse, the wrong number of columns, bad encoding, or the classic misplaced quote. So pay attention to how your raw data is stored.

Always try to lazy load Date types because they are more complex than other types, or even not load them at all if you are not going to use them.

When sorting or joining always try to do that in basic Swift types like Int or String. And by the way, if you are going to group by more than 1 column you could improve the performance by first combining both columns in a single basic type column. For example, you want to group By Age and City, create a combined column called Age-City with both data, and then group by the new Column.

 

Summary – Crunching Data with Apple’s TabularData Framework

Today was one of the big articles exploring an Apple framework. We started with how to initialize a TabularData, we learned how to transform and augment data to fit your needs, and also performance tips for your apps!

The further steps are creating a real machine model. You can follow through with this Apple article.

That’s all my people, I hope you liked reading this article as much as I enjoyed writing it. If you want to support this blog you can Buy Me a Coffee or leave a comment saying hello. You can also sponsor posts and I’m open to freelance writing! You can reach me on LinkedIn or Twitter and send me an e-mail through the contact page.

Thanks for reading and… That’s all folks.

Credits:

title image

Share this post:

Related posts

Featured

Sponsor