Data Science with Scala and Smile¶

Monte Carlo Simulation¶

This example shows how to use the Monte Carlo method to estimate the area of a circle.

First, we’ll look at the code for generating a data file for the Monte Carlo simulation, given the desired number of darts. The resulting file consists of the x- and y-coordinates of each dart along with a boolean flag indicating whether or not the dart landed inside the circle.

//> using scala 3.3.6
//> using dep "com.lihaoyi::mainargs:0.7.6"

import mainargs.{main, arg, ParserForMethods}
import java.nio.file.{Path, Files, StandardOpenOption}
import java.nio.charset.StandardCharsets

object GenerateDarts:

  def main(args: Array[String]): Unit = ParserForMethods(this).runOrExit(args.toIndexedSeq)

  @main(doc = "Generate random darts for Monte Carlo simulation")
  def run(
    @arg(short = 'n', doc = "Number of darts to generate") n: Int = 1000000,
    @arg(short = 'o', doc = "Output CSV file") output: String = "darts.csv"
  ): Unit =
    
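    val random = scala.util.Random

    // Open the file once and stream rows through a buffered writer;
    // reopening it to append each row would cost one open/close per dart.
    val writer = Files.newBufferedWriter(Path.of(output), StandardCharsets.UTF_8)
    try
      writer.write("x,y,inside\n")
      for _ <- 1 to n do
        // Uniform coordinates in [-1, 1), the square that encloses the unit circle
        val x = random.nextDouble() * 2 - 1
        val y = random.nextDouble() * 2 - 1
        val inside = x * x + y * y <= 1
        writer.write(s"$x,$y,$inside\n")
    finally writer.close()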

    println(s"Generated $n darts and saved to $output")

Next, we’ll look at the code for estimating the area of the circle using Smile, a library for statistical computing and machine learning that can be used from Scala. The code loads the CSV file into a data frame and estimates the area from the ratio of darts that landed inside the circle to the total number of darts. Because the darts are spread uniformly over the square from -1 to 1 on each axis, which has area 4, that ratio is multiplied by 4.

//> using scala 3.3.6
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "com.github.haifengl::smile-scala:4.3.0"
//> using dep "org.slf4j:slf4j-simple:2.0.17"

import mainargs.{main, arg, ParserForMethods}
import java.nio.file.{Path, Files}
import java.nio.charset.StandardCharsets
import smile.data.DataFrame
import smile.io.Read
import org.apache.commons.csv.CSVFormat

object EstimateArea:

  def main(args: Array[String]): Unit = ParserForMethods(this).runOrExit(args.toIndexedSeq)

  @main(doc = "Monte Carlo area estimator") 
  def run(
    @arg(short = 'i', doc = "Input CSV file of darts (default: darts.csv)") input: String = "darts.csv",
    @arg(short = 'o', doc = "Optional output file to write the estimated area") output: Option[String] = None
  ): Unit =

    if !Files.exists(Path.of(input)) then
      println(s"Input file '${input}' does not exist.")
      sys.exit(1)

    val df: DataFrame = Read.csv(input,
      CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get()
    )

    val totalCount = df.nrow()

    if totalCount == 0 then
      println(s"Input file '${input}' is empty.")
      sys.exit(1)

    val insideCount = df.stream().filter(row => row.getBoolean("inside")).count()
    val estimatedArea = 4.0 * insideCount / totalCount
    val resultText = f"Estimated area of unit circle: $estimatedArea%.5f using $insideCount/$totalCount darts"

    println(resultText)

    output.foreach: outputPath =>
      Files.writeString(Path.of(outputPath), resultText + "\n", StandardCharsets.UTF_8)
      println(s"Result also written to ${outputPath}")
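
How close the estimate comes to π depends on the number of darts. The following is a minimal in-memory sketch rather than part of the scripts above: it skips the CSV file and Smile entirely, and the names estimateArea and standardError, as well as the seed 42, are assumptions made for illustration. It prints the estimate next to its binomial standard error, 4 * sqrt(p * (1 - p) / n), for increasing n.

```scala
import scala.util.Random

/** Estimates the unit circle's area from n uniform darts in the square [-1, 1) x [-1, 1). */
def estimateArea(n: Int, rng: Random): Double =
  val inside = Iterator.fill(n) {
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    x * x + y * y <= 1
  }.count(identity)
  4.0 * inside / n

/** Standard error of the estimate: 4 * sqrt(p * (1 - p) / n), where p = area / 4. */
def standardError(area: Double, n: Int): Double =
  val p = area / 4.0
  4.0 * math.sqrt(p * (1 - p) / n)

@main def monteCarloSketch(): Unit =
  val rng = new Random(42) // fixed seed (an arbitrary choice) so that runs are repeatable
  for n <- Seq(1_000, 100_000, 1_000_000) do
    val area = estimateArea(n, rng)
    println(f"n=$n%8d  area=$area%.5f  stderr=${standardError(area, n)}%.5f")
```

Since the standard error shrinks with 1/sqrt(n), a hundredfold increase in darts buys roughly one extra decimal digit of accuracy.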

Going Between Java and Scala APIs¶

When using Java-centric APIs, such as Smile, we often need to convert between Java and Scala collections. This example shows how to convert between the two using the asScala and asJava methods from scala.jdk.CollectionConverters. A Java collection can be used from Scala as it is; asScala is needed only when we want Scala's collection operations on it, which we usually do when using Scala, while asJava is needed when a Java API expects a Java collection.

scala> import scala.jdk.CollectionConverters.*

scala> java.util.List.of("hello", "world")
val res0: java.util.List[String] = [hello, world]

scala> res0.asScala
val res1: scala.collection.mutable.Buffer[String] = Buffer(hello, world)

scala> res1.asJava
val res3: java.util.List[String] = [hello, world]

scala> List("hola", "mundo")
val res4: List[String] = List(hola, mundo)

scala> res4.asJava
val res5: java.util.List[String] = [hola, mundo]
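
One detail the session above does not show is that asScala wraps the underlying Java collection instead of copying it. The sketch below illustrates this, and also shows scala.jdk.StreamConverters for Java streams such as the one returned by a data frame's stream() method. The sample values and the name convertersSketch are made up for illustration.

```scala
import scala.jdk.CollectionConverters.*
import scala.jdk.StreamConverters.*

@main def convertersSketch(): Unit =
  // asScala returns a wrapper around the Java list, so writes show through.
  val javaList = new java.util.ArrayList[String](java.util.List.of("hello", "world"))
  val buffer = javaList.asScala
  buffer += "again"
  println(javaList) // [hello, world, again]

  // Copy into an immutable Scala List when an independent snapshot is wanted.
  val snapshot = javaList.asScala.toList
  javaList.clear()
  println(snapshot) // List(hello, world, again)

  // Java streams are converted with toScala and a target collection factory.
  val lengths = java.util.stream.Stream.of("hola", "mundo").toScala(List).map(_.length)
  println(lengths) // List(4, 5)
```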

Graffiti/311 Chicago Data Portal¶

This example introduces Scala + Smile through a data set from our hometown of Chicago, although the same approach could be applied to any city.

The Chicago Data Portal Graffiti/311 example provides a lightweight, scriptable toolkit for analyzing graffiti-related 311 service requests in the City of Chicago. Using Scala and command-line tools, the system allows users to inspect, filter, aggregate, and visualize graffiti complaint data efficiently—even on large files.

While graffiti is often, and wrongly, perceived as a merely cosmetic or quality-of-life issue, tracking and understanding patterns in graffiti reports can reveal deeper insights into urban infrastructure, neighborhood disinvestment, and resident engagement.

311 calls are more than complaints; they are a form of civic participation. By examining this data over time and across geographic regions, we gain valuable information about public responsiveness, spatial inequality, and where municipal services are (or aren’t) being delivered equitably to communities (a longstanding issue in many Chicago neighborhoods).

This demonstration application helps make that analysis transparent, reproducible, and accessible.

Requirements (Functional and Non-Functional)¶

Data Acquisition¶

The system must be able to:

Download the Chicago 311 graffiti removal dataset in CSV format from the City of Chicago’s open data portal.

Data Inspection¶

The system must be able to:

Load and parse the downloaded CSV file, treating the first line as a header.
Display the dataset schema (column names).
Preview a limited number of rows from the dataset.

Data Filtering¶

The system must be able to:

Filter requests based on:
- Service status (e.g., “Completed”, “Open”).
- A start date and/or end date range using the “Creation Date” column.
Limit the number of matching rows displayed.
Save the filtered dataset to a new CSV file.

Data Aggregation¶

The system must be able to:

Count the number of requests grouped by a specific column (e.g., “ZIP Code” or “Surface Type”).
Display the top N group values by count.

Data Visualization¶

The system must be able to:

Aggregate requests by month using the “Creation Date” column.
Generate a bar chart of graffiti removal trends over time.
Save the chart to a file (e.g., PNG), with no GUI dependencies.

Reproducibility & Automation¶

The system must:

Be scriptable and composable using CLI arguments (no hardcoded values).
Use lazy evaluation where possible to minimize memory usage on large files.
Avoid dependence on any GUI (especially for visualization or preview).
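
The lazy-evaluation requirement can be illustrated with a plain Scala Iterator, which is what the CSV parser's iterator becomes after asScala. The sketch below is an assumption-laden toy: the helper takeMatching and the synthetic status values are hypothetical stand-ins for real CSV records. It counts how many rows are actually pulled before the requested number of matches is found.

```scala
/** Returns the first `limit` rows matching `p`, plus how many rows were pulled to find them. */
def takeMatching[A](rows: Iterator[A], limit: Int)(p: A => Boolean): (List[A], Int) =
  var pulled = 0
  val counted = rows.map { row => pulled += 1; row }
  // filter and take are lazy; only toList drives the pipeline
  val matches = counted.filter(p).take(limit).toList
  (matches, pulled)

@main def lazySketch(): Unit =
  // A million synthetic status values stand in for CSV records (made-up data).
  val rows = Iterator.range(0, 1_000_000).map(i => if i % 3 == 0 then "Open" else "Completed")
  val (matches, pulled) = takeMatching(rows, limit = 5)(_ == "Open")
  println(s"Found ${matches.size} matches after pulling $pulled of 1000000 rows")
```

Because nothing is materialized up front, the pipeline stops as soon as the limit is reached, which is why previewing a few rows of a large file stays cheap.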

Implementation¶

Downloader¶

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"

import java.io.{BufferedInputStream, FileOutputStream}
import java.net.URI
import mainargs._

object FetchGraffitiData:
  val datasetURL = "https://data.cityofchicago.org/api/views/hec5-y4x5/rows.csv?accessType=DOWNLOAD"

  @main
  def run(
    @arg(name = "output", short = 'o', doc = "Path to save the downloaded CSV file")
    output: String = "311_graffiti.csv"
  ): Unit =
    // Using.resources closes both streams even if the transfer fails midway.
    scala.util.Using.resources(
      new BufferedInputStream(URI.create(datasetURL).toURL().openStream()),
      new FileOutputStream(output)
    ): (urlStream, fileOut) =>
      urlStream.transferTo(fileOut)

    println(s"Downloaded dataset to $output")

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)

The same file can also be fetched from the shell with curl, here into a dataset directory:

mkdir -p dataset
curl -L -o dataset/311_graffiti.csv "https://data.cityofchicago.org/api/views/hec5-y4x5/rows.csv?accessType=DOWNLOAD"

Loader¶

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "org.apache.commons:commons-csv:1.14.0"

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters.*
import org.apache.commons.csv.*
import mainargs._

object LoadGraffitiData:
  @main
  def run(
      @arg(name = "input", short = 'i') input: String,
      @arg(name = "limit", short = 'l') limit: Int = 5
  ): Unit =
    val reader =
      Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)

    val headers = parser.getHeaderNames.asScala
    println(s"Headers: ${headers.mkString(", ")}")

    val iter = parser.iterator().asScala
    println(s"\nFirst $limit rows:")
    iter
      .take(limit)
      .foreach: record =>
        val row = headers.map(h => s"$h=${record.get(h)}").mkString(", ")
        println(row)

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)
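
The loader above never closes its reader, which is harmless in a short-lived script but worth tightening in longer-running code. A minimal sketch of one option, scala.util.Using, follows. It reads plain lines from an in-memory string and does not parse CSV records (quoted fields that span lines would break it), and the name preview and the sample rows are hypothetical.

```scala
import java.io.{BufferedReader, StringReader}
import scala.jdk.CollectionConverters.*
import scala.util.Using

/** Returns the header line and up to `limit` data lines, closing the reader afterwards. */
def preview(open: => BufferedReader, limit: Int): (String, List[String]) =
  Using.resource(open) { reader =>
    val lines = reader.lines().iterator().asScala
    val header = if lines.hasNext then lines.next() else ""
    // toList forces the rows while the reader is still open
    (header, lines.take(limit).toList)
  }

@main def previewSketch(): Unit =
  // Made-up rows; a real call would pass Files.newBufferedReader(path) instead.
  val csv = "Creation Date,Status\n01/05/2018,Completed\n01/06/2018,Open\n01/07/2018,Open\n"
  val (header, rows) = preview(new BufferedReader(new StringReader(csv)), limit = 2)
  println(header)
  rows.foreach(println)
```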

Filtering¶

This script filters the requests by status and by a range of creation dates. It then either prints the first few matching rows or, with --count-only, just the number of matches.

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "org.apache.commons:commons-csv:1.14.0"

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.jdk.CollectionConverters.*
import org.apache.commons.csv.*
import mainargs._

object FilterGraffitiData:
  @main
  def run(
    @arg(name = "input", short = 'i') input: String,
    @arg(name = "status", short = 's') status: String = "Completed",
    @arg(name = "start-date") startDate: String = "2025-01-01",
    @arg(name = "end-date") endDate: String = "2025-01-31",
    @arg(name = "limit", short = 'l') limit: Int = 5,
    @arg(name = "count-only", doc = "If set, only print number of matching rows") countOnly: Boolean = false
  ): Unit =

    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
    val headers = parser.getHeaderNames.asScala
    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
    // Parse the ISO-8601 (yyyy-MM-dd) bounds once rather than once per record.
    val start = LocalDate.parse(startDate)
    val end = LocalDate.parse(endDate)

    val filtered = parser.iterator().asScala.filter: record =>
      val rowStatus = record.get("Status")
      val rowDate = LocalDate.parse(record.get("Creation Date"), fmt)

      val statusOK = rowStatus == status
      val startOK = !rowDate.isBefore(start)
      val endOK = !rowDate.isAfter(end)

      statusOK && startOK && endOK

    if countOnly then
      val total = filtered.size
      println(s"$total matching rows.")
    else
      val taken = filtered.take(limit).toList
      println(s"Showing ${taken.size} matching rows:")
      taken.foreach: record =>
        val row = headers.map(h => s"$h=${record.get(h)}").mkString(", ")
        println(row)

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)
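
The filter handles two date formats: the command-line bounds are ISO dates (yyyy-MM-dd), while the “Creation Date” column is parsed as MM/dd/yyyy. The sketch below isolates that comparison so the inclusive bounds are easy to see. The names portalFormat and inRange are hypothetical, and optional bounds are an assumption borrowed from the save script later in this section.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The column is parsed as MM/dd/yyyy; LocalDate.parse(s) alone expects ISO yyyy-MM-dd.
val portalFormat = DateTimeFormatter.ofPattern("MM/dd/yyyy")

/** True when `creationDate` falls within [start, end]; a missing bound is unbounded. */
def inRange(creationDate: String, start: Option[LocalDate], end: Option[LocalDate]): Boolean =
  val date = LocalDate.parse(creationDate, portalFormat)
  start.forall(s => !date.isBefore(s)) && end.forall(e => !date.isAfter(e))

@main def rangeSketch(): Unit =
  val start = Some(LocalDate.parse("2025-01-01"))
  val end = Some(LocalDate.parse("2025-01-31"))
  println(inRange("01/31/2025", start, end))  // true: the end date is inclusive
  println(inRange("02/01/2025", start, end))  // false
  println(inRange("02/01/2025", start, None)) // true: no upper bound
```

Writing the checks as "not before" and "not after" is what makes both ends of the range inclusive.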

Aggregation¶

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "org.apache.commons:commons-csv:1.14.0"

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters.*
import org.apache.commons.csv.*
import mainargs._

object AggregateGraffitiData:
  @main
  def run(
    @arg(name = "input", short = 'i') input: String,
    @arg(name = "group-by", short = 'g') groupBy: String = "ZIP Code",
    @arg(name = "top", short = 't') top: Int = 10
  ): Unit =
    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)

    val counter = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    val iter = parser.iterator().asScala

    iter.foreach: record =>
      val key = record.get(groupBy)
      counter(key) += 1

    val sorted = counter.toSeq.sortBy(-_._2).take(top)
    println(s"Top $top entries grouped by '$groupBy':")
    sorted.foreach:
      case (k, v) => println(f"$k%-20s → $v%5d")

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)
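
The counting step can also be written without mutable state. The following sketch is an alternative formulation, not the script's own method: it folds over an iterator of keys into an immutable Map and breaks ties by key so the output order is deterministic. The name topCounts and the sample ZIP codes are made up for illustration.

```scala
/** Counts occurrences of each key in one pass and returns the `top` most frequent. */
def topCounts(keys: Iterator[String], top: Int): Seq[(String, Int)] =
  val counts = keys.foldLeft(Map.empty[String, Int]) { (acc, key) =>
    acc.updated(key, acc.getOrElse(key, 0) + 1)
  }
  // Sort by descending count, then by key so that ties come out in a stable order.
  counts.toSeq.sortBy((key, n) => (-n, key)).take(top)

@main def topCountsSketch(): Unit =
  // Made-up ZIP codes for illustration only.
  val zips = Iterator("60647", "60618", "60647", "60622", "60618", "60647")
  topCounts(zips, top = 2).foreach((zip, n) => println(f"$zip%-20s $n%5d"))
```

Like the mutable version, this makes a single pass over the records and keeps only one counter per distinct key in memory.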

Visualization¶

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "org.apache.commons:commons-csv:1.14.0"
//> using dep "org.knowm.xchart:xchart:3.8.8"

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.jdk.CollectionConverters.*
import org.apache.commons.csv.*
import org.knowm.xchart.{CategoryChartBuilder, BitmapEncoder}
import org.knowm.xchart.BitmapEncoder.BitmapFormat
import mainargs._

object VisualizeGraffitiData:
  @main
  def run(
    @arg(name = "input", short = 'i') input: String,
    @arg(name = "output", short = 'o') output: String = "graffiti_trend.png"
  ): Unit =
    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

    val monthly = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    parser.iterator().asScala.foreach: record =>
      val date = LocalDate.parse(record.get("Creation Date"), fmt)
      val key = f"${date.getYear}-${date.getMonthValue}%02d"
      monthly(key) += 1

    val (months, counts) = monthly.toSeq.sortBy(_._1).unzip

    val chart = new CategoryChartBuilder()
      .width(800).height(600)
      .title("Graffiti Removal Requests Per Month")
      .xAxisTitle("Month")
      .yAxisTitle("Requests")
      .build()

    // Box the counts to java.lang.Integer so the list is a genuine List of Number; no unchecked cast is needed.
    chart.addSeries("Requests", months.asJava, counts.map(Int.box).asJava)
    BitmapEncoder.saveBitmap(chart, output, BitmapFormat.PNG)

    println(s"Saved chart to $output")

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)
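
The monthly bucketing can be expressed with java.time.YearMonth instead of hand-formatted string keys. This is a sketch of an alternative, assuming the same MM/dd/yyyy date format; the name monthlyCounts and the sample dates are hypothetical.

```scala
import java.time.{LocalDate, YearMonth}
import java.time.format.DateTimeFormatter
import scala.collection.immutable.SortedMap

/** Counts MM/dd/yyyy dates per calendar month, keeping the months in chronological order. */
def monthlyCounts(dates: Iterator[String]): SortedMap[YearMonth, Int] =
  val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
  dates.foldLeft(SortedMap.empty[YearMonth, Int]) { (acc, s) =>
    val month = YearMonth.from(LocalDate.parse(s, fmt))
    acc.updated(month, acc.getOrElse(month, 0) + 1)
  }

@main def monthlySketch(): Unit =
  // Made-up dates for illustration only.
  val dates = Iterator("01/05/2018", "01/20/2018", "03/02/2018")
  monthlyCounts(dates).foreach((month, n) => println(s"$month $n"))
  // 2018-01 2
  // 2018-03 1
```

Note that a month with no requests (February in the sample) is simply absent, here and in the script above, so the chart shows no bar for it unless zero counts are filled in first.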

Reproducibility¶

//> using scala "3.3.5"
//> using dep "com.lihaoyi::mainargs:0.7.6"
//> using dep "org.apache.commons:commons-csv:1.14.0"

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.jdk.CollectionConverters.*
import org.apache.commons.csv.*
import mainargs._

object SaveFilteredGraffitiData:
  @main
  def run(
    @arg(name = "input", short = 'i') input: String,
    @arg(name = "output", short = 'o') output: String,
    @arg(name = "status", short = 's') status: Option[String] = None,
    @arg(name = "start-date") startDate: Option[String] = None,
    @arg(name = "end-date") endDate: Option[String] = None
  ): Unit =
    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
    val headers = parser.getHeaderNames.asScala
    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

    val writer = Files.newBufferedWriter(Paths.get(output), StandardCharsets.UTF_8)
    val printer = CSVFormat.DEFAULT.builder
      .setHeader(headers.toSeq*)
      .get()
      .print(writer)

    // Parse the optional ISO-8601 (yyyy-MM-dd) bounds once, not once per record.
    val start = startDate.map(LocalDate.parse(_))
    val end = endDate.map(LocalDate.parse(_))

    val matched = parser.iterator().asScala.filter: record =>
      val rowStatus = record.get("Status")
      val rowDate = LocalDate.parse(record.get("Creation Date"), fmt)
      // A filter that was not supplied (None) accepts every row.
      val statusOK = status.forall(_ == rowStatus)
      val startOK = start.forall(sd => !rowDate.isBefore(sd))
      val endOK = end.forall(ed => !rowDate.isAfter(ed))
      statusOK && startOK && endOK

    matched.foreach: record =>
      val row = headers.map(record.get).asJava
      printer.printRecord(row)

    printer.close()
    println(s"Wrote filtered records to $output")

  def main(args: Array[String]): Unit =
    ParserForMethods(this).runOrExit(args.toIndexedSeq)
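
The script writes rows through a CSV printer and does not join fields with commas, because a field may itself contain a comma or a quote. The sketch below shows roughly the RFC 4180 quoting rule in plain Scala to make that concrete. It is a simplification under stated assumptions: the names csvField and csvRow are hypothetical, the sample record is made up, and Commons CSV's own quoting rules cover more cases than this.

```scala
/** Quotes a field when it contains a comma, quote, or line break; doubles embedded quotes. */
def csvField(s: String): String =
  if s.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r') then
    "\"" + s.replace("\"", "\"\"") + "\""
  else s

/** Joins already-escaped fields into one CSV line (without the line terminator). */
def csvRow(fields: Seq[String]): String = fields.map(csvField).mkString(",")

@main def quotingSketch(): Unit =
  // Made-up record: the second field contains the delimiter.
  println(csvRow(Seq("Completed", "Brick, painted", "60647")))
  // Completed,"Brick, painted",60647
```

A naive mkString(",") would turn the sample record into four columns instead of three, which is the kind of corruption the printer avoids.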