Data Science with Scala and Smile

Monte Carlo Simulation

This example shows how to use the Monte Carlo method to estimate the area of a circle.

First, we’ll look at the code for generating a data file for the Monte Carlo simulation, given the desired number of darts. The resulting file consisting the x- and y-coordinates of each dart along with a boolean flag indicating whether or not the dart landed inside the circle.

 1//> using scala 3.3.6
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3
 4import mainargs.{main, arg, ParserForMethods}
 5import java.nio.file.{Path, Files, StandardOpenOption}
 6import java.nio.charset.StandardCharsets
 7
 8object GenerateDarts:
 9
10  def main(args: Array[String]): Unit = ParserForMethods(this).runOrExit(args.toIndexedSeq)
11
12  @main(doc = "Generate random darts for Monte Carlo simulation")
13  def run(
14    @arg(short = 'n', doc = "Number of darts to generate") n: Int = 1000000,
15    @arg(short = 'o', doc = "Output CSV file") output: String = "darts.csv"
16  ): Unit =
17    
18    val random = scala.util.Random
19
20    Files.writeString(Path.of(output), s"x,y,inside\n", StandardCharsets.UTF_8)
21
22    (1 to n).foreach: i =>
23      val x = random.nextDouble() * 2 - 1
24      val y = random.nextDouble() * 2 - 1
25      val inside = x * x + y * y <= 1
26      Files.writeString(Path.of(output), s"$x,$y,$inside\n", StandardCharsets.UTF_8, StandardOpenOption.APPEND)
27
28    println(s"Generated $n darts and saved to $output")

Next, we’ll look at the code for estimating the area of a circle using the Smile library for statistical computing and machine learning in Scala. In particular, our code loads dataframes representing the points where the darts landed and figures the area of the circle from the ratio of darts that landed inside the circle to the total number of darts.

 1//> using scala 3.3.6
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "com.github.haifengl::smile-scala:4.3.0"
 4//> using dep "org.slf4j:slf4j-simple:2.0.17"
 5
 6import mainargs.{main, arg, ParserForMethods}
 7import java.nio.file.{Path, Files}
 8import java.nio.charset.StandardCharsets
 9import smile.data.DataFrame
10import smile.io.Read
11import org.apache.commons.csv.CSVFormat
12
13object EstimateArea:
14
15  def main(args: Array[String]): Unit = ParserForMethods(this).runOrExit(args.toIndexedSeq)
16
17  @main(doc = "Monte Carlo area estimator") 
18  def run(
19    @arg(short = 'i', doc = "Input CSV file of darts (default: darts.csv)") input: String = "darts.csv",
20    @arg(short = 'o', doc = "Optional output file to write the estimated area") output: Option[String] = None
21  ): Unit =
22
23    if !Files.exists(Path.of(input)) then
24      println(s"Input file '${input}' does not exist.")
25      sys.exit(1)
26
27    val df: DataFrame = Read.csv(input,
28      CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get()
29    )
30
31    val totalCount = df.nrow()
32
33    if totalCount == 0 then
34      println(s"Input file '${input}' is empty.")
35      sys.exit(1)
36
37    val insideCount = df.stream().filter(row => row.getBoolean("inside")).count()
38    val estimatedArea = 4.0 * insideCount / totalCount
39    val resultText = f"Estimated area of unit circle: $estimatedArea%.5f using $insideCount/$totalCount darts"
40
41    println(resultText)
42
43    output.foreach: outputPath =>
44      Files.writeString(Path.of(outputPath), resultText + "\n", StandardCharsets.UTF_8)
45      println(s"Result also written to ${outputPath}")

Going Between Java and Scala APIs

When using Java-centric APIs, such as Smile, we often need to convert between Java and Scala collections. This example shows how to convert between the two using the asScala and asJava methods. These conversions are not strictly needed unless we want to interact with these collections in a Scala way, which we usually do when using Scala.

scala> import scala.jdk.CollectionConverters.*

scala> java.util.List.of("hello", "world")
val res0: java.util.List[String] = [hello, world]

scala> res0.asScala
val res1: scala.collection.mutable.Buffer[String] = Buffer(hello, world)

scala> res1.asJava
val res3: java.util.List[String] = [hello, world]

scala> List("hola", "mundo")
val res4: List[String] = List(hola, mundo)

scala> res4.asJava
val res5: java.util.List[String] = [hola, mundo]

Graffiti/311 Chicago Data Portal

This example aims to introduce Scala + Smile with a compelling example from our hometown of Chicago. Well, it could be any city!

The Chicago Data Portal Graffiti/311 example provides a lightweight, scriptable toolkit for analyzing graffiti-related 311 service requests in the City of Chicago. Using Scala and command-line tools, the system allows users to inspect, filter, aggregate, and visualize graffiti complaint data efficiently—even on large files.

While graffiti is often–and wrongly–perceived as a cosmetic or quality-of-life issue, tracking and understanding patterns in graffiti reports can reveal deeper insights into urban infrastructure, neighborhood disinvestment, and resident engagement.

311 calls are not limited to being complaints; they are a form of civic participation. By examining this data over time and across geographic regions, we gain valuable information about public responsiveness, spatial inequality, and where municipal services are (or aren’t) being delivered equitably to communities (a longstanding issue in many Chicago neighborhoods).

This demonstration application helps make that analysis transparent, reproducible, and accessible.

Requirements (Functional and Non-Functional)

Data Acquisition

The system must be able to:

  • Download the Chicago 311 graffiti removal dataset in CSV format from the City of Chicago’s open data portal.

Data Inspection

The system must be able to:

  • Load and parse the downloaded CSV file, treating the first line as a header.

  • Display the dataset schema (column names).

  • Preview a limited number of rows from the dataset.

Data Filtering

The system must be able to:

  • Filter requests based on:

    • Service status (e.g., “Completed”, “Open”).

    • A start date and/or end date range using the “Creation Date” column.

  • Limit the number of matching rows displayed.

  • Save the filtered dataset to a new CSV file.

Data Aggregation

The system must be able to:

  • Count the number of requests grouped by a specific column (e.g., “Zip Code” or “Surface Type”).

  • Display the top N group values by count.

Data Visualization

The system must be able to:

  • Aggregate requests by month using the “Creation Date” column.

  • Generate a bar chart of graffiti removal trends over time.

  • Save the chart to a file (e.g., PNG), with no GUI dependencies.

Reproducibility & Automation

The system must be able to:

  • Be scriptable and composable using CLI arguments (no hardcoded values).

  • Use lazy evaluation where possible to minimize memory usage on large files.

  • Avoid dependence on any GUI (especially for visualization or preview).

Implementation

Downloader

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3
 4import java.io.{BufferedInputStream, FileOutputStream}
 5import java.net.URI
 6import mainargs._
 7
 8object FetchGraffitiData:
 9  val datasetURL = "https://data.cityofchicago.org/api/views/hec5-y4x5/rows.csv?accessType=DOWNLOAD"
10
11  @main
12  def run(
13    @arg(name = "output", short = 'o', doc = "Path to save the downloaded CSV file")
14    output: String = "311_graffiti.csv"
15  ): Unit =
16    val urlStream = new BufferedInputStream(URI.create(datasetURL).toURL().openStream())
17    val fileOut = new FileOutputStream(output)
18
19    urlStream.transferTo(fileOut)
20
21    urlStream.close()
22    fileOut.close()
23
24    println(s"Downloaded dataset to $output")
25
26  def main(args: Array[String]): Unit =
27    ParserForMethods(this).runOrExit(args.toIndexedSeq)
1mkdir -p dataset
2curl -L -o dataset/311_graffiti.csv "https://data.cityofchicago.org/api/views/hec5-y4x5/rows.csv?accessType=DOWNLOAD"

Loader

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "org.apache.commons:commons-csv:1.14.0"
 4
 5import java.nio.file.{Files, Paths}
 6import java.nio.charset.StandardCharsets
 7import scala.jdk.CollectionConverters.*
 8import org.apache.commons.csv.*
 9import mainargs._
10
11object LoadGraffitiData:
12  @main
13  def run(
14      @arg(name = "input", short = 'i') input: String,
15      @arg(name = "limit", short = 'l') limit: Int = 5
16  ): Unit =
17    val reader =
18      Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
19    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
20
21    val headers = parser.getHeaderNames.asScala
22    println(s"Headers: ${headers.mkString(", ")}")
23
24    val iter = parser.iterator().asScala
25    println(s"\nFirst $limit rows:")
26    iter
27      .take(limit)
28      .foreach: record =>
29        val row = headers.map(h => s"$h=${record.get(h)}").mkString(", ")
30        println(row)
31
32  def main(args: Array[String]): Unit =
33    ParserForMethods(this).runOrExit(args.toIndexedSeq)

Filtering

This shows how to filter based on some criteria. In this case, we are filtering dates/status of a particular report.

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "org.apache.commons:commons-csv:1.14.0"
 4
 5import java.nio.file.{Files, Paths}
 6import java.nio.charset.StandardCharsets
 7import java.time.LocalDate
 8import java.time.format.DateTimeFormatter
 9import scala.jdk.CollectionConverters.*
10import org.apache.commons.csv.*
11import mainargs._
12
13object FilterGraffitiData:
14  @main
15  def run(
16    @arg(name = "input", short = 'i') input: String,
17    @arg(name = "status", short = 's') status: String = "Completed",
18    @arg(name = "start-date") startDate: String = "2025-01-01",
19    @arg(name = "end-date") endDate: String = "2025-01-31",
20    @arg(name = "limit", short = 'l') limit: Int = 5,
21    @arg(name = "count-only", doc = "If set, only print number of matching rows") countOnly: Boolean = false
22  ): Unit =
23
24    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
25    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
26    val headers = parser.getHeaderNames.asScala
27    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
28
29    val filtered = parser.iterator().asScala.filter: record =>
30      val rowStatus = record.get("Status")
31      val rowDate = LocalDate.parse(record.get("Creation Date"), fmt)
32
33      val statusOK = rowStatus == status
34      val startOK = !rowDate.isBefore(LocalDate.parse(startDate))
35      val endOK = !rowDate.isAfter(LocalDate.parse(endDate))
36
37      statusOK && startOK && endOK
38
39    if countOnly then
40      val total = filtered.size
41      println(s"$total matching rows.")
42    else
43      val taken = filtered.take(limit).toList
44      println(s"Showing ${taken.size} matching rows:")
45      taken.foreach: record =>
46        val row = headers.map(h => s"$h=${record.get(h)}").mkString(", ")
47        println(row)
48
49  def main(args: Array[String]): Unit =
50    ParserForMethods(this).runOrExit(args.toIndexedSeq)

Aggregation

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "org.apache.commons:commons-csv:1.14.0"
 4
 5import java.nio.file.{Files, Paths}
 6import java.nio.charset.StandardCharsets
 7import scala.jdk.CollectionConverters.*
 8import org.apache.commons.csv.*
 9import mainargs._
10
11object AggregateGraffitiData:
12  @main
13  def run(
14    @arg(name = "input", short = 'i') input: String,
15    @arg(name = "group-by", short = 'g') groupBy: String = "ZIP Code",
16    @arg(name = "top", short = 't') top: Int = 10
17  ): Unit =
18    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
19    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
20
21    val counter = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
22    val iter = parser.iterator().asScala
23
24    iter.foreach: record =>
25      val key = record.get(groupBy)
26      counter(key) += 1
27
28    val sorted = counter.toSeq.sortBy(-_._2).take(top)
29    println(s"Top $top entries grouped by '$groupBy':")
30    sorted.foreach:
31      case (k, v) => println(f"$k%-20s → $v%5d")
32
33  def main(args: Array[String]): Unit =
34    ParserForMethods(this).runOrExit(args.toIndexedSeq)

Visualization

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "org.apache.commons:commons-csv:1.14.0"
 4//> using dep "org.knowm.xchart:xchart:3.8.8"
 5
 6import java.nio.file.{Files, Paths}
 7import java.nio.charset.StandardCharsets
 8import java.time.LocalDate
 9import java.time.format.DateTimeFormatter
10import scala.jdk.CollectionConverters.*
11import org.apache.commons.csv.*
12import org.knowm.xchart.{CategoryChartBuilder, BitmapEncoder}
13import org.knowm.xchart.BitmapEncoder.BitmapFormat
14import mainargs._
15
16object VisualizeGraffitiData:
17  @main
18  def run(
19    @arg(name = "input", short = 'i') input: String,
20    @arg(name = "output", short = 'o') output: String = "graffiti_trend.png"
21  ): Unit =
22    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
23    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
24    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
25
26    val monthly = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
27    parser.iterator().asScala.foreach: record =>
28      val date = LocalDate.parse(record.get("Creation Date"), fmt)
29      val key = f"${date.getYear}-${date.getMonthValue}%02d"
30      monthly(key) += 1
31
32    val (months, counts) = monthly.toSeq.sortBy(_._1).unzip
33
34    val chart = new CategoryChartBuilder()
35      .width(800).height(600)
36      .title("Graffiti Removal Requests Per Month")
37      .xAxisTitle("Month")
38      .yAxisTitle("Requests")
39      .build()
40
41    chart.addSeries("Requests", months.asJava, counts.asJava.asInstanceOf[java.util.List[Number]])
42    BitmapEncoder.saveBitmap(chart, output, BitmapFormat.PNG)
43
44    println(s"Saved chart to $output")
45
46  def main(args: Array[String]): Unit =
47    ParserForMethods(this).runOrExit(args.toIndexedSeq)
../_images/xchart-graffiti-trend.png

Reproducibility

 1//> using scala "3.3.5"
 2//> using dep "com.lihaoyi::mainargs:0.7.6"
 3//> using dep "org.apache.commons:commons-csv:1.14.0"
 4
 5import java.nio.file.{Files, Paths}
 6import java.nio.charset.StandardCharsets
 7import java.time.LocalDate
 8import java.time.format.DateTimeFormatter
 9import scala.jdk.CollectionConverters.*
10import org.apache.commons.csv.*
11import mainargs._
12
13object SaveFilteredGraffitiData:
14  @main
15  def run(
16    @arg(name = "input", short = 'i') input: String,
17    @arg(name = "output", short = 'o') output: String,
18    @arg(name = "status", short = 's') status: Option[String] = None,
19    @arg(name = "start-date") startDate: Option[String] = None,
20    @arg(name = "end-date") endDate: Option[String] = None
21  ): Unit =
22    val reader = Files.newBufferedReader(Paths.get(input), StandardCharsets.UTF_8)
23    val parser = CSVFormat.DEFAULT.builder().setHeader().setSkipHeaderRecord(true).get().parse(reader)
24    val headers = parser.getHeaderNames.asScala
25    val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
26
27    val writer = Files.newBufferedWriter(Paths.get(output), StandardCharsets.UTF_8)
28    val printer = CSVFormat.DEFAULT.builder
29      .setHeader(headers.toSeq*)
30      .get()
31      .print(writer)
32
33    val matched = parser.iterator().asScala.filter: record =>
34      val rowStatus = record.get("Status")
35      val rowDate = LocalDate.parse(record.get("Creation Date"), fmt)
36      val statusOK = status.forall(_ == rowStatus)
37      val startOK = startDate.forall(sd => !rowDate.isBefore(LocalDate.parse(sd)))
38      val endOK = endDate.forall(ed => !rowDate.isAfter(LocalDate.parse(ed)))
39      statusOK && startOK && endOK
40
41    matched.foreach: record =>
42      val row = headers.map(record.get).asJava
43      printer.printRecord(row)
44
45    printer.close()
46    println(s"Wrote filtered records to $output")
47
48  def main(args: Array[String]): Unit =
49    ParserForMethods(this).runOrExit(args.toIndexedSeq)