Topics in Big Data

Last Taught: Spring 2020

The goal of this class is to cover topics in Big Data. The focus will be on the principles and practices of data storage, data modeling techniques, data processing and querying, data analytics, and applications of machine learning using these systems. We will learn how these concepts apply to large-scale urban analytics.

Syllabus

  1. Applications of Big Data
  2. History of Database and Big Data Systems
  3. Big Data Infrastructure
    • Computing clusters (quick overview of cloud computing)
    • Understanding the database anatomy and optimizing access
    • Online transactions
    • Understanding NoSQL
    • Column Storage vs Row Storage
    • Distributed System Resilience: ZooKeeper, HashiCorp Consul, systemd, Google Chubby, Paxos and Consensus
    • Storage Products: Bigtable, DynamoDB, Spanner, Memcached
  4. Computation Models and Big Data Processing (Batch Data)
    • Classical Workflow Systems
    • MapReduce and HDFS
    • Spark and RDD
  5. Computation Models and Big Data Processing (Streaming Data)
    • Pulsar and Kafka: Data Collection and Management (Pub/Sub systems)
    • Storm and Heron
  6. Analytics
    • Clustering and Dimensionality Reduction
    • Link Analysis and PageRank
    • Large Scale Machine Learning
  7. Practical Applications (Projects)
    • City Scooter Data Analysis
    • City Accident Data Analysis
    • Recommender Systems
    • Transit Energy Systems

Principles of Operating Systems

Last Taught: Spring 2019

This is an introductory course on operating systems. You will learn basic concepts in OS design and implementation. The course balances theory and concepts with practical, hands-on material.

Syllabus

The course covers the following topics over the semester, in the order listed.

  1. Overview of the C programming language
  2. History of operating systems
  3. Architecture of a modern computing system
  4. Process creation and management
  5. Scheduling policies
  6. Memory management, virtual memory, paging
  7. Concurrency, threads, mutual exclusion
  8. Interprocess communication, message queues, pipes, shared memory, sockets
  9. Devices, file systems
  10. Time permitting: security policies, distributed systems

Reliable Distributed Systems

Last Taught: Fall 2019

The goal of this class is to provide a foundation in the area of reliable and resilient distributed computing. This is especially important for constructing high-assurance applications in this era of the Internet of Things and Smart Cities. The technologies in this area are changing rapidly: Memcached (a new kind of key-value store) has displaced standard file system storage, Chubby supports scalable locking and synchronization, ZooKeeper enables consistency-based distributed services, and Bigtable manages sparse but enormous data sets. ZeroMQ, MQTT and DDS provide reliable communication services.

Syllabus

The course is divided into six modules and a final project. Each module ends with an assignment and a report on the topic.

  1. Module 1: Review of Networking - We will review sockets, internet routing, TCP/UDP and DNS. These concepts are the backbone of distributed systems.
  2. Module 2: Internet of Things and Cloud - We will review the Internet of Things, including the different interaction patterns of distributed applications, for example pub/sub and synchronous and asynchronous point-to-point communication. You will learn, or review, how to use MQTT, REST/WebSocket, ZeroMQ and DDS in this module. Towards the end of this module you will build a vehicle-to-vehicle (V2V) communication network between cars in a simulated environment called TORCS.
  3. Module 3: Reliability in Distributed systems – In this module you will be introduced to formal concepts related to reliability in distributed computing systems.
  4. Module 4: Understanding performance bottlenecks, Quality of Service, Failures and Testing - In this module we will review what quality of service is, how it is expressed, and what failure means. We will also study the mechanisms for testing these systems. It will be important to understand the concept of time synchronization. Additionally, during this module you will be introduced to FMECA (failure mode, effects and criticality analysis) and will analyze the failure modes of the distributed application you built during the second module.
  5. Module 5: Handling Failures in Distributed systems – In this module you will be introduced to formal concepts related to reliability in distributed computing systems. We will discuss various techniques for overcoming failures and achieving consistency, availability, and reliability in distributed systems. We will use the Guide to Reliable Distributed Systems book for most of this module and the modules that follow.
  6. Module 6: Related technologies – In this last module we discuss several tools and techniques for retrofitting reliability to complex systems. We will review the security schemes, clock synchronization, and transaction schemes that are used to achieve reliability in practical distributed systems. At the end of this module you will be able to apply these reliability concepts to the V2V car network assignment.
  7. Final Project - The final module of this course is a project, which you will develop in teams.

Data Science for Smart Cities

Integrating technological and socio-economic approaches to challenges facing metropolitan areas experiencing unprecedented growth. Infrastructure and resources needed for sustainable development and maintaining quality of life. Adapting technology-driven internet-of-things framework to the Smart Cities concept of urban development. Ethical and justice concerns, including privacy and equitable access to data. Algorithmic methods of machine learning and statistics, such as supervised and unsupervised learning, factor analysis, multi-dimensional regression analysis, and hierarchical linear modeling. Agent-based and equation-based simulation modeling. Linear and nonlinear optimization. Mixed methods approaches to gathering and analyzing qualitative and quantitative socio-economic data. Computational methods for data from large distributed infrastructure. The course involves lectures on the fundamental content material and exercises designed to familiarize students with the use of data analytics and qualitative research methods, which combine and culminate in a semester-long multi-disciplinary group project. The group projects will cover three primary topics related to Smart City applications: (1) Transportation; (2) Energy Management; and (3) Water Quality. Each project team will include four undergraduate students with different disciplinary backgrounds.

Syllabus

  1. Background and primer on the smart cities/IoT concept, with example smart cities from around the world.
  2. Data collection at massive scales using the Internet of Things, mobile phones, and social media. GIS (Geographical Information Systems).
  3. Introduction to Analytics Tools + Data sets: Google Colab + GitHub + Visualizing and Preprocessing Data sets
  4. Data Science Methods and their relevance to Smart City Applications.
  5. Introduction to analytics and mining (machine learning) methods for Smart City applications
  6. Qualitative Research Methods: Purpose, Strategies, and Design
  7. Socio-economic data collection and analysis.
  8. Information privacy, security, responsible use, and ethical issues in data collection and use