A Reasonable MVP Tech Stack

This is a quick summary of the products needed to get a development team started, beginning with managed accounts for code repositories. GitLab is especially good because it is open-source and highly customizable. A code repository platform is one of several pieces of team infrastructure necessary for effectively building software. The post surveys code repository platforms with explanations, starting with Atlassian, the company behind a suite of project management tools, including Trello, Jira, BitBucket, and others.

Read More

Computer Vision Using PyTorch

The deep learning movement began by applying neural networks to image classification, and PyTorch became a leading framework for work in this field. This post provides a cheatsheet of the basic methods used for computer vision with PyTorch. Configuration: a typical environment setup seeds the random number generator for all devices (both CPU and CUDA) using manual_seed() so that work can be reproduced. Even then, computations are deterministic only for your specific problem, platform, and PyTorch release.
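As a minimal sketch (the seed_everything helper name is an illustrative assumption, not from the post), seeding might look like this:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so runs can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU (and CUDA) generators
    torch.cuda.manual_seed_all(seed)  # explicitly seed every GPU, if present

seed_everything(42)
```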

Read More

Neural Network Basics: Linear Regression with PyTorch

In just a few short years, PyTorch took the crown as the most popular deep learning framework. Its concise and straightforward API allows for custom changes to popular networks and layers. While some of the descriptions may seem foreign to mathematicians, the concepts are familiar to anyone with a little experience in machine learning. This post will walk the user from a simple linear regression to an (overkill) neural network model with thousands of parameters, which provides a good base for future learning.
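As a rough sketch of the starting point (the synthetic data and hyperparameters are assumptions for illustration, not the post's own example), a linear regression in PyTorch can be as small as:

```python
import torch

# Synthetic data: y = 2x + 1 with a little noise (illustrative only).
x = torch.linspace(0, 1, 100).unsqueeze(1)
y = 2 * x + 1 + 0.05 * torch.randn_like(x)

model = torch.nn.Linear(1, 1)                  # single-feature linear regression
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(model.weight.item(), model.bias.item())  # should approach 2 and 1
```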

Read More

PySpark Refresher Tutorial

Spark is the premier big data tool for data science, and PySpark supports a natural move from the local machine to cluster computing. In fact, you can use PySpark on your local machine in standalone mode just as you would on a cluster. In this post, we provide a refresher for those who have been working on legacy or other systems and want to quickly transition back to Spark. Environment: when using the pyspark-shell, the spark session is already created for you.
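Outside the shell, in a standalone script, you build the session yourself. A minimal sketch, assuming local standalone mode and an illustrative app name:

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark on all local cores; the app name is illustrative.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("pyspark-refresher")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()
```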

Read More

A Cheatsheet for Python's Pipenv

Python’s Pipenv and Pyenv make a strong team for creating a consistent development environment to exact specifications. Pyenv allows you to choose any Python version for your project. Pipenv attempts to improve upon the original virtual environment (venv) and requirements.txt file. It does some things well, including integrating the virtual environment with dependency management, and is straightforward to use. Unfortunately, it doesn’t always live up to its originally planned, ambitious goals.

Read More

Working with XML Data Using Python

Together with JSON, XML is one of the most popular formats for structured data on the web. It is used not only for data storage but also for websites, in the form of HTML. XML was once seen as the ubiquitous data format, but with the ascent of JavaScript, JSON became more popular for web applications. Still, XML is an effective format, and learning to parse and work with it is necessary for anyone who works with a variety of data sources.
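A minimal parsing sketch with the standard library's xml.etree.ElementTree (the document and element names below are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A small illustrative document; the element names are invented for the example.
doc = """
<catalog>
  <book id="1"><title>Python Basics</title></book>
  <book id="2"><title>Data Wrangling</title></book>
</catalog>
"""

root = ET.fromstring(doc)
for book in root.findall("book"):
    print(book.get("id"), book.findtext("title"))
```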

Read More

An Introduction to Numpy and Pandas

Numpy and Pandas are the basic data science tools in the Python environment. Having a good understanding of their capabilities and how they process data is imperative to writing optimal code. This post provides an introductory overview and a refresher for those who might come back to the libraries after taking a break. The end of the post explains external interfaces for speeding up code execution and performing more sophisticated matrix operations.
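As a small refresher sketch (the array contents are arbitrary and purely illustrative):

```python
import numpy as np
import pandas as pd

# Vectorized NumPy math on an array (no Python-level loop).
a = np.arange(10)
print(a.mean(), (a ** 2).sum())

# The same data wrapped in a labeled pandas DataFrame.
df = pd.DataFrame({"x": a, "x_squared": a ** 2})
print(df.describe())
```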

Read More

Python Functional Programming

Introduction: most programming languages are procedural or are written in an imperative style, where programs are lists of instructions that tell the computer what to do with the program’s input. Even ‘purely’ OOP languages, such as Java, are typically written in an imperative style with little thought given to actual OOP modeling. Functional code is characterised by one thing: the absence of side effects. It doesn’t rely on data outside the current function, and it doesn’t change data that exists outside the current function.
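A small sketch of the distinction (the function and variable names are illustrative assumptions):

```python
# Impure: depends on and mutates state outside the function.
totals = []

def add_and_record(x, y):
    result = x + y
    totals.append(result)   # side effect: modifies an external list
    return result

# Pure: output depends only on the arguments; nothing outside is touched.
def add(x, y):
    return x + y

print(add_and_record(2, 3), totals)  # 5 [5]
print(add(2, 3))                     # 5
```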

Read More

Object Oriented Programming with Python

Object Oriented Programming (OOP) became popular with Java, and Microsoft quickly followed up with the C# language. Now, OOP concepts are available in many languages. Python inherits a lot of these OOP attributes, but does so with its usual Pythonic minimalism. The concepts are the same, but much of the cruft, such as access modifiers, is pared down, leaving a succinct and enjoyable object modeling experience. Introduction: OOP deals with classes (blueprints) and objects (instances of a blueprint).
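A minimal sketch of a class and its instances (the Point class here is an illustrative assumption, not the post's example):

```python
class Point:
    """A minimal blueprint; each instance is one object built from it."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_to(self, other):
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** 0.5

# Two objects (instances) created from the same class (blueprint).
a = Point(0, 0)
b = Point(3, 4)
print(a.distance_to(b))  # 5.0
```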

Read More

Spark Deployments

Hadoop is seen as the staple of clusters and distributed management, and Spark is the ubiquitous data science tool. What if you combine Hadoop with Spark? We will explore that question and compare different deployment architectures in this post. Introduction: HDFS handles the storage, analytics is done with Apache Spark, and YARN takes care of the resource management. Why does that work so well together? From a platform architecture perspective, Hadoop and Spark are usually managed on the same cluster.
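As a rough sketch of that arrangement from the PySpark side (the HDFS path, app name, and column name are assumptions for illustration), a job submitted to YARN reads its data from HDFS:

```python
from pyspark.sql import SparkSession

# YARN manages the resources; HDFS holds the data. Names and paths are illustrative.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("hdfs-analytics")
    .getOrCreate()
)

df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()
```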

Read More