Determine a Reasonable Sample Size for Training a Text Classifier

Suppose we are building a text classification model to find a few target needles in a massive haystack of records. For example, we might be looking for Death messages among emergency services messages: about 20 in every 50,000. Labeling such a large dataset to find the Death messages is costly, so we don’t want to label more than necessary. Assuming we already have enough labeled data to estimate the target proportion, we can determine whether the sample used to train the classification model is large enough.
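As a rough illustration of the arithmetic behind estimating the target proportion, the sketch below applies the standard normal-approximation formula for the sample size of a proportion, n = z² p(1 − p) / e², with an optional finite population correction. This is not necessarily the method the full post uses. The function name `sample_size_for_proportion`, the 95% confidence level, and the margin of error (half the target rate) are all assumptions chosen for the example; only the 20-in-50,000 rate comes from the text.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence=0.95, population=None):
    """Sample size needed to estimate proportion p to within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2.
    If population is given, a finite population correction is applied.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")

    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = (z ** 2) * p * (1 - p) / margin ** 2

    if population is not None:
        # Finite population correction: fewer labels are needed when the
        # sample is a large fraction of the whole dataset.
        n = n / (1 + (n - 1) / population)

    return math.ceil(n)


# Target rate from the text: 20 Death messages per 50,000 messages.
target_rate = 20 / 50_000

# Assumption for illustration: estimate the rate to within half its value.
margin = target_rate / 2

n_infinite = sample_size_for_proportion(target_rate, margin)
n_finite = sample_size_for_proportion(target_rate, margin, population=50_000)

print(f"target rate: {target_rate:.4%}")
print(f"labels needed (no correction): {n_infinite}")
print(f"labels needed (population of 50,000): {n_finite}")
```

The normal approximation is crude for events this rare, so the numbers are best read as an order-of-magnitude guide to labeling cost, not as a precise requirement.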


Review of Research Paper: 'Pubertal Suppression for Transgender Youth and Risk of Suicidal Ideation'

This post reviews the paper ‘Pubertal Suppression for Transgender Youth and Risk of Suicidal Ideation’, one of the primary scientific papers supporting the use of drugs to inhibit the progress of puberty in children. This research won the American Academy of Pediatrics ‘Paper of the Year’ award in 2020 and has been cited 104 times, lending considerable credence to its conclusions. The review focuses on the quality of the paper’s data, the arguments made by the authors, and the statistical results used to support those arguments.


Comparing Brand Monitoring Vendors with Other Operational Awareness Approaches

Operational Awareness is one of the most important functions a firm must perform. However, most firms don’t have an office directly responsible for this work. Instead, a subset of this work, Brand Monitoring, is part of the Marketing department, which might include a more general reputation and crisis management office if the firm is large enough. This post explores the field, as well as tools and approaches to improve the efficacy of results.


Programming for Balancing Short- and Long-Term Needs

Software engineering is all about design and balancing requirements. Anyone can learn syntax and how to code, but many challenges emerge as the codebase grows. Some aspects to balance include delivering functionality vs code debt, deploying solutions vs preferred architecture, and countless others. In addition, developers need to understand and stay current on not only the language, but also the ecosystem of libraries and tools, the underlying data structures and algorithms, and design patterns.


Cheatsheet for Documentation

The README.md file

Docstring

Single-line for a function

Focus on ‘do this, return that’.

    def multiplier(a, b):
        """Take in two numbers, return their product."""
        product = a * b
        return product

Multi-line for function

PEP 257 provides standard conventions for usage.

    def multiplier(a, b):
        """
        Take in two numbers, return their product.

        This is typical multiplication for two scalars with no extension to matrices.

        Args:
            a(int): a decimal integer
            b(int): another decimal integer

        Returns:
            product(str): string of the product of a and b

        Raises:
            IOError: an error occurred.


Cheatsheet for PyTest Configurations

Multiple tests

Run the same test code with many different parameters to create multiple tests. All tests will run even if there is a failure.

    recs = [(1, 2), (2, 3), (3, 4)]

    @pytest.mark.parametrize("x, y", recs)
    def test_extract_process(x, y):
        val = my_function(x)
        assert val == y

Fixtures

Use fixtures to run code before and after all tests.

Database fixture

    @pytest.fixture()
    def resource_db():
        # setup
        log_file = Path('./tests/tmp/process.log')
        db_file = Path('./tests/tmp/test.db')
        logger = Logger(log_file).create_logger()
        db = Database(db_file = db_file,
                      tables_list = LIST_ALL_TABLES,
                      meta = meta,
                      logger = logger,
                      path_download = '.


Terminology Useful for NLP

NLP allows for both theoretical and practical study of language. Below are a few aspects of study, as well as terminology, that are frequently used within the field. Because of the interdisciplinary nature of computational linguistics, the terms come from linguistics, computer science, and mathematics.

Semiotics - a philosophical theory covering the relationship between signs and the things they reference.

Phonetic and Phonological Knowledge - Phonetics is the study of language at the level of sounds, while phonology is the study of the combination of sounds into organized units of speech.


Three Levels of Customer Explanations for Data Science

Every customer is different. That seems obvious until they begin to ask you specific questions about your work - or they don’t ask anything at all. It is good to have a framework for providing information: starting general and moving to more specific. Note that at no time are you at the math or calculation level. I’ve never seen that go well.

Solution level: What you’re doing

Nothing special here, just input and output.


Determining Sample Size for AI Models

We are going to dive into the deep, disturbing world of sample size in AI. This work is RARELY done as part of AI solutions. There are comparatively few research papers on this topic, and the approaches they offer tend to be specific to the underlying problem addressed in the paper. However, the simple fact that it is poorly understood offers great insight into the world of AI, which is why I describe it as ‘disturbing’.


Design Thinking and Employee Maturity

I thought about why some data scientists strike me as Junior- to Mid-level rather than Mid- to Senior-level. I think the reason is that I often ask them to propose a design for a solution to multiple requirements or problems. The response is almost always ‘how do I do this?’ - bad answer. This is very different from high-performing teams I’ve been on, where it is more common to have multiple people arguing over multiple solution designs.
