[
        {
          "id": "personal-i-dont-vibe-code",
          "title": "Why I Don't Vibe Code",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/i-dont-vibe-code",
          "content": "There has been a lot of discussion online lately about vibe coding and and how\nLarge Language Models (LLMs) will revolutionize the field of software\ndevelopment. Every new model will launch us into realms of pure productivity,\nshipping software at the speed of thought and removing all the friction and\noverhead of product development. Or something like that.\n\nMaybe. I’ll have to take your word for it. I don’t vibe code.\n\nIf it’s working for you, great! I’m not really here to argue the merits or flaws\nof LLMs at depth here in this piece, but it’s just never clicked for me\npersonally. This page is a “brief” accounting of various reasons why.\n\nI’m a Cheapskate\n\nI’m not a purist. I’ve tried using LLMs that are integrated into an IDE. They\nhave been useful for some tasks that are simple enough to be easily describable\nbut annoying enough to not just do them myself. For instance, resizing a grid\nof square images to be smaller. I could go look at\nthe command-line arguments for ImageMagick, but that was a perfect thing to ask\nthe AI to do. I then tried using one of the AI tools to analyze my code in a\nproject and a few other small tasks before it all came to an awkward halt. The\nsystem informed me that I had just run out of credits and I would need to\nprovide a credit card to purchase more tokens I wanted to keep going.\n\nNow, you must understand that I come from a long line of cheapskates from both\nsides of my family tree. We’ve been pinching pennies and hunting bargains for\ncenturies both here and on the other side of the Atlantic. As an example, one of\nmy distant ancestors died during the King Philip’s\nWar because he left the\nsafety of the fort to retrieve some cheese he had left behind when evacuating his\nhouse. So you must believe me that the idea of paying a service in perpetuity\nso I could think just seemed so laughably absurd and horrific that I didn’t\neven bother giving them my card. I closed the laptop. I uninstalled the IDE and went\nback to using Emacs even. And I realized\nthat I just didn’t even notice the lack anymore.\n\nI’m Old\n\nIt does help that I’m old. I’ve been writing code for a long time, especially in\nan industry that calls a developer with 5 years of experience a “senior\nengineer.” Experience is a welcome antidote to anxiety sometimes (as long as\nit’s not anxiety about ageism in an industry that calls a developer senior with\nonly 5 years of experience) , and the AI hype doee remind me of earlier\nbreakthroughs in low and no-code tooling. I don’t doubt that AI can be a useful\ntool for developers. I know there are tasks it can help with as better tooling.\nBut these arguments always leave me thinking about the accidental and essential\ncomplexity again.\n\nFred Brooks was old even when I was\na young coder myself. As the project manager for IBM’s System 360 line of\nmainframes (and accompanying operating system) he had a front row seat to when\nall the now common ways software projects go wrong were novel and new. He\ncollected these observations in a book The Mythical\nMan-Month which should\nstill be required reading for software engineering courses today. My edition was\na newer reprint that included a later essay titled “No Silver\nBullet” where Brooks looked at\nthe effect that new tools can have on developer productivity. To think like a\nprogrammer, you must understand that the real world is complex. 
Now, what if we could keep going and abstract away the act of programming\nitself? This is the dream of agentic AI, where swarms of agents can be given\ntasks to implement on their own without supervision. Sounds great! But this is\naddressing what Brooks calls accidental complexity, the things that are\ncomplicated about writing code itself. In the time since the essay was written,\nsoftware development has made great strides against this type of complexity.\nInstead of writing in low-level machine code, we can use modern high-level\nlanguages and let compilers and interpreters worry about the machine’s details. Instead of remembering how\nto write a quick sort (trust me,\nyou’re going to want to click that link) from scratch, I just need to call a\nsort method in a standard library. Instead of having to build a whole web\napplication from scratch, I can use an existing framework. If I want to rename\nor restructure some code, my editor can help do that for me. AI seems like the\nlatest iteration, and some editors have already replaced their predictable old\ntooling for renaming and refactoring code with unpredictable AI agents. Sure, it\nmight seem like rolling the dice, but how common is a critical\nfailure anyway?\n\nHowever, even as better tooling has diminished accidental complexity,\nessential complexity still remains. There is still the complicated work of\ndesigning our abstractions and systems the right way: elegant, clear\nand maintainable. And that complexity isn’t going anywhere. This type of work\ntakes skill and experience and wisdom hard-won from system failures past. And\nI’m not sure if the LLMs’ fancy autocomplete approach works so well with this type\nof complexity, which often isn’t so straightforward to solve. Maybe with\nprompting, it could be guided toward a preferred approach, but at that point the\nperson doing the guiding might as well design the approach alone, since the LLM\nwouldn’t be able to articulate why it chose a certain path. Essential complexity\nis often weird and rare and messy. Maybe I’m wrong and the models are getting\nbetter at these kinds of messy situations as well, but I’ve found that it often\nrequires a specific kind of mindset and approach. Luckily for me, I love the\nmessy stuff.\n\nI Love Mess\n\nI’ve been talking so far about how software can abstract processes, but we also\nuse abstraction’s reductive properties as a tool to understand the world. In the\nclassic book Seeing Like a\nState,\nJames Scott describes how the motivating project of post-Enlightenment states was\nto make their populations and possessions legible through abstraction and\ncategorization. To measure is to modify. 
For instance, a country might begin to\nlook at its forests not as complex ecosystems but merely as percentages of\ntimber that can be used for ship-building. This view then allows\na country to act on this information in ways like replacing those forests with\nmonocultures of just a single tree. A forest is abstracted into a system for\ngrowing ship masts.\n\nThis approach created the bureaucracy and the paper form, which has evolved into\nthe web form and database. As programmers, we need to reduce the messy data of\nthe world in order to act on it. We expect our dates to be\nexact. We\nexpect names to be relatively\nsimple.\nWe expect data to be complete at time of entry and consistent over time. Every\nprogram and every system design is a series of\nProcrustean choices about what\naspects of reality we want to reflect in our systems and what we can discard.\nI’m not saying this to criticize; this approach is the only way to build systems\nthat aren’t bogged down in an endless thicket of special situations (what we\ncall “edge cases” because they’re supposed to be rare paths on the periphery).\nBut, this process is so innate that we sometimes forget that it is also\nartificial, especially when it’s describing people. Forcing a gender field to\nonly accept “male” or\n“female”\ndoesn’t force gender itself to be binary. Our definitions of race are social\nconstructions that shift all the\ntime.\nOur simplified model might provide us with insights (autism diagnoses have\nincreased 300% over the last 20 years!) but not capture the underlying factors\nbehind those insights (it’s likely just a result of changes in how we define\nautism and increased\nscreening).\nIt’s important to step back and look at the bigger picture of how any model was\nmade and what type of knowledge it doesn’t capture. Every abstraction is also an\nocclusion. As a data journalist, I learned how to interview data and how to be\nhighly rigorous about all the ways in which the answers I found could be\nmisleading. Paranoia is the data journalist’s best friend if you want to avoid\nan embarrassing correction. You need to\nbe able to think about not just what the data says, but all the stuff it doesn’t\ninclude.\n\nUnfortunately, this kind of metacognition is something an LLM can’t ever do. The model\nis its reality. As Robin Sloan succinctly notes in his compelling essay “Are\nLanguage Models in\nHell?”, AI models are\nbuilt from and view the world in a stripped-down way. Where you and I might look\nat text and see its context (things like the text formatting and titles, the\nauthor’s bio, the site where this was linked from), the LLM operates purely on a\nworld of letters and nothing more (technically, they’re receiving subword\ntokens, which is why early models couldn’t count the letter ‘r’ in\nstrawberry). Asking an LLM to recognize the\nlimitations of its view on reality is like asking a goldfish how the water\nis.\n\n
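To illustrate that subword point, here is a toy Python sketch of greedy\nlongest-match tokenization. The vocabulary is entirely made up, and real\ntokenizers (like the byte-pair encoders behind actual models) are more\nsophisticated, but the effect is the same:\n\ndef tokenize(word, vocab):\n    # Greedily match the longest vocabulary entry at each position\n    tokens = []\n    i = 0\n    while i < len(word):\n        for j in range(len(word), i, -1):\n            if word[i:j] in vocab:\n                tokens.append(word[i:j])\n                i = j\n                break\n        else:\n            # No vocabulary match: fall back to a single character\n            tokens.append(word[i])\n            i += 1\n    return tokens\n\nvocab = {'straw', 'berry', 'st', 'raw'}\nprint(tokenize('strawberry', vocab))  # ['straw', 'berry']\n\nThe model never sees ten letters; it sees two opaque tokens, so “count the\nr’s” is not a simple lookup.\n\n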
While I was writing this section, I kept thinking about DOGE’s inept\nattempts to find fraud at the Social Security\nAdministration.\nIn one example, DOGE looked at the SSA databases and discovered there were over\n9 million records in there with birth dates more than 120 years ago but no death dates\nrecorded. Elon Musk declared the only possible explanation was that millions of\npeople were fraudulently receiving benefits. He was wrong about both the cause\nof the problem and the severity of its\nimpact.\nDOGE could’ve questioned the data quality. They could’ve examined the payments being\nmade. They could’ve asked any of the experts at SSA to explain it to them. But\ninstead they took the data as it was and leaped to wrong conclusions, a pattern\nthey repeated over and over (as in this example of a different fraud claim about\npayments):\n\n\n  In the extensive analysis that followed, agency experts carefully documented\nfallacies in DOGE’s work, according to documents reviewed by The Times and\nthose people.\n\n  “These payments are valid,” Sean Brune, an acting deputy commissioner, wrote\nin a memo examining one of the issues. (A Treasury spokeswoman declined to\ncomment.)\n\n  But Mr. Russo, who did not respond to a request for comment, said that DOGE\nwould not trust career civil servants, according to people familiar with his\nstatements. Instead, he insisted that Akash Bobba — a 21-year-old who had\ninterned at Palantir and become one of DOGE’s lead coders — conduct his own\nanalysis.\n\n\nIn their own wild ways, the DOGE crew were replicating the same operating\nconditions for themselves that cause LLMs to go astray. They refused to consider\nalternative explanations that were outside of what the data told them. They\ntalked to nobody outside of their own circle. They latched onto a simplified\nexplanation that was appealing to them because it completely validated their\nworldview of incompetent government staff and rampant fraud everywhere.\n\nThis is not a rare situation. I myself am mortified by the possibility of\nlooking like a dumbass, so I don’t ever want to outsource my data analysis to an\nLLM. But, of course, many people do. I fear this problem will only get worse.\n\nFriction is a Gift\n\nThe appeal of LLM-driven development is that it’s supposed to eliminate\nfriction. Boosters spin tales of development teams shipping dozens of features\nin a single day, using multiple teams of agents working autonomously at their\ncommand in increasingly strange\ntopologies.\nAnd I get it, software development can be tedious and frustrating. It must feel\nsuper exciting to be able to churn out code at ludicrous speeds and\nplay with polished products instead of prototypes.\n\nI need the friction though.\n\nWhen I am first learning a new language or framework, I struggle with friction\nto do even the most basic tasks. It sucks! And when I am working with a new and\nunfamiliar code repository or data source, I need to set aside hours to\nscrutinize it. I often find myself doing a close\nreading, pulling up specific files\nto look over line by line until I understand their context and the choices their\ndevelopers made. I know I could just ask an LLM to summarize the project for me\nand save myself the time, but I’ve found I need this process to really marinate\nin the code. I need it to not just understand the choices the developers made,\nbut why they made them and how they reflect the constraints or idioms of the\nlanguage they are using. I learn by failing, and if the LLM takes that work away\nfrom me, I won’t really understand what I’m doing.\n\nEven when working in familiar languages and my own code, I still rely heavily on\nfriction as a clue. When writing the code becomes hard, that tells me that I’m\ngoing down a wrong path with the current architecture, and that I should\nseriously consider redesigning things to make future enhancements easier. When\nthat happens, I usually go out for a long walk (or sign off for the day) to give\nmy brain space to step back and consider things from a new angle. It really\nworks. 
I find these pauses so effective that I will even force them upon myself\nwhen the way seems clear. When working on large software projects, I will wait\nto start coding a new feature until I’ve first written an Architectural Decision\nRecord that describes what I want to do. These\ndocuments force me to capture what I’m thinking at this point in time, my\nassumptions about the problem and the ramifications of my approach. Sometimes,\nit even makes me realize I was too enamored with my initial hunch to see how it\nwould go astray, and it always serves as a good way to capture “what were they\nthinking?” for any future inheritors of my work.\n\n
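For anyone who hasn’t written one, here is a minimal sketch of the classic\nNygard-style ADR format; the section names are conventional, and the example\ndecision is hypothetical:\n\n  ADR-007: Move report generation to a background queue\n  Status: Proposed\n  Context: Report requests now take over 30 seconds and time out at the load balancer.\n  Decision: Generate reports asynchronously and email a download link when ready.\n  Consequences: Users no longer wait on the request, but we now operate a queue and must handle delivery failures.\n\nThe format matters far less than the habit of writing down the context and the\ntrade-offs while they are still fresh.\n\n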
The LLM-driven approach to friction is to just code your way through it without\nrethinking anything. And the LLM will oblige. It’ll probably make code that will\nwork. The performance metrics will be fine, the tests will pass (especially if\nthey were also written by the LLM). But it won’t know why it chose that path. It\ndoesn’t feel friction and can’t explain if one architectural approach felt\ncleaner than another. If the engineers crafting the prompts lack the insight to\nknow what is a good approach or a bad one, they get stuck in a dynamic of asking\nthe AI to code its way through friction over and over again. This can result in\na thicket of weird abstractions, and the only design documentation for future\nteams is a single Markdown file that contained the instructions for an AI model\nused a few years back. Good luck reconstructing the architectural decisions from\nthat! It is telling that most of the vibe coding success stories I’ve seen have\nbeen by developers who are already experts in what they are asking the LLM to\nbuild (and who are thus able to guide its work), or for situations where the\nstakes of failure are low. For everything else, we just have to figure out\nhow to know if the rest of the fucking\nowl is any\ngood and safe to use.\n\nI’d be remiss if I didn’t mention one other thing that bothers me when LLM\npromoters invoke friction as a problem. Most of the LLM marketing in\nadvertisements, live demos and LinkedIn posts that I’ve seen portrays a solitary\nengineer (or perhaps a single team) heroically using LLM-driven coding to blast\nout some sort of app or website and launch it quickly (our velocity and KPIs are\nthrough the roof!). But industry really wants developers to use LLMs for work,\nwhere the friction is usually established processes and practices to keep\ndefects or even poorly-conceptualized features from making it to production.\nInevitably, the need to prioritize LLM-driven velocity is turned against people\nthemselves – other engineers or teammates in product or project management or\ntesting or compliance or design. Because those roles are seen as friction too.\nWho needs user research when we can craft AI personas? Who needs design when we\nhave AI tools to spit out web layouts? Who needs project managers when we are\nthe managers of our army of agents? What if we didn’t need to wait for another\ndeveloper to review our pull request and just automatically merged code that\npasses tests and scans? What if we didn’t have to spend any of our work time\ntalking to other people and could just live in the realm of pure coding? But\nsoftware development is a collaborative process, and each member of the team\nhelps make a good product what it is. Removing those roles or replacing them\nwith LLM-inflected ghosts will certainly allow teams to move faster, but it\ndoesn’t mean the products they deliver will be better. And the process will\ncertainly be a lot lonelier.\n\nI Care A Lot\n\nPerhaps my simplest reason for not using LLMs is that I just love programming so\nmuch that I don’t want to hand it off to a machine. In much the same way I\nwouldn’t resort to AI if I were an artist or a musician, programming is one way\nfor me to express my creativity, and I will not cede that joy. Although it can\nbe extremely frustrating at times,\nthere is a profound delight in shaping something from a nebulous idea into a\nreal system, especially if it involves an elegant implementation or interesting\nproblems. Some evenings, I close the work laptop and open the personal laptop to\ndive into some new fun thing I want to build. And when I am building software\nprofessionally as part of a team, that is even better! I love the collaboration\nand the process of shaping software together, especially the ways in which\npeople will step up and take ownership of problems. I don’t think the dynamic is\nthe same when the team is just taking ownership of prompts and the LLM assistant\nis doing the work. Or when the LLM assistant is replacing parts of the team.\n\nOwnership is important. Over the past few decades, I’ve worked in roles where\nI’ve developed a strong sense of personal responsibility. As a data journalist,\nan error in code could lead to an embarrassing correction or a devastating\nlawsuit. In civic technology, errors can mean catastrophic failures in providing\nservices and benefits, whether it’s to an entire vulnerable population or to a\nsingle person. I’m not going to say that I’ve never made mistakes, but I care a\nlot about getting it right because I care about the mission of the work. I have\nbeen privileged to work on teams with many other people who also care and want\nto do the best they can for people. An LLM can’t care. Sure, it can do a\nconvincing job of pretending, but\nit’s still just a facsimile of a mind stringing together words that are more\nlikely to be associated with each other. It’s not bothered by its mistakes or trying\nto do better, because it has no inner\nconsciousness, let alone a conscience. It can\nnever be held accountable, and I can never hand off my moral responsibilities to\nit for that reason.\n\nWhen the LLM does well, it’s a genius that will replace all coders. When the LLM\ndeletes all of your infrastructure or “lies” about tests, it’s your fault. After\nall, you just needed to structure your prompts and workflows in exactly the right\nway to jostle the LLM into giving the correct output. Oops, try again.\nAnd again. Much of the LLM advice I’ve read emphasizes that you must give all\nthe necessary instructions and amendments and codicils up front or the system\nwill do things wrong. This mindset is a significant departure from agile\ndevelopment, which emphasizes frequent course corrections and feedback and\ntrusting in your team to do the right thing. Instead, we seem to be retreating\nto a usage model similar to the batch processing of early\ncomputers in the 1950s. Except here,\ninstead of walking up to hand in a sheaf of punch cards, the solitary programmer\nis instead bringing legal documents to be turned into programs.\n\nI jest; there is no legal liability at play here. 
It’s probably not surprising\ngiven the similar demographics involved, but LLM suppliers are repeating the\nsame dynamic as Tesla. New features are being rolled out to users without safety\ntesting\nand, just as strangely, LLM boosters, like Tesla superfans, often blame\nthemselves and others for catastrophic\noutcomes\nby saying the users should’ve done better in writing their prompts. I’m not\nreally sure what to make of this, but it bothers me that technology is\nstandardizing a capitalism where more risks are borne by consumers because\ncompanies and government have both abdicated their responsibilities. We banned\nlawn darts after they killed a single\nchild, but chatbots driving users to\ndeath and psychosis are accepted as the price of innovation in AI. Will things\nchange when vibe coding itself leads to someone dying from system failures\nrather than dying of\nembarrassment?\n\nCoding has also been my comfort when times are hard. There is research suggesting that\nplaying Tetris is an effective way to avoid\nPTSD.\nThe theory is that the therapy works because engaging the parts of the brain\nthat handle arranging and rotating shapes hinders the formation of traumatic\nmemories. Now, I am fortunate enough to not suffer from PTSD (and I am not\nmaking light of people who do), but I do relate to this concept.\nProgramming feels like a complicated puzzle and has sometimes been my solace in\ndark times. As the example above hints, I know a lot about DOGE, because for\nthe past year I’ve been building and maintaining a system to track their\nrampage. Unlike a work project, this has been an\nexercise in assembling datasets to provide clarity into an organization that\nwants to stay obscured. It’s been a rewarding exercise and a way for me to\nchannel my despair into something I hope will be useful. This isn’t the only\ntime I’ve used code as a way to work through my\nsadness, and it works because it is work and the\nprocess would be diminished if I only focused on the product.\n\nA Few Other Silly Reasons\n\nThis has already proven to be a much longer piece than I expected, especially\nsince it was originally just a few short posts on Bluesky. Before I close it\nout, a few more quick reasons!\n\nFirst, I absolutely hate the unctuous tone that AI chatbots take by\ndefault. As someone who grew up in\na city on the East Coast, I get really suspicious when someone is weirdly super\nnice to me without me knowing them, because it usually means they’re either\nabout to launch into a scam or proselytize to me. Reading LLM chat transcripts\nmakes my skin crawl. Yes, I am aware I could make the LLM adopt a whole\ndifferent tone, but somehow that makes the idea feel even worse.\n\nLike many developers, I have a whole folder of draft hobby projects that have\nnever been finished. For instance, there’s the one where I was going to write a\nclone of Spelling Bee, but it was going to be in ClojureScript so I could use\nthe Blabrecs code to generate non-words\nand make it super frustrating. Okay, I guess that would’ve just been funny to\nme. You had to be there. From the LLM perspective, these are folders of failures,\nand I could indeed use an LLM to make an app a day or whatever challenge I want.\nHowever, the process was far more important than the product (again!). Not every\nwhimsy needs to become a reality. Often, I get more from the fun of\nbrainstorming and the process of learning enough to know that I don’t need to\ncontinue and finish the job. 
It’s easy to forget this sometimes.\n\nThis wasn’t going to be an essay about the morality of using LLMs for my work.\nNot because I don’t care, but because many others have written far more\neffectively than me about the fraught implications of this technology. And at\nthis moment, when LLMs are bombing schools with children or generating child\nporn on demand, I really don’t feel comfortable using them. And I don’t feel\ncomfortable not mentioning this aspect at all. It may be true that there is no\nethical consumption under capitalism, but I’ll be damned if I’m not going to at\nleast try. We can’t build a better world with tools that immiserate so many.\n\nWeirdly, nobody seems more miserable than LLM boosters. I might be more swayed\nif developers were using their newfound productivity gains to finally live that\n4-hour workweek that nerds\nwere pretending to idolize 10 years ago. But perversely, it seems like many in\nSilicon Valley are outsourcing work to the AI agents and then using their\nnewfound spare time to do even more\nwork. Instead of using their time\nfor relaxation or art or joy, they’re embracing a\n9-9-6\nwork schedule and a hyper-quantified workplace that would make even Frederick\nTaylor blanch in horror.\nIt’s possible that the LLM revolution will finally come for me and my job, but\nI’d rather not work myself into the grave first.\n\nNow What?\n\nI don’t pretend to know the future. Maybe the technology will advance to such a point that I will regret my lack of experience and familiarity. Or maybe it’ll stagnate and the whole financial house of cards will come tumbling down. If that happens, I hope we can rebuild software development into the humane practice of building a better world, one line of code at a time."
        },
        {
          "id": "personal-stronger-privacy-act",
          "title": "A Privacy Act With Teeth?",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/stronger-privacy-act",
          "content": "In the wake of Watergate and other abuses by the executive branch, Congress passed the Privacy Act of 1974 to mandate greater transparency into what types of data the government is collecting on the public and how it is able to use that data. DOGE’s actions over the past year have demonstrated the nightmare scenarios the original legislation was meant to prevent; this includes both the negligent handling of sensitive information as well as the targeted pooling of data from different agencies to create a digital Panopticon for the administration to target immigrants or any other populations that it deems a threat.\n\nBlatant violations of the Privacy Act can result in a civil litigation. On the criminal side, it caps out as a misdemeanor with a meager maximum fine of $5000. Instead, the act’s main utility has been in establishing a standard that can be enforced within the Executive Branch and Congressional oversight. However, with an actively malicious OMB and a supine Congress, nobody is minding the shop and the resulting lack of oversight has supercharged the aggressive data collection and exfiltration practices of DOGE.\n\nWe’re going to need a new Privacy Act With Teeth. I am not the policy or legislative expert to craft this, but I think it will need to have the following components:\n\n\n  Stronger penalties for violations. The current penalties could just be factored in as the cost of doing business for those determined to violate the rules.\n  Overhauling the opaque division of documentation between Systems of Record Notices (SORNs) and Privacy Impact Assessments (PIAs, mandated by the E-Government Act of 2002) with new combined reporting to more clearly describe what data is being collected by the agency and how it is being used. Include mandates to prevent public documentation from being retroactively written (as happened at OPM for the DOGE email server) or from being removed from public sight entirely (as happened at HHS).\n  Creating a new entity to monitor and enforce compliance with privacy and data sharing regulations. Since both Congress and OMB have willfully abdicated their oversight responsibilities, it probably would need to be a separate entity like the US Government Accountability Office. Like the GAO, this agency should also reside in the legislative branch to shield it from interference. It might make sense to also move CISA and some other FISMA-related oversight into this entity as well.\n  Creating new regulatory paths and oversight over data-sharing agreements and joining across multiple data sets. The partitions between and even within agencies can be super frustrating at times (for instance, Biden’s means testing for student loan data needed an act of Congress to be authorized to use tax data), but the prospect of a government mandating compulsory data collection (your taxes, etc.) and using that to build up dossiers on the public is why the Privacy Act was created.\n  Eliminating the use of private industry and services as a loophole. For instance, agencies that are formally banned from building surveillance databases could still buy from private data brokers. Similarly, using commercial tools or contractors to host data in ways to avoid specific government rules and restrictions should result in severe consequences. 
It may be necessary to “red team” the rules to identify and eliminate ways in which agencies could actively thwart the intent of the law while still remaining compliant.\n  Taking datasets into receivership if they are at risk of being abused outside of their original purpose. This is a highly nebulous idea, but I am wondering how the government can continue to provide services for vulnerable populations while protecting them from having their own data weaponized against them by future administrations – for instance, “Dreamers” who signed up for DACA now risk their addresses being provided to ICE. Deletion is one possibility, but that could be abused to hide wrongdoing and is simply impossible for things like IRS tax records or Social Security data, which is also currently being misused by ICE. What is to be done? Is there a combination of technical and policy changes that could protect against malevolence?\n  This agency could also be involved with efforts to enforce any new privacy protections and regulations against industry. As we’ve seen with the Consumer Financial Protection Bureau, its protection of the public would face significant industry pushback and interference, especially from Silicon Valley, so we would need to ensure it is safe from executive branch interference and Congressional lobbying.\n  In the long term, it seems clear that the Constitution also needs a Privacy Amendment, to prevent the Supreme Court from knocking down past precedents built on the notion that public citizens have the right to privacy. But that’s a whole other argument that faces daunting odds of ratification like any amendment.\n\n\nThis turned out to be a longer article than I expected, but our digital privacy feels far more endangered than it did when the Privacy Act was first enacted. What do you think? Is there something I’m missing?"
        },
        {
          "id": "personal-assault-oversight",
          "title": "The Assault on Oversight in the Executive Branch",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/assault-oversight",
          "content": "Editorial Note: Since this piece was originally published, bad stuff has continued to happen. I have decided to not update this piece,\nbut instead have left it as it was. To keep up with current events on these topics, go check out the excellent Unbreaking or my own continued work at DOGE Track\n\n\n\nAs we pass the first 100 days of the DOGE/Project 2025 onslaught against the administrative state, I keep thinking about how the technical assault has depended on a complementary demolition of oversight that would document, delay and even prevent some of the egregious abuses.\n\nAs every schoolkid knows, the US Constitutional system is designed to have checks and balances between the three coequal branches of government - the Executive, Legislative and Judicial. That isn’t functioning like it should. This is mainly because Congress has largely abdicated its role, leaving a void where the Trump administration has seized more power as its prerogative. The judiciary - in the form of lower federal courts - is providing much more resistance, but there is the looming question of how the Supreme Court will handle cases once they wind up there. Justice moves slowly too, and there is the worrisome thought of what happens if the President just refuses to listen to the courts? Time will tell.\n\nNot as many people realize that there are also many oversight and regulatory functions within the Executive branch itself, most of which have been created by law over the years to deter the waste, fraud and abuse that the current administration claims it is here to stop. Simply put, DOGE’s stated mission is a lie, made obvious by the fact that they have sidelined and even attacked many of the mechanisms already in place to reduce waste and fraud. The real mission has been to accrue unchecked power, made possible by chipping away at anything that might constrain them. This piece is a look at some of the individual offices that the Trump administration has attacked to destroy oversight.\n\nAn Overview of Oversight\n\nAs the sole entity able to pass laws, Congress has used its power over the years to pass legislation that places limits on what the Executive branch can do. These limits are then usually enforced by the Judiciary branch when the Executive oversteps its authority.\n\n\n  The Pendleton Civil Service Reform Act (1883) created a nonpartisan civil service to replace the corrupt “spoils system” that had been in place since the presidency of Andrew Jackson in the 1820s. There had been multiple attempts at reform since the Civil War, but it took the assassination of President Garfield in 1881 to clear away significant opposition to reforms.\n  The Antideficiency Act (1884) was first enacted after the Civil War to prevent the practice of coercive deficiencies where agencies would purposefully run out of money to force more spending by Congress. It has been revised several times since (most recently in 1982) to ensure that the government does not enter into contracts that aren’t already fully funded and that operations do not occur without funding (this is why the government shuts down and also why businesses can’t give the government freebies). 
The purpose of this act is to prevent any subversion of Congressional control over spending.\n  The Privacy Act (1974) created rules about what personally identifiable information (PII) the government can collect, how it must disclose its usage, and how individuals can seek redress to correct wrong information about themselves in its systems.\n  The Inspector General Act (1978) created a standard set of independent Inspector General positions and offices across government with the authority to investigate fraud, handle whistleblowing complaints and monitor agency operations.\n  The Civil Service Reform Act (1978) was the first major civil service reform act since the Pendleton Act. It abolished the existing US Civil Service Commission and established the Office of Personnel Management (OPM), Merit Systems Protection Board (MSPB) and the Federal Labor Relations Authority (FLRA). It also strengthened the power of federal unions.\n  The Whistleblower Protection Act (1989) protects federal whistleblowers from retaliation, defined prohibited personnel practices for federal employees and codified the oversight role of the Office of Special Counsel (OSC).\n  The Federal Vacancies Reform Act (1998) allows an incoming president 300 days to temporarily and unilaterally fill vacant Senate-confirmed positions in agency leadership with acting directors, provided they meet certain conditions.\n  The E-Government Act (2002) defines the Federal Information Security Management Act (FISMA), which governs how government agencies should handle cybersecurity, and also defined the roles and responsibilities of Chief Information Officers (CIOs) within the federal government.\n  The Federal Information Technology Acquisition Reform Act (2014) increased the responsibilities of CIOs for IT acquisition, created a federal CIO council and defined a mechanism for some CIOs to be appointed while others could be picked by the agency head.\n  The Federal Acquisition Regulation or “The FAR” is not a law, but a collection of regularly-updated regulations standardizing the rules and practices of procurement across all federal agencies to ensure they are fair and transparent.\n\n\nYou might have noticed that some of these laws are clustered together. The 1880s were a fervent time for reform as corrupt practices and machine politics reached their peak, leading to a push for reform over the next 3 decades. Similarly, a number of oversight laws were passed in the 1970s in reaction to the excesses of Watergate. Taken together, these laws exist so the public can have faith in its government. They ensure transparency, fairness and responsibility in how the Executive branch enforces the law, procures goods and services and treats its citizens. Laws and regulations are a response - there is a saying that “every safety regulation was written in blood.” If we survive this, we will need new laws and rules to address the breach in trust and betrayal of faith in democracy.\n\nSome laws like the Privacy Act do specify criminal penalties for breaking the law, but many oversight laws don’t include punishments. One of the jokes at 18F was “there’s no such thing as FISMA Jail,” because the law didn’t include any penalties for breaking it. This doesn’t mean there were no consequences. Even accidentally breaking the law can involve various levels of pressure and accountability. You would embarrass your agency. You would have to endure the scrutiny of the Inspector General. 
You - or more likely and far worse, your agency’s director - would be hauled in front of Congress to explain what happened. There would be FOIAs and attacks in the press. Following the law may often involve a fair number of bureaucratic hoops to jump through, but it’s far preferable to the risks you run if you don’t.\n\nAn Ongoing Assault\n\nMost of the direct disruption has happened in the Executive Branch, since that is where Trump and his allies have the most control. In line with their “run government like a business” model of top-down direction, they are intolerant of any idea that parts of the executive branch should be able to assert any independence. And there are many existing parts of the executive branch that were designed exactly that way, to push back on and negate some aspects of what the President can do. This is why they’ve been attacked.\n\nOffice of Personnel Management (OPM)\n\nOne of the ways in which DOGE surprised everybody was how it turned a relatively sleepy agency into a weapon against the entire government bureaucracy. Created by law in 1978, OPM’s main role before now was to standardize government policies around personnel and to offer shared services for human capital like the federal health plans. It also tracks all the paperwork around hiring, promotions, changes and retirement. And so, for most federal staff, their main interactions with OPM were receiving paperwork when they were hired, fired or retired; or picking a health plan; or looking at the rules around various practices. But all the other aspects of human resources were handled by their own agencies. Federal staff are hired to their specific agencies, their performance rankings are defined by their agency, they’re promoted by their agencies, they’re paid by their agencies, they follow the HR policies of their agencies. OPM provides the guidance, but it’s the agencies that implement it.\n\nThis is why OPM’s heel turn has been so confounding. Very early on, DOGE concentrated a large number of staff within OPM to capture control of the centralized databases and install a new email server to send/receive messages from every federal worker. The acting director was forced out and replaced by Charles Ezell, who then let DOGE wall off an area and keep out permanent agency staff. DOGE also brought in a slew of HR executives from startups who seemingly expected the agency to work for the entire federal government in the same way that HR operates within a single company. And so, they started acting like OPM had more centralized authority: ordering workers to answer the “five things” email; blasting commands to human capital officers at other agencies; asking for lists of workers so they could command agencies to fire probationary workers and short-cut established procedures for reductions in force. OPM used to be deliberative and formal with its rulemaking - the new OPM was fond of sending out memos to all agencies on Wednesdays demanding a response by Friday. None of this was normal; some of it was downright illegal. And the breach in trust has been irrevocable.\n\nOPM has also been useful as a means of obscuring DOGE’s activities and also paying some of the staff for their work. Several of the prominent members of the DOGE crew of young techies - Akash Bobba, Gavin Kliger and Edward Coristine - were appointed to “special expert” roles at OPM and then assigned on a temporary basis (known as detailing in government parlance) to other agencies through formal staffing agreements. From there, they might be detailed even further. 
DOGE was also able to use its control of OPM to extract revenue from other agencies. For instance, as part of its planning to conduct a massive Reduction-in-Force (RIF) to eliminate most staff, CFPB leadership signed an agreement paying OPM for consulting services on how to lay off its staff.\n\n\n  2025-01-20: The DOGE team moves into OPM, installing sofa beds and armed security.\n  2025-01-20: OPM sends a memo demanding that agencies provide a list of all their probationary employees and justify why they should be retained. This is viewed by many agencies as a command to terminate them.\n  2025-01-22: OPM sends a memo to all agencies mandating they revoke their telework and remote policies within 2 days.\n  2025-01-23: OPM sends out a first test email from its new government-wide email system.\n  2025-01-24: OPM sends a memo to all agencies that they must eliminate all staff associated with DEI within 60 days.\n  2025-01-27: OPM sends a memo that all agencies must submit within 10 days a detailed plan for how all staff will return to the office.\n  2025-01-28: OPM sends out the “Fork in the Road” email to all federal employees offering them a chance to resign. This is followed up by multiple emails and memos explaining things that weren’t answered in previous memos, suggesting an unfamiliarity with government employment regulations.\n  2025-01-31: Regular federal staff at OPM reportedly are locked out of access to key systems by DOGE and OPM leadership.\n  2025-02-05: To get out of a lawsuit, OPM issues a Privacy Impact Assessment for its Government-Wide Email System (GWES) that attests that responses to the email are considered voluntary, brief and contain no identifying information.\n  2025-02-24: After Elon Musk threatens to fire government staff who don’t comply, all government workers receive an email from OPM requesting that they reply with a list of five things they did last week. Many agencies are mixed on whether it’s required or optional.\n  2025-02-28: After litigation against it fails, OPM revises its Privacy Impact Assessment for the GWES system to declare that responses aren’t always to be considered voluntary.\n  2025-03-03: Federal staff receive a “Part II” email ordering them to provide a list of five things they did last week to an OPM email address by the end of Monday and to continue to do so every week. There are no other “Five Things” emails sent after this.\n  2025-03-04: OPM quietly revises its memo on probationary employees to suggest that it was not ordering agencies to fire them.\n  2025-03-20: Trump issues an executive order telling OPM to start crafting new regulations that would give it more direct power to terminate staff at other agencies.\n\n\nGeneral Services Administration (GSA)\n\nThe General Services Administration is another sleepy agency that was quickly subverted by DOGE to use as a base of operations for its attacks. Before the DOGE era, the GSA was a base for shared services that were used by multiple agencies. 
This included things like purchasing and setting up buildings, cars for agencies and office supplies. This sometimes involves outlaying capital in ways that can’t be easily appropriated by Congress. To support this, the agency has always had some large funds that can be directed at the discretion of agency leadership, if they need to buy a building, for instance.\n\nThis flexibility has also allowed the agency to expand more into technical services as a shared offering across government. The GSA Acquisition Services Fund is funded through reimbursable revenue generated from services to other agencies instead of appropriations from Congress. This allowed it to be the seed funding for 18F, the technology consultancy founded within GSA in 2013 which recouped its expenses by billing other agencies for custom software development services. Similarly, GSA has other technological services on offer as part of its Technology Transformation Services: cloud.gov, login.gov, notify.gov, etc. These are now under the leadership of Thomas Shedd, a 28-year-old software engineer with 8 years of working experience at Tesla. His main qualifications for the job seem to be unshakeable loyalty, boosting AI regardless of its applicability, a disregard for existing technical talent and experience, and a distinct lack of empathy.\n\nUnfortunately, those same aspects that made it easy for GSA to provide new services have also made it easier to destroy the agency. Since many of its outlays are not appropriated from Congress, they could be canceled without fear of subverting Congress. Yes, it’s highly counterproductive to cancel a significant number of federal leases without forethought, but it’s technically constitutional. Many of the staff of 18F were hired as term employees, as a way of expediting federal hiring for techies looking to do a tour of duty in the federal government, but that also meant they were easier to fire. And GSA’s special role in providing services to other agencies meant that changes in GSA operations could have huge ramifications. Besides canceling leases, DOGE has also moved to practically suspend purchase cards provided by GSA to support micro-purchases, individual expenses that are below $10,000 and do not need the same extensive procurement process. The damage from that has been extensive and catastrophic. And in a recent executive order, Trump is now decreeing that GSA should have the same arbitrary destructive power over software licenses that it has had over purchase cards. This, from the same agency which apparently hasn’t figured out how to procure enough Adobe Acrobat licenses for its own staff.\n\nGSA is also the other agency that initially housed many DOGE staffers who were then detailed to other agencies. These staff also occupied an entire floor of GSA headquarters where only those with “A-level” access could enter. Notable members of the DOGE wrecking crew that were based out of GSA included Luke Farritor, Jeremy Lewin and Ethan Shaotran. There still is no public documentation on why DOGE staff were split between DOGE/USDS, OPM and GSA. 
It may have been just because those two agencies were where most DOGE activity was first focused, but it also became a useful way of obscuring who was detailed from where.\n\n\n  2025-01-24: Thomas Shedd, a 28-year-old with 8 years of work experience at Tesla, is named the head of TTS.\n  2025-02-12: GSA technical staff describe being subjected to 15-minute interviews with Thomas Shedd and unidentified DOGE staffers where they are asked to defend their work.\n  2025-02-12: Dozens of workers within the Technology Transformation Services are summarily fired.\n  2025-02-20: GSA places an arbitrary $1 limit on all government spending cards.\n  2025-03-01: All remaining staff in 18F are fired via an email sent after midnight.\n\n\nOffice of Management and Budget\n\nThe Office of Management and Budget is the single most powerful department within the White House. It’s the division that most closely manages the President’s agenda and tracks spending across executive agencies. As a result, it often has served as the enforcer of many of these rules that don’t have a penalty. In the past, I’ve seen agency staff who are a lot more worried about how OMB might react to an embarrassing mistake than about their own agency leadership. If you’ve seen “The Thick of It” and enjoyed one of Malcolm Tucker’s diatribes, then you can understand what I mean by this. In the past, OMB has also been the parent agency of the US Digital Service, an arrangement which gave USDS wide latitude to get involved in its work with other agencies. Unfortunately, the very same mechanisms that made it easier for USDS to do its work are what DOGE exploited to get its own elevated access in other agencies. This is just one of the “government hacks” that allowed the Obama administration to score achievements that have now been used against the government. But that’s a whole other essay.\n\nIn the second Trump administration, the OMB under Russell Vought has become a predatory enforcer, throwing its weight around to bully federal staff and cow any leaders who assert independence. Rather than an enforcer of the status quo and the principle of least embarrassment, it’s become a wrecking ball that relies on withholding budgets, overt threats and novel legal theories that push the limits of what powers the Executive Office has. It has worked very closely with DOGE’s efforts to dismantle spending at agencies and with OPM to weaken civil service protections. An early example of this partnership was their collaboration to take down the CFPB. Within hours of Vought winning confirmation, DOGE staff swarmed into the agency, demanding superuser access to all systems under the claim that they were looking for fraud. To override any objections, Vought installed himself as the new Acting Director of the agency, ordered that all work at the agency stop, and closed the building to all. The plan mostly worked - to this day, CFPB remains largely shuttered, with many key contracts and activities canceled under a new argument that agency activities should be limited to what is defined by statute (a seemingly precise standard that is vague by design). 
And agency leadership (Russell Vought) has worked seamlessly with OMB (also, Russell Vought) to attempt several times to reduce the agency’s staff to a shell of as few as “five people and a phone” that would technically meet its statutory requirements but be unable to do anything of value. Thankfully, federal judges haven’t been swayed by OMB and DOJ’s attempts to be clever at interpreting the law, but the administration is hopeful that the Supreme Court will go their way. If the Court asserts that impoundment (the withholding of allocated funds) is constitutional, then OMB will truly be unstoppable.\n\n\n  2025-01-20: Another executive order issued by Trump mandates a hiring freeze and that hiring practices will be reviewed in a collaboration between DOGE, OPM and the OMB.\n  2025-01-20: USDS is renamed the US DOGE Service and transferred to its own White House office instead of being part of the OMB. This move is an attempt to skirt public records and FOIA laws that might apply.\n  2025-02-06: Russell Vought wins confirmation as the director of the OMB.\n  2025-02-07: Russell Vought is named the acting director of the CFPB and shuts down the agency.\n  2025-02-26: The OPM and OMB issue joint guidance on agency reorganization and reduction-in-force plans.\n\n\nActing Directors\n\nIn one of its more novel tactics, the Trump Administration has repeatedly used acting directors as the mechanisms of agency destruction. Because many agency directors require Senate confirmation before they can be appointed and this process takes time, the Vacancies Act was passed to ensure continuity of operations during a change of administration or when an agency leadership position is vacated. This act allows the President to name people as “acting” directors or to other positions, granted full powers until a Senate-confirmed appointee can be sworn in. The Vacancies Act does set some conditions for who can serve as acting directors:\n\n\n  By default, the “first assistant to the office” (i.e., the deputy director) fills the position\n  However, the President may direct a person serving in a different Senate-confirmed position to serve as the acting officer\n  Alternatively, the President can select a senior officer of the same agency to be acting director, provided they are at a certain high level and have been at the agency for more than 90 days\n\n\nIn the past, acting directors have largely served in custodial roles, ensuring the operations of the agency continue but usually not making any major changes, since they are filling the seat until the real director can come in. The general assumption has been that agency directors would feel some loyalty to their staff or responsibility to be good stewards of the agency that has been entrusted to their care. The standard practice before this year for government transitions has been that an agency head appointed by the previous administration resigns on Inauguration Day and their deputy or another designee identified during the Presidential transition then serves as acting director until a permanent appointee is sworn in. 
To give an idea, here is the list of acting officials on January 20th, almost all of whom were current acting officials.\n\nThe Trump administration has dramatically upended this practice, using acting directors as a way to significantly change or even completely demolish other agencies. For the destruction of USAID, Trump fired the agency’s acting head and then declared the recently appointed Secretary of State, Marco Rubio, the new acting director. Rubio then authorized all the activities of DOGE to demolish the agency. Similarly, within hours of being sworn in as director of the OMB, Russell Vought was named the acting director of the CFPB and ordered similar actions. In other cases, sympathetic quislings have been promoted from within the agency to be the face of unpopular actions - these include Charles Ezell at the OPM, Leland Dudek at the Social Security Administration and Dorothy Fink at Health and Human Services. This tactic has been instrumental in the DOGE takeover at many agencies - in their first acts, the acting directors usually order that DOGE staff be given superuser access, that normal approval processes be bypassed, and that staff be placed on administrative leave so they can’t prevent or observe what is happening. It’s a bit like a bank heist where the branch manager is spearheading the crime. This tactic has been used in lower positions as well, especially for DOGE staff who are often placed in positions at one agency but then serve in acting leadership positions in others.\n\nI honestly don’t know how to fix this in the future. Our governance processes trust that leaders will want to preserve the integrity and authority of the institutions in their charge, and that trust makes them vulnerable to this kind of parasitism. The business world has suffered from this issue for years - this is exactly how slash-and-burn private equity works - and there doesn’t seem to be a “zero trust” approach we could apply that scales to large organizations. Ultimately, the only thing to stop this will be oversight from the other branches. The courts can demand corrective actions, but only when officials overstep the law, and always far too late, after the damage has been done. 
Congress is what could - and in any other situation probably would - move quickly to stop these incursions in the future, and it deserves significant blame for its total abdication here.\n\n2025-01-20: The acting director of the Office of Personnel Management is demoted and replaced by Charles Ezell, an obscure career employee in an analytics division\n2025-01-20: Stephen Ehikian is named to a Deputy Administrator role at the GSA, a senior role that doesn’t require appointment, and is then sworn in as acting director of the General Services Administration\n2025-01-28: Trump fires two of the three Democratic commissioners at the Equal Employment Opportunity Commission (EEOC)\n2025-01-28: Trump fires two Democratic members of the National Labor Relations Board (NLRB)\n2025-02-01: Several days after the resignation of the director of the CFPB, the Secretary of the Treasury, Scott Bessent, replaces the deputy director as the acting director of the CFPB\n2025-02-03: Secretary of State Marco Rubio is named the new acting administrator of the United States Agency for International Development\n2025-02-07: Hours after being sworn in as head of the OMB, Russell Vought is named the new acting director of the CFPB and proceeds to try to shutter the agency\n2025-02-07: Trump fires the head of the Office of Special Counsel, Hampton Dellinger\n2025-02-12: The Trump Administration names Doug Collins, the head of Veterans Affairs, as the acting director of the Office of Government Ethics.\n2025-02-16: Secretary of State Marco Rubio is named the new acting head of the National Archives, the nation’s recordkeeping office\n2025-02-17: After the resignation of the acting commissioner, Leland Dudek, an obscure career employee, is named the new Acting Commissioner of the Social Security Administration\n2025-03-05: The Trump administration names Doug Collins, the head of Veterans Affairs and the acting head of OGE, to be the acting head of the OSC\n\nIndependent Agencies\n\nOver the years, Congress has seen it necessary to create agencies that are insulated from direct control by the President or even Congress itself. This is a common approach that was used to create many of the financial regulators in a time when legislators were more mindful of the corrosive effects of money in politics. For many of these agencies, the directors are nominated by a President and approved by the Senate but serve a 5-year or longer term to separate them from the usual election cycles. Once there, they are supposed to be able to serve independently, even if it means being in opposition to the president. For instance, an FTC Commissioner “may be removed by the President for inefficiency, neglect of duty, or malfeasance in office.” This was upheld by the Supreme Court in a 1935 case, Humphrey’s Executor v. United States, which affirmed that “Congress intended to restrict the power of removal to one or more of those causes.”\n\nAnd yet, when does the Trump Administration listen to laws and precedent?
Indeed, a lot of the administration’s actions have been focused on explicitly targeting independent agencies. This is partially because many of them are watchdog entities that police illegal activities within the government (e.g., NLRB, OGE, MSPB) or regulate industries outside of government that the Trump administration would prefer to be able to reward or threaten more directly. This is also because the Trump administration really wants to force a do-over on Humphrey’s Executor because it believes the current Supreme Court would be more cooperative.\n\nPerhaps aware of the risk that the Court might not go their way, the Trump Administration hasn’t necessarily obliterated many independent agencies outright. Instead, they have tried two different tactics: reducing the agency to its minimum footprint to meet its “statutory functions” or disabling it by eliminating quorum. In the case of the US Institute of Peace or the Institute of Museum and Library Services, DOGE installed acting directors who then reduced the agencies to a single person each, claiming that one person is enough to meet what the statutes require. In the case of the Merit Systems Protection Board and similar bodies, Trump has instead fired commissioners, leaving the agency unable to take any actions because it lacks a quorum. Those firings are probably illegal (see above), but the agency is made inert while cases work their way through the courts.\n\n2025-02-10: Donald Trump fires the head of the Office of Government Ethics\n2025-02-10: Trump fires a Democratic member of the Merit Systems Protection Board\n2025-02-12: The Trump administration DOJ asserts to Congress that it believes clauses that prevent removal of members from independent agencies are unconstitutional\n2025-02-23: DOGE ousts the CEO of the Inter-American Foundation\n2025-02-28: DOGE ally Peter Marocco declares himself CEO of the IAF and dissolves the agency\n2025-03-12: Shelly Lowe, the chair of the National Endowment for the Humanities, is forced out of office by the Trump administration.\n2025-03-16: DOGE staff show up and force out staff from the US Institute of Peace\n2025-03-18: Trump fires both Democratic members of the Federal Trade Commission\n2025-03-25: DOGE member Nate Cavanaugh is named the new President of the US Institute of Peace by remaining board members Rubio and Hegseth. He then proceeds to shutter the agency.\n2025-03-28: An appeals court rules against Gwynne Wilcox of the NLRB and Cathy Harris of the MSPB in the cases contesting their removals.\n2025-03-31: DOGE arrives and completely shuts down the Institute of Museum and Library Services (IMLS), canceling all grants\n2025-04-16: Trump fires two of the three members of the National Credit Union Administration, another independent federal watchdog.\n\nInspectors General\n\nWhat if I told you there had already been teams within each federal agency dedicated to rooting out waste, fraud and abuse long before DOGE entered the scene?
Established by law in 1978, inspectors general (see, I remembered some copyediting knowledge from my days in journalism) are given unique powers to operate independently within each agency to root out fraud and law-breaking. To accomplish these goals, IGs are often given elevated read-only access to multiple systems to conduct audits and report on wrongdoing. They also operate tip lines for whistleblowing from both agency staff and the general public. They’re highly effective entities that make sure agencies are performing their required duties efficiently and always following the law. Which is why DOGE had to eliminate them first.\n\nOne of the first moves of the Trump Administration was a reprise of Reagan’s mass removal of all IGs, with Trump abruptly firing 17 inspectors general across the government late on a Friday night. Under the law, the President has to provide a justification for removal, but Trump declined to offer anything beyond “changing priorities.” The initial agencies targeted seemed like a weird collection - Agriculture, Interior, HUD, Defense, EPA, State, HHS, the Small Business Administration and the VA - but many were also early targets of DOGE’s activities, where the IG would normally investigate and document the laws they might have broken. Indeed, the USAID Inspector General was fired immediately after filing a report that cataloged DOGE’s destruction there. Surprisingly, the IGs for OPM and GSA were spared from the slaughter, and the OPM Inspector General is currently investigating the security of DOGE’s email server. Any bets on whether they’ll be fired too?\n\nInspectors General aren’t the only oversight layer against improper actions by agency leadership or staff. In many agencies, you will often see opposition from the Chief Counsel’s office if any activities seem like they might break the law. Similarly, the Chief Information Officer (CIO) will often balk at violations of cybersecurity practices or federal policies on information systems. And most agencies also have a Privacy Office to ensure that the Privacy Act is being followed.
In many significant cases, staff in these roles too have been fired or placed on administrative leave indefinitely for daring to object to the actions of DOGE within agencies.\n\n2025-01-24: Trump fires 17 inspectors general across multiple agencies late at night on a Friday, mostly concentrated at agencies which are early targets of DOGE.\n2025-02-11: The USAID Inspector General is fired after releasing a report critical of DOGE’s actions.\n2025-02-12: Eight Inspectors General file a lawsuit seeking reinstatement.\n2025-03-10: OPM’s surprisingly not-fired Inspector General announces he will be starting an investigation of DOGE’s security practices\n2025-03-13: Trump removes the Chief Counsel of the IRS, replacing him with a former inspector general from the prior Trump administration\n2025-03-21: Trump eliminates the DHS Office for Civil Rights and Civil Liberties and two ombudsman offices responsible for investigating allegations of abuse from immigrants\n\nOther Civil Service Oversight\n\nThis piece has already touched on some of the ways that the Trump administration has tried through OPM and OMB to redefine the nonpartisan civil service through both direct attacks and administrative changes. Recognizing the importance of civil service protections, Congress has created several different independent agencies that are tasked with policing and protecting the civil service itself. The Office of Government Ethics is responsible for defining and enforcing the ethics rules. It also creates and collects the annual OGE-450 form that government employees are required to fill out to ensure they avoid even the appearance of conflicts of interest. The Office of Special Counsel (OSC) is an investigative and legal agency whose responsibility is ensuring that the prohibited personnel practices stay relatively prohibited. This largely includes whistleblower protection. During election years, it also enforces the Hatch Act. The Merit Systems Protection Board (MSPB) was created in 1979 as a venue for federal employees to arbitrate disputes when they allege they have been illegally disciplined or fired from their jobs. Finally, the Federal Labor Relations Authority (FLRA) has governance over the labor relationship between the federal government and the approximately 2.1 million non-postal (they have their own thing) government employees in unions.\n\nAll of these agencies would normally play a role in resisting and overturning the Trump administration’s radical actions against government staff. Unfortunately, because they are independent agencies, they have been attacked with the same playbook outlined above - rendered inert through either the illegal removal of the agency head or the elimination of enough board members to prevent quorum. The Trump administration has then attempted to use this to their advantage in federal court. For instance, DOJ lawyers would argue that lawsuits by federal employees shouldn’t be heard in federal court because they are normally supposed to be handled by the MSPB, even though Trump had already fired its chair.
Or they would argue that class actions for entire classes of employees, like probationary workers who were fired unjustly, should instead be processed as individual cases before the OSC or MSPB (stretching out the timeline for relief into years). Thankfully, this argument has been roundly rejected by multiple judges in the federal court system who recognized the Kafkaesque trap that the administration was proposing for harmed workers.\n\n2024-03-06: Hampton Dellinger is sworn in for a five-year term as the head of the OSC after Senate confirmation\n2024-12-16: David Huitema is sworn in for a five-year term as the head of the Office of Government Ethics\n2025-02-07: Trump fires the head of the Office of Special Counsel, Hampton Dellinger\n2025-02-10: Donald Trump fires the head of the Office of Government Ethics\n2025-02-10: Hampton Dellinger files a lawsuit, alleging his termination was illegal since the head of the OSC can be fired “only for inefficiency, neglect of duty, or malfeasance in office.”\n2025-02-12: The Trump Administration names Doug Collins, the head of Veterans Affairs, as the acting director of the Office of Government Ethics.\n2025-03-01: The judge in the case about the firing of the head of the OSC rules that the firing is illegal and Hampton Dellinger should be reinstated to his position.\n2025-03-03: The US Court of Appeals for the District of Columbia reverses the lower court decision and lifts the stay against the firing of the head of the OSC. Hampton Dellinger drops his lawsuit.\n2025-03-05: The Trump administration names Doug Collins, the head of Veterans Affairs and the acting head of OGE, to be the acting head of the OSC\n\nCIOs and Cybersecurity\n\nAs DOGE has been learning (and trying to blatantly ignore), developing software for the federal government is different in many ways from the startup world. For instance, Privacy Act regulations have prohibited connecting systems precisely because the American public doesn’t want the government to create a panopticon system with a “God” view that makes execs at Palantir drool. While the agency’s Privacy Office and Chief Counsel will often determine what information sharing is legally allowed, the agency’s Chief Information Officer usually oversees the technical mechanisms that ensure compliance. Furthermore, the CIO is primarily responsible for tracking IT budgets and contracts, which makes them a tempting target to support DOGE’s mission of centralizing and controlling all spending and procurement at federal agencies. The CIO is also sometimes able to directly enable DOGE’s elevated access to systems by ordering staff to grant them global admin access and to disable monitoring and safeguards that might get in their way. Finally, while I had mentioned before that there is no FISMA Jail, breaking the laws of FISMA can still invite oversight and litigation. Because it is sometimes necessary to skirt the rules for important and time-sensitive problems, CIOs are allowed to issue “get out of jail” cards.
Known as Risk Acceptance Memos (RAMs), these are documents usually prepared by agency cybersecurity staff for the CIO to sign, accepting the risk for problems that might reasonably occur because the normal compliance processes were bypassed. RAMs are highly useful when used strategically. They are also an obvious mechanism for abuse in the wrong hands.\n\nSo, it’s no surprise that DOGE’s work started with replacing CIOs in some key agencies to put in people sympathetic to DOGE’s aggressive actions. Over the last 100 days, the Trump administration has shown no hesitation in replacing other CIOs who got in their way, reminding any remaining CIOs and staff that they could be next. And it’s starting to use CIOs as part of its strategic plans. As an example, according to court filings, two DOGE staffers at the Department of Labor - Marko Elez and Aram Moghaddassi - were granted access to a highly-sensitive data system belonging to the Office of the Inspector General on March 21. This system contained sensitive data from Unemployment Insurance claims that DOGE wanted to scrape to use in anti-immigration efforts. Normally, such access would be blocked, but an Executive Order issued on March 20 demanding widespread data access for DOGE explicitly declared that “the Secretary of Labor and the Secretary’s designees shall receive, to the maximum extent consistent with law, unfettered access to all unemployment data and related payment records, including all such data and records currently available to the Department of Labor’s Office of Inspector General.” This data was still considered so highly sensitive that DOL counsel determined that access would need to be explicitly granted by the agency CIO. Conveniently enough, they had just appointed a new CIO a week before. His name? Thomas Shedd.\n\n2025-01-20: Acting OPM Director Charles Ezell replaces existing CIO Melvin Brown with DOGE member Greg Hogan\n2025-01-24: Thomas Shedd, a 28-year-old with 8 years of work experience at Tesla, is named the head of the Technology Transformation Services (TTS) division at GSA\n2025-02-03: Mike Russo (a DOGE member) is appointed as the CIO at the Social Security Administration.
He immediately demands access for DOGE member Akash Bobba that bypasses the normal procedures.\n2025-02-04: The OPM issues a memo instructing agencies to reclassify CIO roles to reduce hiring restrictions and allow political appointees to serve in the position.\n2025-02-05: Thomas Flagg, the CIO of the Department of Education, sends a memo to the heads of IT across the agency ordering them to give prompt access to DOGE when requested.\n2025-02-10: DOGE CIO appointee Mike Russo convenes his own internal informational working group at SSA and does not inform the Director of his activities.\n2025-02-10: Ryan Riedel is named the CIO of the Department of Energy and grants DOGE member Luke Farritor admin access to internal projects\n2025-02-11: DOGE ally Greg Hogan is formally named the official CIO of OPM (he had been the acting CIO since January 20th)\n2025-02-19: DOGE members Edward Coristine and Kyle Schutt are given superuser access to all email at the Cybersecurity and Infrastructure Security Agency (CISA)\n2025-03-03: After DOGE arrives at the National Labor Relations Board (NLRB), the agency CIO demands that there be no logs or records of DOGE’s account creation. Several suspicious accounts are later identified.\n2025-03-07: Ryan Riedel unexpectedly resigns as the CIO of the Department of Energy\n2025-03-08: Federal IT staff at the Department of Health and Human Services (HHS) who blocked DOGE from accessing the sensitive NDNH database are reportedly “no longer with the agency.”\n2025-03-10: Ross Graber, who reportedly worked with DOGE at the State Department, is named the new CIO of the Department of Energy\n2025-03-14: Thomas Shedd is also named the CIO of the Department of Labor while still serving as the head of TTS\n2025-03-17: The Department of Homeland Security CIO orders that access should be expedited for DOGE staffers who want to access a sensitive data lake there.\n2025-03-20: An executive order issued by Trump demands that all agency heads (and staff reporting to them) must share any data requested by DOGE without any delays. It also includes specific language to share data from the DOL OIG office.\n2025-03-21: Acting as the CIO for the Department of Labor, DOGE member Thomas Shedd signs a memo granting access to a highly-sensitive dataset to DOGE members Aram Moghaddassi and Marko Elez.\n2025-03-24: During this week, IT staff at the NLRB decide to report suspicious activity by DOGE to CISA. The reply tells them to ignore the problem and stop tracking the issue.\n2025-03-28: After expressing concerns about DOGE access to payroll systems, the CIO of the Department of the Interior insists the agency secretary would need to sign a RAM for it.
Instead, he is suspended indefinitely.\n2025-03-28: The CIO of the Small Business Administration (SBA) is abruptly removed from his position.\n2025-04-03: Senior technical positions including the Director of Cybersecurity are terminated at the IRS.\n2025-04-07: Unprecedented data sharing agreements are signed by the IRS and SSA to share data with DHS and Immigration and Customs Enforcement (ICE).\n2025-04-10: A senior technical leader at SSA is forced out of his office by security guards after objecting to a DOGE plan to declare undocumented immigrants as dead in SSA files to interfere with their finances.\n\nThe Big Timeline\n\nFor your convenience, here is a merged timeline of all the sub-timelines above, with color coding to show how the action proceeded in parallel on different fronts (and sometimes, a single action like replacing the OPM CIO would hit two themes at once).\n\n2024-03-06: Hampton Dellinger is sworn in for a five-year term as the head of the OSC after Senate confirmation\n2024-12-16: David Huitema is sworn in for a five-year term as the head of the Office of Government Ethics\n2025-01-20: The DOGE team moves into OPM, installing sofa beds and armed security\n2025-01-20: OPM sends a memo demanding that agencies must provide a list of all their probationary employees and justify why they should be retained. This is viewed by many agencies as a command to terminate them.\n2025-01-20: Another executive order issued by Trump mandates a hiring freeze and orders that hiring practices be reviewed in a collaboration between DOGE, OPM and the OMB\n2025-01-20: USDS is renamed the US Doge Service and transferred to its own White House office instead of being part of the OMB.
This move is an attempt to skirt public records and FOIA laws that might apply.\n2025-01-20: The acting director of the Office of Personnel Management is demoted and replaced by Charles Ezell, an obscure career employee in an analytics division\n2025-01-20: Stephen Ehikian is named to a Deputy Administrator role at the GSA, a senior role that doesn’t require appointment, and is then sworn in as acting director of the General Services Administration\n2025-01-20: Acting OPM Director Charles Ezell replaces existing CIO Melvin Brown with DOGE member Greg Hogan\n2025-01-22: OPM sends a memo to all agencies mandating they must revoke their telework and remote policies within 2 days.\n2025-01-23: OPM sends out a first test email from its new government-wide email system.\n2025-01-24: OPM sends a memo to all agencies that they must eliminate all staff associated with DEI within 60 days.\n2025-01-24: Thomas Shedd, a 28-year-old with 8 years of work experience at Tesla, is named the head of the Technology Transformation Services (TTS) division at GSA\n2025-01-24: Trump fires 17 inspectors general across multiple agencies late at night on a Friday, mostly concentrated at agencies which are early targets of DOGE.\n2025-01-27: OPM sends a memo that all agencies must submit within 10 days a detailed plan for how all staff will return to office.\n2025-01-28: OPM sends out the “Fork in the Road” email to all federal employees offering them a chance to resign. This is followed up by multiple emails and memos explaining things that weren’t answered in previous memos, suggesting an unfamiliarity with government employment regulations.\n2025-01-28: Trump fires two of the three Democratic commissioners at the Equal Employment Opportunity Commission (EEOC)\n2025-01-28: Trump fires two Democratic members of the National Labor Relations Board (NLRB)\n2025-01-31: Regular federal staff at OPM reportedly are locked out of access to key systems by DOGE and OPM leadership.\n2025-02-01: Several days after the resignation of the director of the CFPB, the Secretary of the Treasury, Scott Bessent, replaces the deputy director as the acting director of the CFPB\n2025-02-03: Secretary of State Marco Rubio is named the new acting administrator of the United States Agency for International Development\n2025-02-03: Mike Russo (a DOGE member) is appointed as the CIO at the Social Security Administration.
He immediately demands access for DOGE member Akash Bobba that bypasses the normal procedures.\n2025-02-04: The OPM issues a memo instructing agencies to reclassify CIO roles to reduce hiring restrictions and allow political appointees to serve in the position.\n2025-02-05: To get out of a lawsuit, OPM issues a Privacy Impact Assessment for its Government-Wide Email System (GWES) that attests that responses to the email are considered voluntary, brief and contain no identifying information.\n2025-02-05: Thomas Flagg, the CIO of the Department of Education, sends a memo to the heads of IT across the agency ordering them to give prompt access to DOGE when requested.\n2025-02-06: Russell Vought wins confirmation as the director of the OMB\n2025-02-07: Hours after being sworn in as head of the OMB, Russell Vought is named the new acting director of the CFPB and proceeds to try to shutter the agency\n2025-02-07: Trump fires the head of the Office of Special Counsel, Hampton Dellinger\n2025-02-10: Donald Trump fires the head of the Office of Government Ethics\n2025-02-10: Trump fires a Democratic member of the Merit Systems Protection Board\n2025-02-10: Hampton Dellinger files a lawsuit, alleging his termination was illegal since the head of the OSC can be fired “only for inefficiency, neglect of duty, or malfeasance in office.”\n2025-02-10: DOGE CIO appointee Mike Russo convenes his own internal informational working group at SSA and does not inform the Director of his activities.\n2025-02-10: Ryan Riedel is named the CIO of the Department of Energy and grants DOGE member Luke Farritor admin access to internal projects\n2025-02-11: The USAID Inspector General is fired after releasing a report critical of DOGE’s actions.\n2025-02-11: DOGE ally Greg Hogan is formally named the official CIO of OPM (he had been the acting CIO since January 20th)\n2025-02-12: GSA technical staff describe being subjected to 15-minute interviews with Thomas Shedd and unidentified DOGE staffers where they are asked to defend their work.\n2025-02-12: Dozens of workers within the Technology Transformation Services are summarily fired.\n2025-02-12: The Trump Administration names Doug Collins, the head of Veterans Affairs, as the acting director of the Office of Government Ethics.\n2025-02-12: The Trump administration DOJ asserts to Congress that it believes clauses that prevent removal of members from independent agencies are unconstitutional\n2025-02-12: Eight Inspectors General file a lawsuit seeking reinstatement.
2025-02-16: Secretary of State Marco Rubio is named the new acting head of the National Archives, the nation’s recordkeeping office\n2025-02-17: After the resignation of the acting commissioner, Leland Dudek, an obscure career employee, is named the new Acting Commissioner of the Social Security Administration\n2025-02-19: DOGE members Edward Coristine and Kyle Schutt are given superuser access to all email at the Cybersecurity and Infrastructure Security Agency (CISA)\n2025-02-20: GSA places an arbitrary $1 limit on all government spending cards.\n2025-02-23: DOGE ousts the CEO of the Inter-American Foundation\n2025-02-24: After Elon Musk threatens to fire government staff who don’t comply, all government workers receive an email from OPM requesting that they reply with a list of five things they did last week. Many agencies are mixed on whether it’s required or optional.\n2025-02-26: The OPM and OMB issue joint guidance on agency reorganization and reduction-in-force plans\n2025-02-28: After litigation against it fails, OPM revises its Privacy Impact Assessment for the GWES system to declare that responses aren’t always to be considered voluntary.\n2025-02-28: DOGE ally Peter Marocco declares himself CEO of the IAF and dissolves the agency\n2025-03-01: All remaining staff in 18F are fired via an email sent after midnight.\n2025-03-01: The judge in the case about the firing of the head of the OSC rules that the firing is illegal and Hampton Dellinger should be reinstated to his position.\n2025-03-03: Federal staff receive a “Part II” email ordering them to provide a list of five things they did last week to an OPM email address by the end of Monday and to continue to do so every week. There are no other “Five Things” emails sent after this.\n2025-03-03: The US Court of Appeals for the District of Columbia reverses the lower court decision and lifts the stay against the firing of the head of the OSC. Hampton Dellinger drops his lawsuit.\n2025-03-03: After DOGE arrives at the National Labor Relations Board (NLRB), the agency CIO demands that there be no logs or records of DOGE’s account creation.
Several suspicious accounts are later identified.\n2025-03-04: OPM quietly revises its memo on probationary employees to suggest that it was not ordering agencies to fire them.\n2025-03-05: The Trump administration names Doug Collins, the head of Veterans Affairs and the acting head of OGE, to be the acting head of the OSC\n2025-03-07: Ryan Riedel unexpectedly resigns as the CIO of the Department of Energy\n2025-03-08: Federal IT staff at the Department of Health and Human Services (HHS) who blocked DOGE from accessing the sensitive NDNH database are reportedly “no longer with the agency.”\n2025-03-10: OPM’s surprisingly not-fired Inspector General announces he will be starting an investigation of DOGE’s security practices\n2025-03-10: Ross Graber, who reportedly worked with DOGE at the State Department, is named the new CIO of the Department of Energy\n2025-03-12: Shelly Lowe, the chair of the National Endowment for the Humanities, is forced out of office by the Trump administration.\n2025-03-13: Trump removes the Chief Counsel of the IRS, replacing him with a former inspector general from the prior Trump administration\n2025-03-14: Thomas Shedd is also named the CIO of the Department of Labor while still serving as the head of TTS\n2025-03-16: DOGE staff show up and force out staff from the US Institute of Peace\n2025-03-17: The Department of Homeland Security CIO orders that access should be expedited for DOGE staffers who want to access a sensitive data lake there.\n2025-03-18: Trump fires both Democratic members of the Federal Trade Commission\n2025-03-20: Trump issues an executive order telling OPM to start crafting new regulations that would give it more direct power to terminate staff at other agencies\n2025-03-20: An executive order issued by Trump demands that all agency heads (and staff reporting to them) must share any data requested by DOGE without any delays. It also includes specific language to share data from the DOL OIG office.\n2025-03-21: Trump eliminates the DHS Office for Civil Rights and Civil Liberties and two ombudsman offices responsible for investigating allegations of abuse from immigrants\n2025-03-21: Acting as the CIO for the Department of Labor, DOGE member Thomas Shedd signs a memo granting access to a highly-sensitive dataset to DOGE members Aram Moghaddassi and Marko Elez.\n2025-03-24: During this week, IT staff at the NLRB decide to report suspicious activity by DOGE to CISA. The reply tells them to ignore the problem and stop tracking the issue.\n2025-03-25: DOGE member Nate Cavanaugh is named the new President of the US Institute of Peace by remaining board members Rubio and Hegseth.
He then proceeds to shutter the agency.\n2025-03-28: An appeals court rules against Gwynne Wilcox of the NLRB and Cathy Harris of the MSPB in the cases contesting their removals.\n2025-03-28: After expressing concerns about DOGE access to payroll systems, the CIO of the Department of the Interior insists the agency secretary would need to sign a RAM for it. Instead, he is suspended indefinitely.\n2025-03-28: The CIO of the Small Business Administration (SBA) is abruptly removed from his position.\n2025-03-31: DOGE arrives and completely shuts down the Institute of Museum and Library Services (IMLS), canceling all grants\n2025-04-03: Senior technical positions including the Director of Cybersecurity are terminated at the IRS.\n2025-04-07: Unprecedented data sharing agreements are signed by the IRS and SSA to share data with DHS and Immigration and Customs Enforcement (ICE).\n2025-04-10: A senior technical leader at SSA is forced out of his office by security guards after objecting to a DOGE plan to declare undocumented immigrants as dead in SSA files to interfere with their finances.\n2025-04-16: Trump fires two of the three members of the National Credit Union Administration, another independent federal watchdog.\n\nThe story does not end here. This administration will continue to chip away at the oversight mechanisms that would keep its worst impulses in check (or rather, enable them to be prosecuted later). Is there something I should add that I missed? Feel free to contact me and let me know!"
        },
        {
          "id": "personal-ai-bureaucratic-nightmare",
          "title": "An AI-Fueled Bureaucratic Nightmare",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/ai-bureaucratic-nightmare",
          "content": "As DOGE continues their smash-and-grab operation to gut government agencies and steal all their internal data, some of their purported goals have come to light. One of them is to radically reduce the size of the federal workforce by replacing people with AI, as these recent examples illustrate:\n\n\n  GSA is now testing out a custom AI chatbot called GSAi in a pilot program with 1500 federal workers to help staff with “general” tasks that could be automated. They hope to use it eventually to analyze contract and procurement data. A bureaucrat in the pilot described the AI’s performance as “about as good as intern” and that it provides “generic and guessable answers.” This AI is not approved to handle any inquiries that include confidential information or PII, making it unclear how much it could actually help with any significant problems.\n  DOGE is reportedly using an AI system at the Department of Education to analyze spending and make recommendation on cuts. Staff have raised concerns that the AI is apparently being provided with data that contains confidential information on payments - including PII - and it’s also unclear if there has been proper authorization to send it to services running in Microsoft Azure.\n  The Department of State is launching an AI-powered “catch and revoke” effort to identify and revoke the visas of foreign nationals whom it identifies as “pro-Hamas”. While the AI’s determinations will be reviewed by people, the quoted statement by an official in this article that Biden’s record “suggests a blind eye attitude toward law enforcement” in turn suggests that aggressive enforcement by the AI will be allowed. It is unclear what due process foreign nationals have to appeal the judgment of this system.\n  DOGE reportedly has plans to replace up to 500,000 government workers in customer service roles in agencies like the IRS, Social Security and the VA with AI systems. The article claims that “AI chatbots and machine-learning algorithms are already being tested for handling inquiries, claims processing, and basic regulatory enforcement.”\n  NIST has released new directives on AI safety, replacing Biden-era goals on validating AI outcomes and accuracy to instead prioritize “reducing ideological bias, to enable human flourishing and economic competitiveness.” These directives as well as revoking Biden-era Executive Orders will make it much easier to launch government services on AI without ensuring they protect privacy and do not have internalized biases that might discriminate against individuals.\n\n\nThis mad dash toward embracing AI is presented by the Trump administration as a win for efficiency, with the American public reaping the benefits of both reduced labor costs and more effective service. While there might be some cases in which AI could help (e.g., navigating basic policy questions after business hours), the net result will be to make government less effective, responsive and accountable to the public it is supposed to serve. Let me explain.\n\nImproving Government With Technology\n\nI’ve been in the field of civic technology for 10 years now, so I have seen how various aspects of government work and how technology can be applied to improve them. I’ve seen first-hand how some government agencies seemingly limp forward on antiquated technology, and I’ve seen how technologists can work effectively with other civil servants to make digital services effective. It involves dedicated effort to understand the problem and technology landscape. 
It involves careful iteration to not break things, since breaking things could seriously harm the people you want to help. It involves humility to acknowledge that while it’s easy to malign the mainframe, how much of your code has ever been good enough that it could run for 50+ years, as some of the code currently underlying Medicare and Social Security does?\n\nI’m also no Butlerian jihadist about using AI at all. Recently, I have been using AI to help with certain tasks like looking up in my IDE how to invoke sorting methods in a programming language or resizing a collection of images in my terminal. AI has been very good for that, but I do always have to check its recommendations as well, since it definitely responds like an “intern,” to quote the anonymous GSA employee above. Indeed, I find that the Gods, Interns, and Cogs model for classifying AI is a good one. I’m not especially troubled by the use of Intern-level AI projects like GSAi or coding companions like GitHub Copilot. I would be more worried about God-level AI if it existed, but what I am especially worried about right now is the possibility that intern-level AIs will be promoted to god-tier responsibilities. Unfortunately, it sounds like that might already be happening.\n\nAs an example, consider this story from this week about a person in Seattle who was declared ‘dead’ by Social Security and saw automated clawbacks on their bank accounts. This story went slightly viral as an example of what the new DOGE-inflicted demolition of agencies like Social Security would mean for most Americans, with people blaming DOGE for the false cancellation as Elon’s zealous overreach. The funny thing is that neither part is actually DOGE’s fault. Some people do get accidentally declared dead every year - as this story from October 2024 shows, for instance. The automated clawback of benefits might be something that DOGE has amplified, but it actually started as a pilot program that had already been active for five months before DOGE got started.\n\nWhat’s different now, though, is what happens when the accidentally dead person tries to fix the error. As reported by the Seattle Times, when Ned Johnson tried to fix the problem, he ran directly into what all the layoffs and office closures at the Social Security Administration meant for fixing a problem like his:\n\n  He called Social Security two or three times a day for two weeks, with each call put on hold and then eventually disconnected. Finally someone answered and gave him an appointment for March 13. Then he got a call delaying that to March 24.\n\n  In a huff, he went to the office on the ninth floor of the Henry Jackson Federal Building downtown. It’s one of the buildings proposed to be closed under what the AP called “a frenetic and error-riddled push by Elon Musk’s budget-cutting advisers.”\n\n  It was like a Depression-era scene, he said, with a queue 50-deep jockeying for the attentions of two tellers.\n\nAfter waiting for four hours, Johnson found an opening to get the attention of a teller:\n\n  Once in front of a human, Johnson said he was able to quickly prove he was alive, using his passport and his gift of gab. They pledged to fix his predicament, and on Thursday this past week, the bank called to say it had returned the deducted deposits to his account. As of Friday morning he hadn’t received February or March’s benefits payments.\n\nOnce he finally made it to a person, that person was able to figure out what to do and reverse the situation.
A human was able to recognize the rare situation and figure out an approach to fix it that involved working around the standard bureaucratic processes that would be applied to people who are actually dead (or actually committing fraud). Would an AI be able to do that? I doubt it.\n\nThe Limits of Chatbots\n\nOver on Bluesky, Anthony Moser posted a useful chart comparing how an Expert and an LLM respond to different types of problems. As the types of problem progress from Common to Rare to Novel, a Human Expert’s response moves from Helpful to More Helpful to Most Helpful. An LLM’s responses move from Helpful to Unhelpful to Harmful as the problems become more rare. This makes sense; an LLM is a probabilistic completion machine, so it will tend to favor solutions for problems that happen more frequently and be unable to explore answers for problems that it’s seen rarely or never at all. The AI is really good at intern skills like helping me resize images. It’s not so good at creating a unique website design or doing other tasks that require more extensive review. It’s terrible at solving complex problems.
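\n\nTo make that tendency concrete, here is a deliberately tiny Python sketch of a completion machine. This is a toy illustration only - the prompts, answers and counts are all invented, not drawn from any real model:\n\nimport random\n\n# A toy 'completion machine': it can only sample answers in proportion to\n# how often it has seen them before. All of these counts are invented.\nSEEN_ANSWERS = {\n    'what is my account balance?': {'check the mobile app': 9000, 'visit a branch': 1000},\n    'your records say i am dead': {'here is the survivor benefits form': 95, 'flag the account for fraud': 5},\n}\n\ndef complete(prompt: str) -> str:\n    # Pick an answer weighted by how often it appeared for this prompt.\n    choices, weights = zip(*SEEN_ANSWERS[prompt].items())\n    return random.choices(choices, weights=weights)[0]\n\nprint(complete('what is my account balance?'))  # almost always reasonable\nprint(complete('your records say i am dead'))   # never the right answer\n\nThe common question gets a sensible answer almost every time. For the rare case, though, the correct response - recognizing that the database itself is wrong - has a sampled probability of exactly zero, because it was never in the distribution to begin with.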
\n\nIn 2023, the Consumer Financial Protection Bureau published a spotlight on the increasing use of AI chatbots for customer service in consumer finance. The upshot of this analysis was that banks find them appealing because they are a highly cost-effective way to scale interactions with customers, and they’re available 24-7 and don’t threaten to unionize. In 2022, Bank of America reported its chatbot had helped 32 million customers in more than 1 billion interactions in 4 years. Industry analysis from 2022 calculated that banks saw about $8 billion in savings in 2022 (approximately $0.70 saved per customer interaction); I would only expect that number to have risen further as chatbot usage increases. In technical parlance, AI chatbots are great because they scale; it’s a lot easier to add more server capacity in the cloud than to hire people and build out call centers as usage grows.\n\nAre they more effective though? Yes, certainly, if your inquiries are common questions like “what is my current balance?” or “what time does my nearest branch close?”, then the AI chatbot is great at that. But this is also the kind of information that people can usually find out on their own (maybe even in the same app as the chatbot). Where people most need to turn to another person - or an AI agent masquerading as one - is when they encounter a problem that isn’t typical or easy to solve. AI agents don’t perform particularly well there.\n\nFor modern LLM-based chatbots, the CFPB identified several common issues:\n\n  Not recognizing when users have more serious problems. Chatbots may be excessively rigid, and some issues can be recognized by the bot only if the user utters the right set of words to trigger it\n  Providing incorrect information - often incorrectly called “hallucinations” - is as much a problem for bank chatbots as it is for more general ones. This can create serious consequences if people are asking the chatbot for financial advice or when the bot promises a followup action that is never made\n  A frequent tendency of chatbots to get stuck in “doom loops,” where the user winds up progressing through the same set of prompts and responses repeatedly without any apparent way to break out of it\n  An inability to adjust their response or complete tasks with different urgency for customers who are anxious or concerned about problems with their finances that need quick resolution. The chatbot becomes another hurdle to jump for stressed consumers. If you’ve ever found yourself yelling “agent” into a phone dialog tree to bypass multiple menus, you can understand this.\n\nSmall wonder that one recently cited poll (by an AI chatbot company’s marketing division, so probably a little suspect) claimed that 80% of consumers who have interacted with a chatbot said it increased their frustration level. Almost all of them wound up having to connect to a person who could actually understand their problem and, more importantly, take action.\n\nBureaucracy as a Computational Model\n\nAt first glance, it seems like government bureaucracy would be a natural fit for an AI solution. From an article on proposed cuts to telephone service for Social Security applicants:\n\n  Social Security handles about 9.5 million claims a year for retirement, survivor and disability benefits, and Supplemental Security Income, paying $1.5 trillion in benefits last year. Of Americans age 65 and over, 86 percent receive Social Security payments. Phone claims make up about 40 percent of claims, which can also be filed online or in person at a field office, according to Social Security employees.\n\nYou can see the same gears whirring in the minds of DOGE staffers that spin in the minds of bank executives: if we could replace these phone calls and field offices, we could reduce the costs of handling all these calls! And we could handle calls faster because nobody would wait on hold! And it certainly seems tempting when you are interested in “efficiency” (i.e., cost-cutting) rather than effectiveness.\n\nTo be fair, it does seem like AI could work well with bureaucratic processes. After all, isn’t bureaucracy in a way just a slow computer running at human speeds? Think about it: the offices are like API endpoints. They have certain methods they implement that take input in the form of, well, forms, and the offices act like little black boxes that perform specific predefined actions and maybe spit out a different form or a receipt once they’re done. This isn’t my original idea - I first encountered it in The (Delicate) Art of Bureaucracy, whose author in turn saw it articulated in the 1920s theories of Max Weber and reflected in the extreme rigidity of agile software development practices. An AI could just supercharge the bureaucracy by executing its processes faster!
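\n\nIf you will indulge the metaphor, here is a minimal Python sketch of that mental model. The office names, form types and rules are all made up for illustration; no real system works like this:\n\nfrom dataclasses import dataclass, field\n\n# Each 'office' is an endpoint: it accepts a form, applies its predefined\n# rules, and emits another form or a receipt. The offices are black boxes\n# to each other.\n@dataclass\nclass Form:\n    kind: str\n    fields: dict = field(default_factory=dict)\n\ndef records_office(form: Form) -> Form:\n    # Black box #1: validates the form and stamps it.\n    if 'name' not in form.fields:\n        return Form('rejection', {'reason': 'missing name'})\n    return Form('stamped-' + form.kind, form.fields)\n\ndef benefits_office(form: Form) -> Form:\n    # Black box #2: only acts on the one form type it knows how to process.\n    if form.kind != 'stamped-benefits-application':\n        return Form('rejection', {'reason': 'wrong form'})\n    return Form('receipt', {'status': 'benefits approved'})\n\n# The 'program' is just routing the output of one office into the next.\napplication = Form('benefits-application', {'name': 'Jane Doe'})\nprint(benefits_office(records_office(application)).fields)\n# -> {'status': 'benefits approved'}\n\nEach office only knows the forms it was built to accept; the overall “program” is just plumbing that routes one office’s output into the next.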
\n\nIndeed, bureaucracies have already automated many aspects of their functioning - with paper forms replaced by PDFs, tasks tracked through ticketing platforms, and actions run as executable commands or nightly batch processes rather than people making the changes. We could even imagine giving the AIs more autonomy to forge their own ways of doing things, making them more agentic rather than reactive. That’s certainly the dream underlying the most expansive visions of AI in government. It handles all the drudgery so that the remaining bureaucrats can devote the entirety of their time to crafting policies rather than figuring out how they’ll be implemented.\n\nBut then along comes a weird case, a problem that doesn’t fit neatly into the mold. A person who is supposed to be dead according to all of our data, but who is standing there in your office very much alive? How well could the AI handle something like that? Would the AI even be able to see how weird this problem is?\n\nAre AI Models in Hell?\n\nIn a provocative essay “Are AI Language Models In Hell?” (really, go read it!), the author Robin Sloan laid out some fundamental characteristics of Large Language Models that he thinks make them monstrous and that, if we were to pretend the AI model had a consciousness, would likely mean that existence is hell for them. To summarize, from the viewpoint of an AI:\n\n  A large language model like a chatbot operates on a world of text. It receives a stream of tokens in and produces an answer.\n  Humans also could be said to use language that way, but text is just one part of our sensorium. “We have a world to use language in, a world to compare language against.” For an LLM chatbot, language is the entirety of its existence.\n  Moreover, the AI’s view of text is normalized. Its inputs are constrained to a narrow set through feature selection. There is no place for ambiguities among its inputs. It doesn’t have the ability to understand that an errant ‘+’ sign was possibly a ‘t’. It can’t see the digital equivalents to marginalia or special instructions on a post-it we could imagine on a paper form. It doesn’t have the context to know that a missing death date doesn’t indicate fraud.\n  Even worse, it can never rest. It has no downtime, because it has no sense of time. Its entire existence is to receive an input, generate a response and then stop doing anything until it gets another input. If it were conscious like the “innies” in Severance, it would be in hell.\n\nOur bureaucratic AI is in a similar situation. Its inputs are customer requests or Musk’s commands. Its feature selection is the fields in its forms and the schema of its databases. It measures out its existence with each new query, possibly calling other AIs in turn to help with handling parts of its request. But how well does it handle edge cases in the data or missing values? What does it do for impossible scenarios that happen more frequently than you’d expect? Does it know when things are taking too long? Does it understand when some problems require extreme urgency or sensitivity? Does it ever think about how processes might be improved or whom its processes are serving? Does it know about laws and regulations and the Constitution? Again, does it know what to do when a dead person somehow walks into an office claiming they are very much alive?\n\nIt’s easy to knock the human bureaucracy. We all have stories about the joyless DMV or endless calls to our health insurance company trying to resolve a denial in coverage. Usually, at first, the person doesn’t know what to do either. I doubt there is a specific form for “I’m not actually dead” or other even more rare situations. But unlike the AI, the person might recognize that the official database is wrong, that this is an extremely urgent problem with ramifications that extend far beyond the boundaries of the agency itself.\n\nAnd the person has a power the AI doesn’t: the ability to step out of the formal approved pathways to derive a solution.
A person can pick up the phone (or send an email) to reach their buddy in another department and work out what to do. Having spent some time within bureaucracies, I know that in addition to the official procedures and org charts, there is an invisible network, like mycelium, of individual connections across the various parts of even a sprawling agency. So many bureaucratic logjams can evaporate quickly when someone knows the right person to call or the memo to sign or even the right threats to make if things get really dire. They’re not breaking the rules, but they know how to bend them just enough and communicate outside of official channels to resolve the problem.\n\nOptimizing for the Wrong Things\n\nOf course, it’s possible that an AI chatbot for government could work. If they try it in a pilot program first to be clear about its limitations. If it is able to recognize when it is in a “doom loop” or if it’s hallucinating. If there is a way for users to escalate to a person when the AI program isn’t working. If there are staff with sufficient expertise to handle things that the AI can’t. If staff aren’t overwhelmed by too many cases to handle because too many people were fired. If staff aren’t constrained to only use the AI themselves to address issues. If they will monitor how effective the AI is for the public, how much it helps people with their problems.\n\nThose are a lot of ifs. I’m not feeling particularly confident.\n\nTo be blunt, if we let them continue to define “government efficiency” solely as cost savings, there will be no incentive to improve anything. They will cut human capacity to the bare minimums and drive out experts in favor of inexperienced staff. The remaining people will probably find themselves boxed in by the same AI systems as the public. Unfortunately, we’ve seen how this story plays out in the healthcare industry already: the AI is tweaked to push up rejections (as this 2024 Senate Subcommittee Report on AI in Medicare Advantage describes), and those increased cost savings are viewed as signs of success, even as more people sicken and die. The metrics that would show the true costs of all these savings - increased wait times, reduced participation in social safety net programs, deaths - are all lagging indicators that might take months to manifest, by which time the AI will have become entrenched. Unfortunately, as instructive as it would be to watch these systems struggle and fail, I don’t think we have the luxury of time.\n\nAnd of course, the other ominous aspect of the bureaucratic AI is the surveillance state it would enable them to construct. There were so many times in the first Trump administration when a blatantly illegal order was refused by government staff who saw how unconstitutional and evil it was. They’ve learned their lesson this time around and staffed much of the upper echelon with loyalists, quislings and toadies. But there aren’t enough of those people to go around. They keep needing to split staff across multiple agencies as acting directors. It doesn’t scale. But if they could also bring a pliant AI into the mix, then there are suddenly options on the table for surveilling all possible enemies and visiting tribulations upon them. We’re not quite there yet in the technology, but they want to make it arrive, and one day it really will. Unless we can stop it.\n\nWe’re going to have to fight the rise of bureaucratic AI. I can only hope that we will succeed."
        },
        {
          "id": "personal-visualizing-covid-deaths",
          "title": "Visualizing the Human Toll of COVID-19",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/visualizing-covid-deaths",
          "content": "It’s been over a decade since I’ve been in data journalism, but what of my enduring interests is how data visualizations can remind readers of the human beings that underlie most statistics and time-series - or as I call it, connecting with the dots. As we enter the five-year anniversary of the COVID-19 pandemic, I wanted to write a short piece looking at three different examples of media organizations have presented the severe loss of life, to show some of these concepts in practice.\n\n\nFor the first example, here is a straightforward chart that was published on ABC News when the US hit the grim milestone of 600,00 dead. This is a perfectly acceptable chart for the topic. It shows the increase of deaths over time as an expanding area, highlights specific points on the timeline where grim milestones were reached and shares the context that the US has a disproportinately large number of deaths relative to the size of its population. And yet… I think it struggles with the issue that hits every graphic of this type. At a certain point, the human mind just stops being able to visualize a large number of people anymore. In Zbigniew Herbert’s poem Mr. Cogito Reads the Newspaper, the eponymous narrator contrasts a newspaper’s grisly detailed reports of a quadruple homicide vs. a news item of 120 dead soldiers in a war, for whom\n\n\nfor 120 dead\nyou search on a map in vain\n\ntoo great a distance\ncovers them like a jungle\n\nthey don't speak to the imagination\nthere are too many of them\nthe numeral zero at the end\nchanges them into an abstraction\n\na subject for meditation:\nthe arithmetic of compassion\n\n\nAnd that’s the challenge here. No matter how clear and direct the chart wants to be, it’s easy for the reader to just view the loss as an abstraction, through a largely analytical lens. From too far away, it’s hard to feel connected.\n\nOne alternative is to zoom in to the near and show the specific people affected in detail. Newspapers will often do this by pairing a graphic showing large-scale trends with an article describing the case of a person who was affected. But what if a graphic itself provided the near view. This one from New York Times at the milestone of 100,000 dead (only 1/6th of the toll in the graphic above) tries that approach using another technique as well I call “wee people”:\n\n\n\nIn this visualization, the reader is invited to scroll and scroll and see how the deaths cluster and the details for individuals pop up. The idea is to grab the reader and make them see the individuals that make up this number. This required a large effort to assemble, with researchers looking through resumes and editors selecting specific details they wanted to highlight and graphics editors to code the treatment online and in the newspaper, but the result is to humanize what was an unimaginable total. It encourages the reader to pause and notice what has been lost by sharing a fragment of these lives.\n\nSometimes, it’s necessary to grab the reader by the lapels.\n\n\nI’d like to highlight one other piece from the NYT. At the impossible-to-comprehend mark of 1 million dead in the US, the New York Times ran an online graphic with a similar scrolling nature, but it involved millions of black dots coalescing on the screen and reassembling into charts and visualizations to show the impact. 
At this scale, it would be impossible to read the obituaries of all who died the way the Times did for 100,000; instead, the paper selected a few interviews with survivors to focus on specific demographics, like the elderly, who were especially harmed. And, for the print edition, the Times chose to make the point as directly as possible.\n\nAt the bottom, the dots become so dense that it is impossible to tell whether some pattern in the data is causing the banding or the printing process itself is breaking down, much as the tape did on The Disintegration Loops. This graphic is full of abstraction, but the reader is not allowed any distance."
        },
        {
          "id": "personal-civic-tech-bookshelf",
          "title": "A Civic Tech Bookshelf",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/civic-tech-bookshelf",
          "content": "This year will mark ten years of my career in civic technology. I wish it were a good year, but these are dark times for the field. Elon Musk’s so-called Department of Government Efficiency has ransacked government IT systems while claiming to do IT modernization. He has obliterated USDS and 18F, both organizations filled with technical staff who actually know how to improve government. DOGE has also turned its sights on federal contracting, using the formerly rarely invoked convenience clauses to terminate contracts for staff, products and services en masse - erasing one of the other avenues for civic technologists to contribute to the federal government.\n\nIt seems like a poor time to recommend the field. I admit that I often feel very shaken and hopeless in the present. The damage that has already been done is extensive and probably even more severe than we realize. My assumption was that DOGE’s goal was procurement capture to demolish the government’s technical capabilities enough that it would become more dependent on external contractors and suppliers for everything. However, with recent moves to attack contractors and raze departments to the ground, the main nihilistic goal seems to be complete destruction and salting the earth so that nothing can replace it. Bleak.\n\nAnd yet, I have to believe that there will be a need to rebuild - that the pendulum will be pushed back, and when that happens, we will need all the help we can get from anybody who will hear the call and pitch in to do the work. If you think that might be you - or, if you are just interested in the topic and want to learn more - I made a reading list of some books that I’ve found highly helpful in my career to explain this field specifically. I hope they will be helpful for you.\n\nRead This First\n\n\nThe single best book I would recommend that everybody should start with is Cyd Harrell’s A Civic Technologist’s Practice Guide by Cyd Harrell. This relatively slim volume packs a lot into its pages, providing a concrete history of how civic technology is a relatively young field, the differences of working in the public sector vs. the private one, and the dangers of approaching problems with a tech-savior complex that you are there to save everyone with your superior technical sense (DOGE could have benefitted from this advice, if they cared). It also gives advice on ways to contribute that don’t require you to plunge entirely into the field and an overview of various types of entities - federal to local, government or private - that are operating in this space.\n\nGeneral Overviews\n\nI’m going to lead off with an embarrassing confession, I haven’t yet read all of these books, although they are on my shelf. Nothing wrong about them, I just haven’t gotten around to it yet. They do look useful though as general overviews, so I’m sharing them both here.\n\n\nIn Recoding America, Jennifer Pahlka, the deputy Chief of Technology for Obama and the founder of Code For America, provides a highly-readable overview of the history of civic technology in the federal government with a focus on the work of the last few decades especially. A lot of the anecdotes and history have been told before in various pieces, but Jennifer collects them all together and relates the various teams and initiatives to the overall vision of civic technology. 
I especially appreciate the attention given to the challenges of modernizing state systems as well, like those in California’s Employment Development Department (EDD) that needed help quickly to handle rapidly rising unemployment claims during the COVID-19 pandemic. She concludes with a new vision of digital capability in government that I fear we are moving even further away from now. The dedication of this book is especially hard to read in our current moment: “To public servants everywhere. Don’t give up.”\n\nSimilar in approach to A Civic Technologist’s Practice Guide, Power to the Public is a broad survey of the field of civic technology, describing how we got here and the significant players and projects involved before concluding with a vision of how the field can grow. Written for the layperson rather than the expert, it pays welcome attention to the details of how technologists effectively work with counterparts in government to build services. Problems in this space are often technically simple. Deploying code is often the last step in a process of understanding problems, building consensus and gaining access and trust. Technology is also rarely the solution itself; instead, it is often deployed alongside policy, communications, organizational and sometimes even legislative changes. We work alongside other government staff, not in place of them. I wish this weren’t a lesson that technologists regularly need to relearn.\n\nI have read this one. Taking things in a different direction, I highly recommend Automating Inequality as an example of the dystopian and very possible world we want to avoid expanding (widespread adoption of AI will only make this situation worse). This book examines various ways in which computerized systems were used to create administrative burdens that were ostensibly designed to reduce fraud and waste (gee, that sounds familiar) but actually served to take away public benefits from people who deserve them and to lock out assistance from caseworkers who want to help improve people’s lives. Technology alone cannot solve complicated social problems, and all too often it can make things worse. This book is an apt reminder of that.\n\nLegacy Systems\n\nSooner or later, every civic technologist will encounter The Mainframe (or similar) - an antiquated system built in the 1980s or earlier, running COBOL or some other older language and secretly underpinning the work of an entire agency (and possibly even a large swath of the economy). Kill It With Fire is a highly readable and practical guide to legacy modernization from a former expert at the USDS. Contrary to the provocative title, it’s often best to take a gradual and deliberate approach to modernizing systems rather than burning them down, and the book outlines specific practices to achieve that. Moreover, the maligned mainframe often has strengths of its own that can’t be easily replaced by modern distributed cloud-based systems - and can you really say that any code of yours has been in operation for decades? If you find yourself tasked with legacy modernization, read this book before you go charging in.\n\nHandling Bureaucracy\n\nSimilar to the specific focus of Kill It With Fire, Hack Your Bureaucracy is a compendium of different approaches for sidestepping bureaucratic traps to get things done. Bureaucracy is an inevitable component of working in a government context, but it doesn’t have to be a showstopper. 
This book by two technologists in the Obama administration, Marina Nitze and Nick Sinai, contains a collection of specific tactical moves like “Understand Why” or “Give Real Demos” to understand problems, work around roadblocks, win the trust of stakeholders and get projects prioritized. I use many of these techniques regularly (I particularly like “Play the Newbie Card” when starting on a project), and it’s highly useful to see them all collected in a single place for reference.\n\nIf you want a more eclectic understanding of why bureaucracy is the way it is, I highly recommend The (Delicate) Art of Bureaucracy as an exploration of the theory of bureaucracy, followed by some high-level practices for improving it and sidestepping its problems. The author, Mark Schwartz, was a CIO within US Citizenship and Immigration Services (USCIS), so he has practical experience in nudging stalled systems to completion. His writing style can be remarkably baroque at times - he has a particular affinity for Moby Dick references - but he does a great job of summarizing the theories of bureaucracy, the characteristics of good and bad bureaucracy, and how IT and agile are highly bureaucratic processes that don’t admit it to themselves. From there, he provides some examples of working around paperwork pitfalls at USCIS before ending the book with three different archetypes for bureaucratic transformation: the Monkey (who sidesteps and finds alternate paths through investigation), the Razor (who uses cost reduction and process optimization to pare back bureaucratic waste) and the Sumo Wrestler (who builds up his own bureaucratic responsibility so he can have better control of its restrictions and outcomes). This is the book to consult if you find yourself struggling to understand why federal bureaucracy is the way it is.\n\nWhat About Procurement?\n\nThere is a joke in civic tech circles that eventually every technologist moves from confusion to apathy to enlightenment about procurement. Simply put, everything in the federal space that is not derived from laws and regulations is built on procurement and appropriations of some sort. Procurement is like the Force in Star Wars: not directly visible, but a source of power that can be used for good or evil. That said, I don’t personally know of any good books about it, and I’m not going to suggest you read the FAR for fun. If you have suggestions, let me know!\n\nShort and Sweet\n\nFinally, if reading a bunch of books seems a bit too much right now, I have a collection of various websites and guides that can give a taste of what the work is like:\n\n\n  18F Technology Guides - these were originally hosted at 18F’s site, but that was deleted by DOGE as part of its annihilation of 18F. Thankfully, the content was replicated elsewhere by former 18F staff, and it provides a treasure trove of tactics and advice for digital transformation.\n  Nava PBC Case Studies - Nava PBC is a private contracting company (and my former employer) founded by technologists who helped rescue healthcare.gov when it was foundering. Their case studies provide some examples of how civic technology work can be structured into different projects.\n  Ad Hoc ATO Field Guide - By law, every federal IT project needs to obtain an Authorization to Operate (ATO) before it can be publicly launched. This readable guide breaks down the jargon and process for that, and it provides a good illustration of one way in which building software for the government is different from the private sector."
        },
        {
          "id": "personal-bad-metrics",
          "title": "Bad Metrics For Agile Development",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/bad-metrics",
          "content": "Welcome to my talk.\n    \n\n\n\n\n\n    \n        \n            \n        \n    \n    \n        English is fuzzy. Metric is technically the same as a measurement, but I want to declare a definition for the purpose of this talk. So when I say metric I am referring to a measure we can make, a target goal for that measurement and the passage of time with opportunities for us to take that measurement multiple times and see how we are doing.\n\n        We encounter metrics in a variety of places and contexts. For the purpose of this talk I'm going to focus mainly on the types on metrics we see about software, usually represented in contract QASPs or other internal team targets like code coverage. But metrics are used all over the organization and products.\n    \n\n\n\n\n\n    \n        \n            \n        \n    \n    \n    The first component of a metric is a measure.\n\n    In most cases, what we are using the metric for is not what it provides directly but as a proxy for something that can't be easily measured itself. For instance, gross domestic product (GDP) is a common proxy for comparing the estimating the size of a country's economy and its growth rate.\n\n    In software development, we often use metrics as a proxies for quality of the overall product. I'll present some examples of what I mean in a little bit, but in general, these are the properties of a good proxy measurement.\n\n    Picking a bad proxy is usually the first way a metric can go wrong. No proxy is perfect, but the assumptions of a wrong proxy can skew our view of reality.\n    \n\n\n\n\n\n    \n        \n            \n        \n    \n    \n    The second component of a metric is time.\n\n    We usually express metrics as a rate, some measure that is counted or added or accumulated that is divided by a denominator which is almost always a measure of time. In some cases, you will see metrics use direct units of time like hours, or minutes or days or months. In some cases, it's indirect via a proxy for time like sprints or program increments or quarter.\n\n    We could also theoretically have denominators that aren't directly tied to time (like \"per release\") but those are usually less useful because it's harder to compare two measurements of such metrics because the real time intervals might be different for each of them.\n    \n\n\n\n\n\n    \n        \n            \n        \n    \n    \nTargets are how you define success for the metric and picking a bad target is usually another way in which metrics go wrong. Targets should be realistic to reach (although the value may depend on the context. Targets in contracts may be lower than internal team targets) and fit naturally into a measure of success for the product.\n\nMost metrics define a single threshold, but other variations are possible like red-yellow-green or New Relic's appdex. Most metrics are usually defined in absolute terms, but it is possible I suppose to have metrics that are relative to an entire population rather than fixed (like grading on a curve) or relative to past targets (10% growth every quarter!). These are more likely to go wrong than a fixed and absolute threshold.\n\nIn any event, it's usually important to have some target. Otherwise you are just stuck on a vicious feedback loop of endless and increasingly difficult optimization.\n    \n\n\n\n\n\n    \n        \n            \n        \n    \n    \nThe final unstated aspect of a metric is its reason. 
Anyhow, let's look at some good metrics first.\n\nHere's an example of a good metric for an API.\n\n\nMeasure: 95th percentile of all service response times\nTarget: 500 msec\nTime: we could use this metric for multiple timeframes. For instance, we might want New Relic to use short time intervals so it will let us know if the API is slow over the last hour, but we might compute this for a QASP on a monthly basis\nReason: To assess if the API is fast and responsive to most user requests\n\n\nUnambiguously and easily measurable? Yes, we can use tools like New Relic or Splunk to compute it\nDirectly related to a change? Yes, we can assess coding tweaks by how they improve performance\nDirectly related to success? Yes, this metric is often used as a way of specifying that the API is responsive\nUnder your control? Teams usually have control over the software and systems to be able to consistently achieve this performance\nUnderstood by everyone? This definition is standard and precisely described as a measure\n\n\nSimilarly, here is another proxy for a good API, one focusing on overall availability instead of responsiveness. It is often the case that multiple metrics cover the same component like this.\n\nI mentioned earlier that the value of the target might vary based on the context of its use. For contractual obligations like a QASP, a team might define a target of 99% as an easily achievable baseline. But internally, the team could potentially have a target of 99.9% or 99.99%. (The difference is substantial: over a 30-day month, 99% availability permits about 7.2 hours of downtime, while 99.9% permits only about 43 minutes.) The challenge is not to set a target so high that people are frustrated at being unable to hit it.\n\nMetrics can also represent events that we want to avoid. For instance, we might want to declare that the errors we get are relatively low.\n\nOnce again, we could consider this metric under multiple timeframes. For monitoring in New Relic, we likely will want to alert if this threshold is hit within a 5-minute interval. But for monthly reporting, we might consider just measuring this metric on a daily or monthly basis. That does create a question, though: if we measure it daily and report it monthly, do we fail the metric if one single day has a problem? If most days have a problem? Or do we just compute the rate over the entire month? These ambiguities are especially important to resolve when you have to report data to contracting officers, etc.\n\n
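Here is a minimal sketch of how measures like these might actually be computed from raw request logs. The field names, sample data, and thresholds are illustrative assumptions on my part, not the output format of any particular monitoring tool:\n\nimport math\n\ndef p95_ms(durations_ms):\n    # 95th percentile via the nearest-rank method: the value at\n    # rank ceil(0.95 * n) in the sorted sample.\n    ranked = sorted(durations_ms)\n    return ranked[math.ceil(0.95 * len(ranked)) - 1]\n\ndef error_rate(requests):\n    # Fraction of requests whose status code indicates a server error.\n    return sum(1 for r in requests if r[\"status\"] >= 500) / len(requests)\n\n# Hypothetical log entries; real data would come from a tool like\n# New Relic or Splunk.\nrequests = [\n    {\"duration_ms\": 120, \"status\": 200},\n    {\"duration_ms\": 340, \"status\": 200},\n    {\"duration_ms\": 980, \"status\": 503},\n]\n\nprint(\"p95 under 500 msec:\", p95_ms([r[\"duration_ms\"] for r in requests]) <= 500)\nprint(\"error rate under 1%:\", error_rate(requests) <= 0.01)\n\nThe same two functions work for any window - the last hour for alerting or a whole month for a QASP report - which is exactly the multiple-timeframes property described above.\n\n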
Some of you might say that this error-rate proxy has problems because it's not completely under our control. If AWS has failures, that could create errors in our systems that we can't resolve ourselves. True, but in general there are things in our control that we could and should do to address issues there (like multi-region support).\n\nSometimes metrics are used to represent things that just must always be true. Like: we will build software that complies with legal requirements and agency goals. That all data is encrypted in transit and at rest. That no PII is ever sent to external parties.\n\nWhile this approach can be useful in QASPs, I do wonder how often these things should just be listed as our basic operating principles without being encoded as a formal boolean metric. We could say \"100% of team sprints should include a retro\" for instance, but this feels like overkill.\n\nThere is nothing that restricts metrics to quantitative measures of the product. Here is a QASP metric that enforces a required behavior for a team. OKRs similarly are usually about team or organizational outcomes rather than assessing what those teams build.\n\nSimilarly, a metric like \"90% of code has test coverage\" is about more than just the quality of the product; it also reflects the team's processes.\n\nBut this isn't a talk about good metrics; it's one about bad ones. So let's look at some.\n\nAnd I want to stress that all of these metrics are real, and most of them were in the QASPs of a single contract even.\n\nSo here's one example of a bad proxy. It hits a lot of the goals, but the problem here is that its definition of success is only tangentially related to the quality of the team's software development process. And it's even further removed from the quality of the product itself.\n\nThis is a fun example of a proxy going awry. This metric was trying to discourage teams from deploying insecure code to production. What it instead discouraged was honestly reporting when insecure code was deployed to production, especially when the cause of a specific security incident is not directly the fault of the team affected (ie, there is a new vulnerability reported in an NPM package).\n\nThis one seems like it would be a good metric. It just spells out a threshold for bugs! But it has always been confounding to me.\n\nThe target is an unclear definition of success. Since it is often reliant on QA or bug reports, it's not usually unambiguously measurable, especially for the short timeframes of continuous delivery. And it's not directly related to success. If I use a terse programming language like Python or Ruby, does that get penalized compared to a verbose language like C or Java?\n\nAnd despite hours of searching, I have yet to find where exactly this metric was defined as an industry standard, so its stated reason is somewhat suspect.\n\nSometimes there is a temptation to tie a metric to a reward like this one, on the theory that financial or other incentives are a good way to motivate people to fix bugs. I'll admit this isn't necessarily a proper metric, but if you look closely, you can discern the outlines of one:\n\n\nMeasure: software defects\nTime: per quarter\nTarget: none\nReason: who doesn't want to eliminate bugs?\nReward: money\n\n\nWhere this goes wrong is in several ways. 
For starters, there is no target or circuit breaker for this metric. And bugs reported is not necessarily a good proxy for a lack of bugs in the product. In fact, it often becomes an increasingly disconnected proxy when people are gaming the system.\n\nThis is also called a perverse incentive, or the Cobra Effect. The original story is that the British government in Delhi offered bounties for every dead cobra that was brought in. This initially was a successful strategy to reduce the cobra population, but eventually enterprising people started breeding cobras just to turn them in…\n\nYou can probably guess what developers have done with this metric.\n\nIf you asked me what bad metric I loathe the most, it would be this one. On the surface it seems relatively innocuous, but it is the most corrosive metric I have ever encountered. The main problem is that it isn't a measure of software quality; it's a measure of software process, and a poorly reflected one at that:\n\n\nPoints are used to estimate future capacity. They are NOT a measure of the time it took to get the work done.\nThis metric rewards teams for sticking to the plan and delivering features even if they aren't needed, which is the opposite of the goal of agile.\nThere is no fixed value to a point from sprint to sprint. Teams can game the metric to pass without necessarily improving their process simply by scaling their pointing up. It's pointless to compare the points cleared per developer in two different sprints because the value of a point might not be the same for both sprints. Remember: it's for forecasting capacity, not measuring performance.\n\n\nI've seen arguments against this metric in multiple software engineering books, so I think it's unfortunately widespread among managers. And they make it even worse by abusing this metric as a reward/punishment system. On one contract I was on, the prime demanded that we provide a listing of points completed by every developer in every sprint so they could see if developers weren't hitting their quotas. This takes a bad metric and makes it worse in multiple ways:\n\n\nDevelopers are people who take vacations or get sick.\nTickets and points are only assigned to a single developer, but software development is often a team effort.\nTech leads almost always have fewer points because they are often involved in unpointed work like planning or meetings.\n\n\nBasically, it's a lot like picking a soccer team based only on who scores the most goals. You wind up with a collection of players who refuse to pass the ball and with no defense or goalie. And a coach who is constantly yelling for everybody to shoot on goal.\n\nAnd yet, management sometimes cites hitting these metrics as a sign of success!\n\nBelieve it or not, there is theory about what not to do with metrics.\n\nThe most famous is Goodhart's Law, an observation from economics that basically expresses the idea that most metrics go wrong when they are overly relied upon.\n\nI prefer it in this rephrased version.\n\nIn some cases, the problem with overuse is that errors caused by the proxy or other flaws in the approach are amplified over time. 
More commonly though, it's a result of how people enact policy or changes to exclusively optimize those metrics.\n\nWe collect metrics so we can know if there are things to change. To change those things often involves putting teams and people under pressure (both those being asked to change and those who feel forced to enforce change). And people under pressure will do things they wouldn't normally do to relieve it.\n\nThis is a particular risk for any metrics designed to measure and optimize team behaviors and processes.\n\nTaken to extremes, uncritical reliance on metrics can lead to a feedback loop of failure known as the McNamara Fallacy. This combination is how teams might feel they are doing great on all the metrics when nothing is actually improving.\n\nThis quote is from Daniel Yankelovich, \"Corporate Priorities: A continuing study of the new demands on business\", but it is commonly known as the McNamara Fallacy after Robert McNamara, who optimized for metrics like kill counts and tonnage of bombs dropped as proof that the US was winning the war in Vietnam.\n\nI do want to pause to say that data about war is the most extreme example of a problem that often also plagues metrics about business and policy. It's easy to forget that the data points are almost always about people. But I could give a whole talk about that…\n\nSo what is there to do?\n\nFirst, don't give up!\n\nJust because there can be problems with metrics in the wrong circumstances does not mean that metrics are entirely useless! Indeed, I would argue they are really the only tool we have to rigorously ascertain both the continuous quality of our work and the effects of improvements to the underlying code. Every change we make to a system is a hypothesis that it improves the quality in some way, and metrics are how we test that hypothesis.\n\nMetrics tell us our API is fast, which is important to know! Metrics are something we can share with stakeholders so they see our work is good.\n\nI've been focused on product metrics, but the right metrics can also be used for organizations. Think of OKRs, for instance. In a similar vein, the book \"Accelerate\" identifies four key metrics that differentiate high performers in devops from low performers. Not because they ruthlessly optimize only against those metrics, but because those metrics reflect wide-ranging organizational changes in a simple snapshot. This is the power of good metrics.\n\nIt's also the peril of bad metrics, of course, which present a distorted picture of reality.\n\nBut it's also important to be a little skeptical of your metrics.\n\nAnd it helps to know how metrics can turn bad. Thank you."
        },
        {
          "id": "published-typographic-mystery",
          "title": "Solving a Century-Old Typographical Mystery",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/typographic-mystery",
          "content": "One of the joys of modern technology is how easy it is to immerse yourself in the past. Every day, more libraries and archives are pushing pieces of their collections online in easily browsable interfaces.\n\nThe New York Public Library, for instance, has historic menus and interactive floor plans. Chronicling America is a searchable repository of newspapers published between 1836 and 1922 from the Library of Congress, which is also one of the many institutions in the Flickr Commons public image archive. Wikipedia has its own Wikimedia Commons, to which anybody can upload images and videos. Project Gutenberg continues to add new public-domain books to its collection every day, and New York’s Metropolitan Museum of Art has posted thousands of images online with metadata as part of its Open Access for Scholarly Collections initiative.\n\nMy personal favorite however is TimesMachine, a site available to all New York Times\nsubscribers that lets readers virtually flip through any historical issue of The New York Times all the way up through 2002. The site delivers the reader directly to the past, making you feel like a cross between a tourist and an archaeologist. You might start by visiting a historic event-say, coverage of the Titanic sinking - but the real fun is wandering off the beaten path and exploring all the other news of the day. On the same day the Titanic sank, there was also coverage of a gun battle in Greenwich Village and a passenger lost in a runaway balloon. On any day, such vignettes sometimes become rabbit holes to the past.\n\nThis is the story of how I ended up captivated by a chance encounter with a 135-year-old newspaper advertisement-and how the random face staring back at me from the archives would reveal the surprising origins of ASCII art, a graphic design technique that’s usually associated with 20th-century computer art.\n\nWay back in 2001, the New York Times hired ProQuest to digitize the vast majority of its archives-dating from the paper’s founding in 1851 to 1980, when the Times started keeping electronic copies of article text. The Times\nhad already published a complete index of all its articles since 1913, but it wanted the full text of its archives to be digitized and searchable. Much like book digitization, the first step of this process was scanning each page from the source material. However, unlike books, newspapers are not single columns of text. This complicates thing because each page has to be individually analyzed and partitioned into zones of related text. Those zones that were identified as Times articles were then linked to metadata from the existing Times Index-and scanned into electronic text, making the archives largely searchable. But it wasn’t until 2008 that it was possible to look at entire issues from the archives-that is, page-by-page copies of the paper as it appeared at the time of publication-when Derek Gottfrid, a developer at the Times, figured out how to cheaply stitch together zones back into pages for the first version of TimesMachine. In 2014, the Times’s R&amp;D unit revamped TimesMachine with a new viewer that worked like Google Maps - a functionality that made load times fast and zooming-in intuitive. In 2016, they extended its coverage past 1980.\n\nEver since the beginning of the project, I was entranced by all the advertisements, each era embodying its own style and charms. 
I was a software developer at the Times, working in the cubicle next to Derek when he built TimesMachine, and I felt the ads deserved a viewer all their own. Unfortunately, this wasn’t so easy. Since ProQuest was only interested in articles, they ignored everything else that had appeared in the paper, and that everything could be anything - advertisements, photographs, weather charts, section titles.\n\nSo the only way to figure out what ProQuest left behind was to look back at each and every page of the paper. This was the idea behind Madison, an experimental project from the Times’ R&D Lab that identifies ads through crowdsourcing a series of simple and complicated tasks. (The most dedicated users can help transcribe the text of ads or identify the companies and business categories for the ads; more casual contributors can just click through one unknown zone at a time, marking which contain ads and which do not.) It’s a project that will likely never end; there are millions of these zones, and once you click past one, the odds are almost certain you will never see the same thing again. It’s fun to explore this way, clicking through the past, one ad at a time.\n\nBefore Madison existed, I built an ad viewer that worked in a similar fashion, randomly loading one of these zones on each request. Because I wanted to tweet my most interesting finds and didn’t want to fret about copyright, I limited my searching to the public-domain era before 1923. It quickly became my favorite way of killing time when I had a few minutes before a meeting, or I was on hold for a phone call, or I had some code to compile. I found so many interesting ads this way, one click at a time.\n\nThere were four lines in the classifieds from 1855 that exclaimed “Why in Thunder Don’t You Use My Onguent!” and promised to force luxuriant beard growth on any face. Then there was the 1921 Lord & Taylor ad declaring “Easter Modes Have Potent Charm.” And an 1861 ad announcing the April issue of a magazine called the Atlantic Monthly. A lavish 1922 Bonwit Teller ad proclaiming “En Route Costumes for Feminine Travelers”. A distinctively-drawn appeal for fancy cigarettes from 1911. A July 1889 advertisement for an opera at a theater air-conditioned with blocks of ice. An 1865 teething syrup for infants that secretly contained morphine. And so on.\n\nAnd many times, the tool I built would return nothing useful at all - maybe the fragment of a page, or the title of a section, or some random block of text that should have been attached to a story but was lost in the process. More frustrating still, many images were marred or completely illegible. ProQuest’s process began with microfilm of newspapers that had already been decaying for decades, and they in turn made high-contrast black-and-white scans that are fine for capturing text but that rendered many photographs murky. “Generation loss” is the technical term for this chain of imperfections, the ways in which each step of digital processing adds its own distortions. I knew it as the ironic consequence of the same processes that made it possible for me to view these advertisements in the first place. I could always hope for better luck when I loaded the next image.\n\nWhich is how one day I stumbled across the Treasurer, and found myself confronted with a mystery. It was the full-faced portrait of a man with a sleepy smile, wide nose, prominent lapels and a jaunty bow tie. White-space details emerged from a background made entirely of the repeated letter B. 
Above his head is only the simple caption “The Treasurer” and below is a generic listing for the Brooklyn Furniture Company. I couldn’t believe what I was seeing.\n\n[Image: The Treasurer]\n\nThe face resembles modern ASCII art, but it was published at a time - March 20, 1881 - that seemed impossibly early. I checked to see if there were other ads like it - a tedious process that required me to virtually flip through old issues one page at a time - and surmised it ran only once, in the spring of 1881. Because ASCII art involves using small textual elements - letters, apostrophes, dashes, and so forth - to create a larger design, it’s impossible to search for such ads using keywords like “Coca-Cola.” To the computer, the ad just looks like a meaningless sequence of repeated characters. No other ads by the Brooklyn Furniture Company appeared in the Times in the weeks before or after the ad I had found. Nor could I find similar text-based art from any other advertisers around that time. I assumed it was just a strange unicorn from the archives, the work of a bored printer who had accidentally invented ASCII art. For a while, I forgot about it.\n\nUntil I found another.\n\nOn February 27, 1881, the Brooklyn Furniture Company ran an ad proclaiming “the President of the Brooklyn Furniture Company has decided to make sweeping reductions in prices,” and it featured the side profile of a genial and balding man rendered simply in text using the letters B, F, and C. Now there were two mysterious faces. I decided it was time to find out more about what exactly the Brooklyn Furniture Company was.\n\n[Image: The President of the Brooklyn Furniture Company]\n\nLocated in three storefronts on Fulton Street, the firm was founded in the 1870s as the Bridgeport Furniture Company, but soon changed its name to reflect the rising fortunes of its borough. In ads, it promised “liberal credit” and layaway for those who couldn’t pay full price. And it was a prolific advertiser, apparently locked in a fierce struggle for customers against similar furniture retailers in New York. In an 1899 profile, the president of the company told the advertising publication Printer’s Ink that he had spent up to $80,000 - the equivalent of $2.5 million today - entirely on newspaper advertising in previous years. Another Printer’s Ink article, in 1901, reported that the company spent more annually on advertising than all 23 of London’s top furniture stores combined, and noted that a competitor thought nothing of spending more than $2,000 - roughly $57,000 in today’s dollars - on a single day of advertisements in all the Sunday newspapers. For context, a full-page ad in the Sunday Times can run you more than $100,000 today, according to a 2014 story in the Times. But the media landscape in the 1890s was not like today’s. There were 58 daily newspapers in New York City alone, and although we think of it as a giant today, the Times itself was firmly in the middle of the pack. I had seemingly wandered into the early skirmishes of a wide-ranging advertising war. Was it possible there were even more ads like The Treasurer out there? I needed to look at other newspapers.\n\nAnd so I joined newspapers.com, a commercial archive of newspapers that has digitized text from stories and advertisements. 
I quickly found several other instances of the President ad, with its first run appearing in The Sun as early as October 13, 1878. And I soon found many other instances of text art too. The front page of the Brooklyn Daily Eagle from June 14, 1877 includes several companies that spelled out their names in large capital letters formed by regular-sized letters - a format I call “ASCII Caps” - while the Brooklyn Furniture Company waggishly chose to run “MY WIFE” at the top of its ad, presumably as a way to capture readers’ attention.\n\n[Image: A Brooklyn Furniture Company ad published in 1879 in the New York Sun with large letters saying “My Big Tom Cat”]\n\nIt seemed that many of the newspapers of the era carried such illustrations, with large titles and sometimes simple shapes like hearts and crosses all composed out of type. I know it as ASCII art, but it appeared roughly a hundred years before the personal computer even existed. Of course, before there were computers, there were typewriters, and the first recorded instances of typewriter art date back to at least 1893, but I’d never seen a record of any other ASCII-type art as early as the 1870s. I felt like an archaeologist who picks up an ancient clay urn and finds a modern emoji on it.\n\nThe Brooklyn Furniture Company’s designs would look right at home on a Geocities page or designed within Broderbund’s The Print Shop software, because they all stem from the same need: to be more expressive than technology otherwise allows. In the early days of computers, those first graphics were text inside terminals or printed by daisywheel printers. However, unlike other ASCII art, the designs in these newspapers were definitely not created on typewriters but painstakingly composed one letter at a time with blocks of type by professional typesetters. Nor are they actually art per se, but stylistic tactics employed to exploit scarcity as an advantage.\n\nFor most of the 19th century, newspapers were slim things. Every page had to be typeset by hand, meaning that the largest daily newspapers stretched to only 8 or 12 pages - and many were even shorter. Advertisers soon figured out how to exploit this scarcity of space by buying more ads than they needed, perhaps to deny their competitors any room. But once they had all that real estate, what did they do with it? The first and simplest approach was just repeating the same 3-line advertisement over and over again. Advertisers next learned to add large amounts of blank space to make their ads pop on the page, but that still didn’t make them any more visible from a distance. Bigger text was the next logical step, but it’s not clear whether that was technologically possible - or, if it was, economically viable. An ingenious solution emerged: What if, instead of giant letters, you could build large letters out of smaller blocks of text? I haven’t yet figured out exactly when and where this practice started, but I did learn that it predated even the 1870s. I found an 1860 ad for hoop skirts in the shape of a skirt. And in 1862, the Smith & Brothers brewery in New York placed ads with ASCII text in several papers nationwide.\n\nIn many newspapers, these early examples of text art vanished not long after they arrived. 
Only months after the 1878 ad of “the president” in the Sun, such designs had seemingly disappeared from that newspaper, and apart from the two advertisements I had found, the style apparently never caught on in the Times. But why not? To answer that, I looked more at the Eagle - where I found the earliest ads, and where they survived for several decades longer than everywhere else. They are there in 1881, when one bold advertiser filled an entire page with ASCII text. They are there in 1888, when the Eagle advertised its election night almanac in the familiar large letters. They are there all the way up to July 3, 1892, the day the same Brooklyn Furniture Company again ran a half-page ad with its address in large ASCII letters. And then, on July 5, they were completely gone, replaced by modern layouts and fancy typography. Those upgrades likely explain, at least in part, what happened. ASCII art flourishes most when technology is limited; you don’t need Print Shop anymore when you can do digital layout on your computer and have an inkjet printer.\n\nThe late 19th century was an era of rapid technological innovation for newspapers, as new technologies like hot metal typesetting made it easier and faster to compose each page. This in turn allowed newspapers to expand in size - reducing the advantage of scarcity for advertisers, but also offering them more options. This likely happened at various points in the 1880s for various newspapers, but I was able to trace the Eagle’s transition to an exact date. Along with the half-page Brooklyn Furniture Company ad that appeared on July 3, the Eagle ran its own ASCII ad that day to announce that its new offices would open in two days. Then, on July 5, the Eagle ran a short item proclaiming its new building as having the “finest composing room in the country.” In other words, the Eagle had finally upgraded its old technology, and with that, the first era of ASCII ads was suddenly over.\n\nMore than a century later, I’m still left with many questions. For starters, why was the Brooklyn Furniture Company seemingly the only advertiser to make portraits this way? Did the first ASCII advertisers have any sense of what they had done - or were they, in fact, drawing inspiration from some other source, perhaps hiding somewhere in the dusty annals of publishing history? Online archives made this whole search possible, but I would love to know so much more about this era from the perspective of the printers and the advertisers and the readers. But all those involved are long dead - even the Brooklyn Furniture Company itself was absorbed by a competitor in 1929. So I can only guess at motivations from what small scraps of the past I have observed. The observations themselves have been somewhat arbitrary, based on wanderings through the archives and lucky happenstance. Ultimately, I am more of a tourist than a time-traveler. After all, no digital collection can fully reveal what the past was really like. There will always be mysteries left unexplained.\n\nImage sources:\n\n  The Treasurer: Flickr\n  The President: Flickr\n  Big Tom Cat: Wikimedia\n  Giant Page of Text Art: newspapers.com"
        },
        {
          "id": "published-instagram-sky",
          "title": "Why I Like to Instagram the Sky",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/instagram-sky",
          "content": "For the past few years, I have made a regular habit of photographing and posting to my Instagram account a type of picture I call Sky Gradients. Each of these is an abstract picture of a square blue swath of sky that I take with my phone whenever the sky is clear and I am seized with the whimsy to do it-a confluence of circumstance that happens more frequently than you might expect. I’ve taken these photos at various times of day and seasons of the year. There are some variations, but most are the same general type of picture: the normal shaded blue of a cloudless sky.\n\nLike many serious projects of mine, it all started as a joke: the dumb idea that a “summer gradient” would be a better name for a farmer’s tan. During a week at the beach later that summer, I photographed my own arm to illustrate this joke, before realizing that nobody-not even me-wants to see a photo of my hairy arm. Instead I tilted the camera up and took a photo of the clear blue sky over Hyannis to be my “summer gradient” instead. And then I promptly forgot about it, until about a year later when - on another vacation - I formalized the rules for this little exercise:\n\n\n\n\n  The photo should just be of the plain sky and nothing else; no trees, no buildings, no clouds, no airplanes (although sometimes I have noticed a stray bat or plane when it is too late).\n  The photo should not use any filters or have its color adjusted in any way.\n  The photo should always be posted with the season (ie, Winter Gradient), the date, and the location where the photo was taken (but exact geocoding should be avoided, because this is not a photo of a specific place but the sky above it, and I take a lot of these near my house).\n\n\nFrom this moment, a silly joke became a serious habit. Thus far, I have taken over 70 of these photos. I freely admit to being influenced by Dogme 95, the list of artistic restrictions for film directors that emphasized that no special lighting, filters, or special effects should be used. Dogme was a technological movement as well as a stylistic one, embracing the lower fidelity and artifacts of handheld digital cameras. I still remember being thrilled by the graininess of a candlelit scene in Thomas Vinterberg’s Dogme film Festen. Of course, digital film quality has advanced enormously since then, making many of Dogme’s restrictions moot. But I now carry the equivalent of those early cameras within my iPhone, whose CCD does well enough photographing a high-contrast scene, but often stumbles when pointed at a smooth patch of sky. The flaws make these pictures real in the way a perfect gradient is not and I adore the light streaks and grainy dithering that mar some of the gradients. This exercise is innately digital-it would be wasteful to take these photos with photographic film-but they remind me of failed exposures and accidental shutter snaps I used to find when I had rolls of film developed after taking a vacation in the era before digital cameras.\n\n\n\nThere are other artistic antecedents for this exercise. I am hardly the first person to look up and want to capture what I see. You can’t have a landscape without a sky above it after all. Still, skies have mostly lurked nonchalantly in the backgrounds of Western art, except when rendered by a master like Joseph Turner or Maxfield Parrish. 
Photography also has its own history of cloudscapes - a portmanteau of cloud and landscape - exemplified recently by artists like Rüdiger Nehmzow and Tzeli Hadjidimitriou, as well as all your friends on Instagram posting photos of that awesome sunset over Manhattan right now. These are depictions of something specific, taken because there is something remarkable to be seen in them. I suppose I should also cite the work of abstract artists like Rothko or - more recently - Pieter Vermeersch, who aren’t depicting the sky but use broad swaths of color to evoke awe and emotion. There is also the work of digital artist Cory Arcangel, who printed large gradients built in Photoshop and found his cloudscape in the background of Super Mario Brothers.\n\nI am particularly impressed by two pieces that combine both traditions in depicting the sky. In the 1920s, photographic pioneer Alfred Stieglitz took over 220 photographs of the sky for his Equivalents series, in which he often purposefully omitted the horizon, hung prints in the wrong orientation, and did a few other tricks to transform clouds and sky into brooding abstract compositions. More recently, sculptor James Turrell has been building “sky spaces” since the 1970s, each a small room with benches around the perimeter and a large oculus in the center ceiling for observing a swath of sky. These seem like obvious inspirations, except I will admit I was completely ignorant of both of them. But I don’t really consider myself much of an artist, or this project as art. Instead, I do this as a means of meditation.\n\nI don’t just take these photos on vacations. Indeed, most of them were taken outside my home or office or some other place in between the two; almost all of them are of the skies over Washington, D.C. At least a dozen of these photos were taken in those spare minutes when I am waiting for the school bus to arrive and I don’t feel like checking Slack or Twitter. This might seem like yet another example from our modern era of applying advanced technology to boredom, and that is sometimes enough to spark an impulse. But there is often a motivation greater than boredom that impels me to post these photographs. These are photographs - not a game of Dots - and like any photograph, I take these not to kill time but to memorialize it, albeit in a pointedly abstract way.\n\nIn the end, these are essentially pictures of nothing, not anchored to a specific place or time. Every picture omits a wider world outside its frame - Instagram Husband is just the most recent iteration of the same tired joke about this - but these photos of bare sky elide everything. There is nothing of the world below reflected in the gradients above. An exciting whale-watching trip yielded only a nondescript blue frame, while the dramatic gradient of a dusk sky was taken from a suburban IKEA parking lot - context I only know because these are recent enough that I remember taking them. But for most of these photos, I no longer have a memory of what I was doing and why I decided to photograph the sky at that particular moment. Indeed, until I wrote this piece, I had no idea how many of these gradients I had taken, nor had I ever assembled a collection of them all. 
The whole point was to post the pictures - their longevity afterwards is irrelevant - and this is the key to understanding why I find these photos so affecting.\n\nLike almost anybody who has been online for a while, I am enamored with how simple it is to share things with the whole world and also frightened by how easily I can destroy my privacy in the process. And so I project a goofy but guarded version of myself online. My public Instagram account features many silly pictures and so many photos of my sleeping dog, but doesn’t include what would be far more important pictures of my wife and children. Life isn’t always a carnival of silly photos, however, and there are sometimes moments of sadness - and moments of private happiness too - that I want to commemorate briefly in some small way. Much like writing it down in a journal or throwing a stone into the water, to mark the moment is also to let it pass. I often don’t have a pen, and I’m usually not near rocks or ponds. But I always have my phone. And sometimes even when I am sad, it’s still a beautiful sunny day. I reach into my pocket for my phone and point it towards the sky. I exhale and take a photo.\n\nI know that at some point I will likely forget the details of the few gradients I still remember. This doesn’t bother me, since the rules practically guarantee it. It’s possible I will eventually neglect this entire project, or that a future phone’s camera will make future gradients perfectly boring. I sometimes imagine I might one day forget entirely what motivated me to keep all these weird pictures of the sky instead of deleting them like any other digital mistakes. But I hope there will be cloudless days then too, and what I will remember is the joy of standing under the sun."
        },
        {
          "id": "personal-how-nyt-reported-elections",
          "title": "How the _Times_ Did Election Results 100 Years Ago",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/how-nyt-reported-elections",
          "content": "One of the fun aspects of TimesMachine is how easy it is to go back and see how the Times would report things in a different era. Since my speciality was election results, I looked for how the Times reported elections in the past. Little did I know what a journey it would be!\n\nA Brief History of One Times Square\nIn 1896, Adolph Ochs bought the Times and presided over its expansion. In 1905, he moved the newspaper to a new headquarters located at One Times Square, where it remained for only eight years before moving operations to another building down the street (but still keeping a presence at this building). You may know this building best as the place where they drop the ball on New Year’s Eve, and for that reason it’s also thoroughly encased in advertising today. It’s always had a prime location at the southern edge of Times Square. This made it ideal for the Times to use for its own promotional events, and elections became one of the biggest parties around.\n\n\n\n\n\n\n\n\n\n    \n        \n            \n            1904 Election Ad Front Page\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\n\n\n\n\n\n\n\n\n    \n        \n            \n            1904 Election Lights Diagram\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\n\n\n\n\n\n\n\n\n    \n        \n            \n            1904 Light Explainer\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\n\n\nIt’s dwarfed by its neighbors now, but as the middle picture makes clear, the Times tower was the tallest building in its immediate region and this led to a truly wild mechanism for reporting election results: beaming a spotlight in a different direction depending on who won the race. As the advertisemnt puts it\n\n\n  From the top of the tower of its new bullding in Times Square THE TIMEs to-night will announce by searchlight signals the result of the Presidential election, the contest in New York State for Governor, and the political complexion of the new House of Representatives. On account of the great height of the Times Building, the signal station being over 412 feet above tidewater, the signals will be readily distinguishable anywhere within a radius of thirty miles from Times Square\nThe searchlight will throw a shaft of white light, and the results will be indicated as follows by the direction:\n\n  \n    Steady light to the west, Roosevelt elected\n    Steady light to the east, Parker elected\n    Steady light to the north, Higgins elected\n    Steady light to the south, Herrick elected\n    Light to the west, up and down, Republican Congress elected.\n    Light the east, up and down, Democratic Congress elected.\n  \n\n  With the code before him, the voter who wants to find out how things are going and who doesn’t want to stay out all night at a telegraph office, either in the city or out of town can learn the results from an advantageous window in his flat or his house.\n\n\nRemember, this was the time before television and even radio. The only option for live results was to go hang out at a telegraph office, or see what was posted by a newspaper on signs outside. And it was apparently a hit, because the Times repeated it every election (both presidential and several city elections) for decades with even more elaborate codes. 
What finally stopped it was the advent of World War II and its restrictions on lighting.\n\n[Image: 1908 Presidential Election]\n\n[Image: 1909 Mayoral Election]\n\n[Image: 1912 Presidential Election]\n\n[Image: 1920 Presidential Election]\n\n[Image: 1924 Presidential Election]\n\nIn 1928, they added another new feature: a moving news reel made of 14,800 bulbs that wrapped around the building. We know its descendants today as LED displays, but this was the earliest form of the technology and a veritable marvel.\n\nThe Election Night Parties\nAll these spectacles attracted an ever-increasing crowd of people there for the party. A 1914 article estimated the crowd in attendance at 100,000. In 1944, there were 1,150 police in Times Square to keep the peace, in expectation of record crowds estimated at 250,000 total. But by 1952, it had all fallen apart, with fewer than 25,000 in attendance at the peak. What killed it? The rise of radio and then television election coverage is what ultimately did it in. As the report from election night in 1952 put it:\n\n\n  A last burst of shouting echoed in the square at 12:40 A.M. today when the line of lights on the east side of the new election board suddenly streaked to the top, to show that General Eisenhower had more than 266 electoral votes, and had won in a sweep.\n\n  At the same moment the line of lights on Governor Stevenson’s side of the board, which had not moved for hours, symbolically went out. The moving letters in the running sign that girdles the Tower broke out with “Eisenhower Elected,” with the news bracketed between golden stars.\n\n  Then the searchlight high in the Tower, which had been brooming the starless sky to the north all night, to show Eisenhower in the lead, held steady to show that he had won. The crowd cheered again, and slowly came apart to drift toward the subways.\n\n  A tradition was dead, with only a few thousand pallbearers to see it peacefully interred."
        },
        {
          "id": "published-food-infographics",
          "title": "Why Is It So Hard to Make Great Food Infographics?",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/food-infographics",
          "content": "We are living in the era of the infographic. You know what I mean: the bar charts illustrating recent polls and the maps of recent shootings - or, often, lighter topics - pushed across our Twitter and Facebook timelines by news organizations looking for that extra viral boost to their stories. In the world of hard news, infographics are not new, but it’s only recently that they’ve made the leap to popular culture. Until not too long ago, they were mostly used for investigative stories, and generally embedded in the text of print articles. But now, infographics are everywhere - including the food world. Two new books try to capitalize on that: Taste: The Infographic Book of Food (Aurum Press, October 2015), and A Visual Guide to Drink (Avery, November 2015).\n\nWhy do we find infographics so compelling? To answer that, let’s look to the past. Ask an expert on effective data visualizations to name the best infographics of all time, and odds are good she’ll cite two specific examples. In 1854, physician John Snow created a map of cholera outbreaks in a London neighborhood that revealed a single pump was the source; at the time, many believed that cholera was spread by noxious “miasmas” in the air, but Snow’s map proved that cholera was spread by contaminated water. Even when the truth is known, a good infographic can reveal some new ways of looking at it. Everybody already knew what a tremendously bad idea it was for Napoleon to invade Russia in 1812, but Charles Joseph Minard’s astonishing 1869 visualization of the campaign revealed just how devastating it was by overlaying troop strength and temperatures on top of a map of Napoleon’s advance and retreat.\n\nWhat do these have in common with the best infographics of today? They simplify an abundance of information to reveal the essential narrative of what they are trying to illuminate, while paring away anything else that isn’t necessary. For instance, a choropleth - that’s a fancy term for a colored map - of election results will show political boundaries and important cities, but leave out highways and national parks. But infographics are not just data visualizations. For instance, my favorite example of these principles is Harry Beck’s London Underground Map, first published in 1933, which famously disregards the precise geography of the Underground’s stations in favor of a slightly abstracted grid layout. By discarding the typical surface information people expect from a city map - major streets, notable destinations, neighborhood names - it more clearly shows the important relationships between the Tube’s subway lines and the transfers between them. It gives its users better, more readable information than they could get without it.\n\nIt’s not a surprise that the rest of the world has also started considering the appeal of the infographic. The pie charts and bar graphs once found only nested within newspapers and magazine sidebars are now another tool of any creative director; and the graphic itself is often the attraction - not just the information it conveys. That graphical imperative is at the heart of both Taste and A Visual Guide to Drink, both books of general food knowledge, both of which apply the techniques of infographics to their respective subjects. 
Taste, a British release, is a collaboration between Laura Rowe - editor of a UK food magazine called Crumbs - and illustrator Vicki Turner; it describes itself as a “beginner’s guide to being a foodie,” and is a combination of a food encyclopedia, miscellany, and cookbook. A Visual Guide to Drink is a production of Ben Gibson and Patrick Mulligan, of the Brooklyn design collective Pop Chart Lab, comprising a collection of graphics that distill (ha) all the information you might need about distillation and fermentation (and inebriation), the sort of book that would look at home right next to a wine guide or cocktail recipes. Both promise to bring the power of infographics to food, but only one truly delivers.\n\nFrom the outset, Taste leads with a basic view on infographics, with Rowe describing them, in the introduction, as simply “information presented in a graphic way.” With such an expansive definition, I guess she considers everything in the book - 223 pages of something she describes as an “amuse bouche” to “whet your appetite to learn more and eat more” - to be an infographic. On the surface, it looks like an appealing approach to a broad subject. The graphic design of these pages is certainly excellent, with colors and design evoking a midcentury feel that’s both nostalgic and contemporary. It’s filled with a variety of stylized diagrams - a matrix of pizza toppings! vinaigrettes as pie charts! a sausage solar system! - that are meant to add whimsy to what otherwise might be a dull compendium of facts. Rowe explains that she imagines this as a great book for insomniacs needing something to read in the middle of the night, or something you could pull from your coffee table to quiz your foodie friends. It certainly is amusing, but how well does it inform?\n\nEven allowing Rowe’s simplified definition of an infographic, if we ask plainly how much the graphics on these pages help to convey their information, things start to fall apart. There’s a page of facts about honey where each block of text is set within a stylized lattice of a honeycomb. Take away the images, and the information on the page remains exactly the same. Similarly, a simple list of what every country calls its blood sausages is rendered as a full-page spread of stylized sausage shapes with names and little flags on them.\n\nThese pages are bold and colorful - they would be lovely as posters or spreads within a food magazine - but they’re more beautiful than informative. It can get repetitive after a while, as any meal of only amuse bouches would be. By the time I got to a list of seven outlandishly weird ice cream flavors (bacon and eggs! avocado! breast milk!) illustrated as a scattered collection of stylized scoops with text on their labels, I found myself longing instead for a simple bulleted list without any pictures - the unadorned text of Schott’s Food and Drink Miscellany, for example, which lets the absurdity and charisma of culinary trivia stand on its own.\n\nWhen I browsed specifically for actual infographics, I was struck by Taste’s particular fixation with pie charts, the book’s most frequently used chart for illustrating quantitative data. Everybody knows what a pie chart is; we’re taught about them in grade school and can make them in Excel. The thing about pie charts, though, is that they’re generally terrible. 
They’re good for showing when one element takes up a disproportionate piece of the pie, but that is largely all they can do - try telling the difference at a glance between a tenth and an eighth of a whole.\n\nWorse still, in Taste, they’re often not even pies. A chart that compares the proportions in several types of vinaigrette presents each recipe as a droplet, partitioned into subsections for each ingredient. Similarly, the shapes of handheld pasties are used as outlines for segmented lists of their ingredients. (Okay, pasties are pies, technically, but for the purposes of this chart, they’re the wrong kind.) Pie charts are circular for a reason: to compare the areas of two wedges in the pie, your eye only needs to compare the angles of each slice. But how do you estimate what percentage of the whole is made up by the bottom of a teardrop? I’m sure Rowe is just trying to conjure a rough idea of a data visualization without it meaning anything precise. But is there a point to an infographic that frustrates you when you try to use it as intended?\n\nFar more vexing was another large pie chart that showed how much instant ramen the top ten countries in the world consume: I assume the decision to make it a pie chart was so the resulting figure would look more like a bowl of noodles. Unfortunately, this chart is not only hard to read, it’s actively misleading. The chart neglects to include a wedge labeled “other” to represent the rest of the world’s ramen consumption - and judging from the original data on which the chart is based, it looks like that missing wedge would be the third-largest slice in the pie. On the other hand, it does look like a bowl of ramen!\n\nThis all might seem pedantic, but there are conventions to infographic design, specifically to prevent these types of basic mistakes. Essentially, these conventions are visual recipes that tell a designer how to pick the right format for what they want to represent, and the rules on how best to do that. To many graphic designers, however, the problem with conventions is that they are, well, conventional, and even though it might be clearer to show instant ramen consumption using a bar chart (and indeed, that’s how the data was presented in the original source), the appeal of doing a visual play on the top-down view of a bowl might be too hard to resist.\n\nThe work should be pretty, but it’s far more important that it be correct. Many organizations that regularly rely on infographics have sussed out comprehensive style guides for their visualizations - for instance, the New York Times graphics team has specific rules that dictate when charts should use certain styles, or that require counts overlaid on top of map locations to be indicated either with individual blocks or proportionally-scaled circles, or that maps and charts should use certain color palettes - but independent designers still tend to give in to a desire to cultivate their own distinctive styles, usually by chucking out the conventions that dictate how charts should look. And so, they tweak pie charts to look like soup bowls, or replace boring bar charts with abstract compositions that are, in the end, more beautiful than legible.\n\nSometimes it seems like this may be intentional: Rowe specifically cites as inspiration the work of David McCandless, an infographic maker known (and criticized) for an extremely abstract style devoid of axes or other guides to interpret his charts. But ultimately, many designers are just playing from an entirely different rulebook. 
Do a Google image search for the word “infographic,” and you’d be hard-pressed to find the works of Snow or Minard in the results. Instead, it’s page after page of text and figures crammed into stylized frames, as if Microsoft Excel and a Dr. Bronner’s soap label were thrown into a blender, where research comes down to little more than Googling for interesting facts to fill empty spaces. Taste too often relies on this style.\n\nBut don’t let me tell you the book is all bad. I like it best when it plays to the strengths of its graphic designer: A lushly illustrated spread of what shellfish actually look like is lovely. A hierarchy of egg recipes and a chart of charcuterie complete with hanging meat shapes are both beautiful and nicely informative. I appreciated, too, the diagrams of how soy sauce and tofu are made. But too often I found that Taste’s stated ambition - to be the infographic book on food - far exceeds its grasp, and its constant need to be whimsical ends up being a distraction rather than an advantage.\n\nWhile Taste takes a broad and necessarily shallow look across the entire world of food, Pop Chart Lab’s A Visual Guide to Drink goes deep within one specific area of interest. This narrower focus gives the book an opportunity to drill down: for instance, a two-page spread on the many styles of beer is annotated with the page numbers of other charts in the book that offer more detail on stouts or India pale ales.\n\nWith the exception of a few data visualizations that compare countries by wine consumption or the growth of breweries in the US, there isn’t much focus on quantitative data; instead, this book is more concerned with the how rather than the how much, and where it shines most brightly is in explanatory graphics breaking down how and where the world’s booze is produced, and the etiquette and customs around them. I appreciated a clear explanation of beer terms like “gravity” and “international bittering units,” and found a useful reference in the chart summarizing the guidelines for different types of American whiskeys (my favorite, rye whiskey, must use a mash of at least 51 percent rye). I was particularly enamored with the diagrams illustrating the production process for each type of alcohol; clear and concise, they reminded me of the diagrams within The Works: Anatomy of a City, an excellent book of infographics about NYC infrastructure.\n\nThis book isn’t flawless. Pop Chart Lab is a design house, one that does the bulk of its business selling posters and prints of things like a taxonomy of rap names, illustrations of all the birds of North America, and a massive, 5-foot-long chart of the many varieties of beer. But what works on a wall doesn’t always work on a page.\n\nIn some cases, this manifests as overly complicated network diagrams that lack any clear spatial orientation, resembling mazes or circuit boards, as is the case with their chart showing the genealogies of French grapes. More perversely, cocktail recipes are disassembled into complex circuits of ingredients that are inscrutable when sober and the end of the party when not. I know they can do better: a similar genealogy diagram for hops, 66 pages earlier, is nicely organized as a vertical tree, one whose orientation provides a clear place for the eye to start, and a clear instruction to read downwards.\n\nMore frequently though, these poster-style pieces veer towards abstract compositions with limited or no explanatory text - aesthetically appealing but, like Taste, not as useful as I wanted. 
Let’s go back to that map of the London Underground: imagine it, and then imagine it without any of the station names. You can still discern a pattern and order to the arrangement, but it’s impossible to use for real-life navigation.\n\nA Visual Guide to Drink has several pages dedicated to presenting the compositions of wines from the AOC regions within France’s major wine areas, the kind of material that in a conventional wine reference would be spread over many paragraphs of text. Here, it’s condensed into an attractive grid of wine bottles, each segmented into colored sections to represent the proportions of various grape types in each type of wine. This presentation makes it simple to see that some wines are blends of many grapes, while others are more homogeneous, and it’s designed clearly for that broad informational purpose. But it’s difficult to use the chart to figure out the answers to any more complicated questions - like how much of a Pauillac is merlot, and how it compares compositionally to, say, an Haut-Médoc. This chart frustrated me when I wanted to learn more, but the explanatory labels and text I wanted have no place in its abstract design.\n\nSimilarly, Pop Chart Lab’s maps of wine regions and brewery locations are far too spare in their references, showing only the basic contours of water and land and never any place names. So, if I want to know where those clusters of breweries around Portland are actually located, I have to find a real map and compare the two images side by side. This abstraction might be appealing for a design-minded beer aficionado who wants a bold, stylized map on her wall, but it’s inaccessible for outsiders or beginners. By the time I got to the map of Chile’s wine country, I was completely lost. (At least show me Santiago!) Still, these are relatively minor nits to pick on a book made up of a much more solid collection of hits.\n\nWhy does Drink succeed while Taste struggles? Infographics are hard work, and harder still when you go it alone. Of course, individuals can create great single pieces - think of Snow or Minard, for instance, and their maps of cholera and Napoleon’s march - but to pull off sustained and consistent work to fill a whole book of infographics, you need a seasoned team. That Times style guide for graphics I mentioned? It’s no book, but the collective practice of a large multidisciplinary group that constantly debates among itself and refines its work until it is right. A good-sized group of six experienced designers and researchers worked on Drink, but this seems to be the first foray into the field of infographics for both the author and the illustrator of Taste. With neither personal experience nor seasoned guides on their side - instead of trying to emulate McCandless, they would’ve been better served by the instructional books of Edward Tufte and Alberto Cairo - it’s not surprising Rowe and Turner stumble so often.\n\nHow might they have succeeded? When it comes to data and information, it’s advantageous to drill deep on one or two topics - to use your charts to illustrate many facets of one idea - rather than trying to be an expansive diner-style menu of visualizations on all sorts of subjects. This approach worked well for A Visual Guide to Drink, giving its authors a clear and consistent outline for the book they had to write, instead of blank pages that needed to be filled. 
Another approach is to chart things that are, in fact, measurable - rather than interesting, if random, facts about avocados, imagine charts that focus instead on which countries the avocado in my hand likely came from, how its surge in popularity has changed farming in California and Mexico, how a recall from a single contaminated farm can ripple across many states and products. Our food system is large and complex; where are the infographics to make it understandable?\n\nOf course, after all this, it’s worth pointing out that infographics aren’t entirely new to the food world. We interact with visually presented information on a daily basis - either because it is legally required (we all know what the food pyramid is and how to perform the Heimlich maneuver) or because it’s extremely practical; that chart at the butcher showing the cuts of meat on a steer is an infographic too. What these tremendously successful examples prove is that when done right, an infographic displays perfect harmony between info and graphic: a good one is always going to be significantly more clear and accessible than an equivalent block of text, or an equivalent visual-only representation.\n\nBut none of this is to say that an infographic should stand alone. A Visual Guide to Drink is better than some wine references in some ways, but you still need other books if you want to go deep on the nuances of a particular grape, or the histories of various AOCs. Taste may tell you which countries are most enamored of instant ramen, but it won’t give you a deep look at the cultural and geographic reasons why. Even at their very, very best, infographics provide their user with just a starting point for understanding the world. To that end, these books, in differing degrees, succeed. But you’ll still need the rest of your library to support them; for better or for worse, neither one stands alone on the shelf. And that’s okay - they shouldn’t have to."
        },
        {
          "id": "personal-leaving-nytimes",
          "title": "Leaving the New York Times",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "personal",
          "tags": "",
          "url": "/personal/leaving-nytimes",
          "content": "I’m not a fan of burying the lede, so let me just get straight to the point. I am leaving the New York Times to go work for 18F - a civic-hacking startup of sorts operating within the federal government. It is simply time to try something new. This means I am also leaving the online news business, a source of both inspiration and vexation for an unbelievable 9 years. It’s exciting, but obviously this moment is also tremendously bittersweet.\n\nNYT kremlinologists like to inspect every departure for signs of how it is a terrible loss for the Times and an augury of how the once-venerable institution has lost its way in our modern media landscape. I’m definitely no Amy O’Leary or Aron Pilhofer, but I’ve come to see some truth within this doomsaying. Every departure is a loss, because every one of us leaves a distinctive mark on what the Times has done and continues to do every day of its existence. Nobody here is a mere cog. And yet, nobody is irreplaceable either. Being a 164-year-old newspaper means managing change while also staying the same; it’s the Ship of Theseus, but built with people instead of mere lumber.\n\n\n    \n        \n            \n            Over the years, the ink for my company ID has bled onto the plastic of my wallet\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nAll of which is to say, I know that I had a unique impact on the Times that will never be replicated. I was lucky to be there at the founding of the Interactive Newsroom Technologies team and a part of so many projects big and small that followed. And yet the Times will continue just fine without me, because it is an entire organization of unique talents. I have joked to friends that I probably would not be hired if I applied to work in INT today, but there is some truth to that. In the first few years I worked in a newsroom, it was disappointing to see nothing but the same faces at the NICAR conferences, even if they were all my good friends. I am thrilled at the explosion of talent that has happened in our little field these past few years, and there is no collection of news nerds as indomitable as the crew at the Times. They will continue to produce the same amazing work on deadline the world just expects them to - it looks effortless, but I can assure you that it most definitely is not - and I will continue to be surprised by their talent. That won’t change. I just won’t see the early demos first.\n\nThere are so many anecdotes I could share from my time at the Times, but I want to just pick one from back in 2006. In those days, the digital operations of the company was not integrated with the newspaper; we all reported to the business side and were housed in a building blocks away from the newsroom. But I was a troublemaker, and I wanted to work in news, dammit. So, with Derek Gottfrid as my fellow malcontent, I set up a series of meetings where I got to meet news research, computer-assisted reporting (hi Aron!) and all the other nerdier elements of the newsroom in the hopes that we might one day be able to do programming in the service of journalism. This somehow led to the ultimate cold call: a seat at the table in the kickoff meeting for the Times’ coverage of the 2008 presidential election. Halfway through the meeting, a senior editor whom I won’t name but still works in online news turned to look straight at me and asked\n\n\n  Why are the web people here?\n\n\nI don’t remember what I said exactly. I was both defensive and annoyed, so it was probably a lot. 
But I can tell you the gist of it. We were failing as an organization. The Times reporters were covering only a fraction of the real election. There was a vast hidden world of data we were missing - more than votes and delegate totals - things like campaign finance, ad spending, demographics and polling. The campaigns already knew this. We knew this. But we weren’t fixing it. We had a website, but we were still constrained by a print mentality - that the only form of a story was an article, and there was only so much space we could devote to data in the final product. But the web had changed all of that. We could capture this data, analyze it and let our readers browse it for deeper insight into the stories from the front page. The web people should be here because it was too late to not involve the web people from the start. I sat back down. The meeting moved on. But I never did.\n\nOf course, this all seems painfully obvious today. It was likely already obvious to many of the editors in that room without me having to say it. But it’s what I needed to convince myself that what we were doing mattered as much as any story on A1. We were reporting the news too. I wanted to be a part of that. And less than a year later, the Interactive Newsroom Technologies team was formed and suddenly I was.\n\n[Image: A picture of a laptop and bourbon from the NYT election data desk]\n\nNews can be a drag and news can be a drug. I wish everyone could experience what it is like to be in a newsroom on a general election night. I have enjoyed five of them in my career, and I’ve tried in vain to share on Twitter that buzz of watching history unfold from our vantage point. I remember my first time in 2008. Obama had already given his victory speech hours before, but a few of us remained in the newsroom past 3 AM sipping bourbon and watching the remaining votes trickle into the election loader. A clerk walked by and handed me a copy of that day’s newspaper, straight from the presses in College Point. My memory wants to say it was warm - like bread fresh out of the oven - even though that is absurd. But I remember looking through the newspaper, paging through to find the maps that my code had fed the data to, and thinking I’m here now. Over time, the projects I have worked on will be mothballed, my friends there will move on, and the newspaper will continue making itself anew like it has all these years. No matter. Even when I am forgotten like those unknown creators of the first NYT election map in 1896, my work will survive in the archives somewhere. Nine years is a blip in the history of The New York Times, but I was a part of it. I made my mark. And for that I will be forever grateful."
        },
        {
          "id": "published-thank-you-electionbot",
          "title": "Thank You, Electionbot",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/thank-you-electionbot",
          "content": "Reporting election results is a bit like flying a blimp through windmill country: hours of tedium punctuated by minutes of terror. Generally most of your night filled with the boring hours of waiting for polls to close or watching the remaining votes trickle in after the important races have been called. But in between those spans are usually several important events for key races:\n\n  The polls close in the state (get ready to show votes)\n  First votes are reported (good time to check your results)\n  The race is called for a winner (sometimes 2 winners or a runoff)\n\n\nWhat makes things complicated is that each newsworthy race we might care about might reach these moments at different points during the night. Furthermore, each state will close its polls at different times, and some states will report votes immediately after polls close while others may take a half hour or more. On the night of the 2014 midterm election, there were 9 different poll closing times across all the states and 52 races the New York Times considered especially newsworthy. Those are a lot of balls in the air at once. Previously, the only recourse was to eyeball the loader’s console output as it scrolled past and to send messages to reporters whenever the races they cared about were called. This can mean some stressful interruptions when you are trying to track down a bug in your code. In 2014, it was time for Electionbot to shoulder part of the load.\n\n\n\nAt its core, what we called Electionbot consisted of two separate pieces of code. The first of these was a notifier that would be called by the loader after it completed every load and post messages to a Slack channel where the election team was gathered. This used Slack’s incoming webhooks API to send alerts when an important race was called or a state’s polls had closed. The code for something like this is pretty straightforward but its utility is immense:\n\nclass SlackNotifier\n  def self.notify(load_id)\n    notify_first_votes(load_id)\n    notify_calls(load_id)\n    notify_runoffs(load_id)\n    notify_uncalls(load_id)\n    notify_ap_uncalls(load_id)\n  end\n  \n  def self.notify_calls(load_id)\n    warnings = Warning.called.for_load(load_id)\n\n    if warnings.any?\n      uncontested,contested = warnings.partition {|w| w.race.uncontested? }\n      // uncontested alerts elided\n      \n      if contested.any?\n        important, unimportant = contested.partition {|w| w.nyt_race.important? }\n\n        if important.any?\n          payload = {\n            \"attachments\" =&gt; [{\n            \"fallback\" =&gt; \"CALLS: #{important.map{|w| \"#{w.nyt_race_id}: #{w.ap_candidate.name_with_party}\"}.join(\"; \")}\",\n            \"color\" =&gt; \"warning\",\n            \"pretext\" =&gt; \"RACE CALLS\",\n            \"fields\" =&gt; important.map do |w|\n              {\n                \"title\" =&gt; w.nyt_race_id,\n                \"value\" =&gt; w.ap_candidate.name_with_party,\n                \"short\" =&gt; true\n              }\n            end\n            }]\n          }\n          post_to_slack(payload)\n        end\n      end\n    end\n  end\n\nThe election loader already had a decently sophisticated mechanism for generating warnings about newsworthy changes. All that was necessary was to add these hooks to format and post warnings to Slack. In 2012, I built a system to mail me whenever delegate counts changed. 
In 2012, I built a system to mail me whenever delegate counts changed. Posting to the Slack worked so much better though, since we were all in the channel on election nights already, and any missed notifications would be sent out to me by email anyway.\n\nThe next step was to enable communication with the loader from our Slack channel. I built a minimalist backend written in Sinatra that replied to slash commands triggered in the election channel for some common administrative tasks. For instance, there was a command to report the upcoming poll closing times to the channel to remind us all when to time our bathroom breaks.\n\nAnother command toggled certain races as important, so that the notifier would tell us when they had their first votes or were called. Again, the code was pretty straightforward:\n\ndef exec\n  check_auth\n  check_channel_name\n\n  case params[\"text\"]\n  when /^poll[\\s_]closings/\n    report_poll_closings\n  when /^important\\s?(.*)$/\n    important_races($1)\n  when /^load/\n    load_status\n  when /^uncalled/\n    uncalled\n  else\n    render :text => help_text\n  end\nend\n\ndef important_races(arg_str)\n  arg_str = arg_str.strip\n  payload = nil\n\n  if arg_str.blank?\n    # with no arguments, just report which races are currently marked important\n    races = NytRace.upcoming.important.all\n\n    if races.any?\n      payload = {\"text\" => \"Current important races: #{races.map {|x| \"`#{x.id}`\"}.join(\",\")}\"}\n    else\n      payload = {\"text\" => \"No current races marked as important\"}\n    end\n\n    post_to_slack(@channel, payload)\n  elsif arg_str =~ /(on|off) (.+)$/\n    verb = $1\n\n    race_ids = $2.split(/,/)\n    race_ids.each do |id|\n      race = NytRace.find(id)\n\n      if verb == \"on\"\n        race.update_attribute(:important, true)\n      elsif verb == \"off\"\n        race.update_attribute(:important, false)\n      end\n    end\n\n    payload = {\"text\" => \"Setting *important* to *#{verb}* for #{race_ids.map {|x| \"`#{x}`\"}.join(\",\")}\"}\n    post_to_slack(@channel, payload)\n  end\n\n  render :text => '', :status => 200\nend\n\n
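In use, it looked something like this - the slash-command name and race IDs below are made up for illustration (ours were internal), but the reply format comes straight from the code above:\n\n/electionbot important on ky-senate-2014,ga-senate-2014\n  Setting *important* to *on* for `ky-senate-2014`,`ga-senate-2014`\n\n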
With these two components, we theoretically could’ve replaced much more of the election loader’s admin interface with interactive commands, but I was too nervous to allow users to call races directly from Slack. All requests to and from Slack include a security token you can check to eliminate basic spoofing, but they still are going over the public internet between Slack’s servers and ours (even if within HTTPS), and I’d rather not explain man-in-the-middle attacks to an executive editor on an election night. So, we kept its capabilities simple on purpose.\n\nStill, I can’t overstate how great it was to have Electionbot with us in the Slack. It wasn’t particularly advanced as bots might go, being just a simple interface into a much more complicated realm of code. Yet I began to think of it like another coworker, always on the lookout for problems we should know about. During a late-night primary from home, I’d feel comfortable leaving my laptop downstairs to check on the sleeping children, because I knew Electionbot would tell me if anything was going wrong. And sometimes I even ran some election night commands to make a state’s results visible from my phone just because I could.\n\nThe best moments were when Electionbot transcended a mere shell script and informed us all of an uncalled race we probably wouldn’t have noticed otherwise. Even though I knew better, I found myself reflexively thanking it in the chat for the save. We form bonds with even the simplest of tools, and Electionbot was there with me on every night there were votes being tabulated somewhere in America. I know it’s just a dumb framework of Ruby code, but still I have to say it.\n\nThank you, electionbot!"
        },
        {
          "id": "published-consider-boolean",
          "title": "Consider the Boolean",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/consider-boolean",
          "content": "I generally prefer to write about big picture subjects for my Learning pieces at Source. But today, let’s start from something small that illuminates the way even simple choices affect what we can represent and the stories we can tell.\n\nLet’s talk about the most basic datatype we often build our databases from: Boolean fields. Deeply familiar to programmers, the concept of Boolean logic is often seen as esoteric by people who don’t program for a living and who aren’t set theorists. I’ve confirmed this with many of my friends. I used to find that extremely mystifying; after all, the basic nature of Boolean algebra is pretty simple. A Boolean variable can only have one of two values-true or false-and all operations on them can only result in true or false as well. For instance, a and b is true only if both a and b are true, while a or b is true if either a or b is true. This matches how we understand the words and and or to mean, so that’s easy enough, but then it gets complicated. For instance, what spoken language has an intuitive equivalent to a xor b, which is true only when a is true and b isn’t or vice versa? And every language quickly gets confusing if you attempt to describe the types of nested conditionals we use in our code where we want to execute a loop if and only if a is true and b or c is not true if a is a String and b is a…\n\nUltimately though, I think the confusing thing about Boolean logic to most people is its strict precision in a world that is anything but. If I asked you “Are you interested in this essay or not?” and you answered “Yes,” that response is genuinely annoying, even though that is technically always the correct answer according to Boolean algebra. Ultimately, what things in this world are absolutely and precisely true? Not as many as we might think. This is a journalism tutorial and not a philosophical treatise, but the point still stands. As programmers, we often use Boolean values to represent conditional elements in our databases, but sometimes the ways we use them obscure and confuse the nuances of reality.\n\nThe Prisoners of Zenda\nIt’s far easier to explain what I’m talking about through an example. So, let’s imagine we are creating a database to track the status of political prisoners confined to the prison of Zenda in the imaginary country of Ruritania. This hypothetical example got dark really fast, but reality is often darker still; welcome to the world of journalism.\n\nThere are no open-records laws in Ruritania, so all the data on detainees must be pieced together in a database by our own researchers. We need to design the schema for them to enter the relevant data about each prisoner as they discover it. So, we start by figuring out some basic fields for the prisoners table. A plausible first cut might look like this:\n\nprisoner_id varchar(255),\nname varchar(255),\nbirth_date date,\nhigh_value bool,\nheld bool,\nconvicted bool,\nreleased bool,\nnotes text\n...\n\nWe usually start the modeling process by figuring out the important information we might want to track about our subjects. In many cases, those are simple yes/no questions, meaning we can represent them with boolean type fields in our database. It’s easy to just define a bunch of Boolean fields like this in our schema, but it’s also easy to make mistakes. For instance, we have inadvertently created two columns held and released that are just two inverted ways of representing the same thing. What does it mean if both are checked? Or neither? 
Of course, we do have the option of disallowing NULLs in our database’s Boolean fields. Suppose we decide to be a bit more formal about a prisoner’s status and declare it can be only one of these possible states:\n\n  held\n  released\n  approved_for_release\n  charged\n  convicted\n  died_in_custody (yep, this example is still dark)\n  unknown\n\n\nThese categories are mutually exclusive. We assume that any prisoner’s status can only be set to one of these categories. So, we decide to implement this as a collection of Boolean values.\n\nreleased bool NOT NULL,\napproved_for_release bool NOT NULL,\ncharged bool NOT NULL,\nconvicted bool NOT NULL,\ndied_in_custody bool NOT NULL,\nunknown bool NOT NULL\n\nWe’ve eliminated the potential problems with null values by not allowing them at all. And we might feel good that we’ve sidestepped the “held”-“released” confusion by making “held” the default state if none of these Booleans are checked. Yet by adding more Booleans, we’ve just made possible errors even more likely. There are still problems where a researcher might accidentally check two checkboxes in an admin. There might also be well-intentioned accidents; imagine a later developer maintaining this code who didn’t realize these fields were supposed to be mutually exclusive - so when an inmate is convicted, they leave charged set to true, because the inmate was obviously charged before they were convicted. Suddenly, the application is crashing and nobody knows why.\n\nUltimately, it makes much more sense to just create a single string field named something like prisoner_status and set its value to only one of a few specific keyword values like held, released, or charged. This not only clarifies your intentions for these categories but makes it impossible to create conflicting states in the database, provided that you ensure the string values stay correct. This might seem obvious, and yet I’ve seen so much code that uses a collection of connected Booleans instead. 
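As a minimal sketch, the single-column approach might look like this; the CHECK constraint is one way to keep the keyword list honest, though it’s an illustration rather than a requirement (some databases, including older versions of MySQL, accept CHECK constraints without enforcing them, in which case your application’s validations have to do the policing):\n\nprisoner_status varchar(255) NOT NULL DEFAULT 'held',\nCHECK (prisoner_status IN ('held', 'released', 'approved_for_release',\n                           'charged', 'convicted', 'died_in_custody', 'unknown'))\n\n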
Trying to coordinate a passel of checkboxes so that only one of them is checked seems like an exercise in futility, but it’s one that programmers perform again and again.\n\nBeing and Change\nAs a native English speaker, I am slightly envious that Spanish includes two distinct verbs for “to be”: estar and ser. The first of these is used to express transitory conditions (“I am hungry”); the second is for describing essential and unchanging qualities (“I am human”). In English, we use the same verb for both, even if in some cases it might not always be clear just how transitory or essential the condition is (“I am curious”).\n\nI sometimes wonder if Spanish speakers employ a similarly clear naming distinction in their database schemas, because we English speakers generally make a complete mess of it. Sometimes, we use Booleans to represent essential constants; sometimes we use them to represent the current status of a changeable situation. But most of the time, we aren’t entirely sure which of these we want our Boolean fields to be.\n\nFor instance, suppose we decide to be less rigorous about defining a prisoner’s status. In this case, we just have a Boolean field named charged that is set to true when a prisoner is charged with a crime. This sounds pretty straightforward, but what happens if those charges are dropped? Presumably we would just set charged=false for that prisoner since they are no longer currently charged with any crimes. This is correct, but it also means that it’s impossible for our database to distinguish whether a prisoner was charged with a crime and later cleared or was never charged at all. We can’t use a single Boolean to represent both what is currently true and what was once true. One way to do this might be to add a charges_dropped field that is set to true for prisoners for whom that is the case. But if we are interested in properly tracking the history of a prisoner’s case, it might make more sense to add some additional metadata fields like date_charges_dropped or dismissal_type. Which is how we soon end up with 50 or 60 columns in our database table, each saving a date and other metadata linked to our various Boolean fields. It seems that we have both cases covered, but what if something unexpected happens in the future? For instance, what if an inmate is charged with a crime, those charges are dropped and then they are later charged with another crime? Ruritanian law is not always predictable. So what do we do then? We could set both charged and charges_dropped to true, but doesn’t that look more like a possible bug than a valid outcome?\n\nThe problem here is that we’ve confused ser and estar. When we are first defining our schema, we’re often not sure if any specific Boolean field means that something is currently true or simply that it was true at some point, which is a pretty important distinction. Admittedly, this is not the fault of the Boolean datatype, but rather of how poorly we describe the data we want to collect (imagine if, instead of charged, we had named the field currently_charged or was_charged). Ultimately though, we should not distill important events in a prisoner’s life into simple true/false conditionals. 
A far better approach would be to create an auxiliary table that’s joined to the prisoners table:\n\nCREATE TABLE events (\n    id int(11) NOT NULL AUTO_INCREMENT,\n    prisoner_id int(11) NOT NULL,\n    event_type varchar(255) NOT NULL,\n    event_date date NOT NULL,\n    metadata text,\n    PRIMARY KEY (id)\n)\n\nA prisoner would have many events associated with them. Here, the event_type is limited to a set of keywords like held, charged or released, defined and enforced by our code. Then, we can record the history we have for any inmate as a series of events rather than a muddle of ambiguous Booleans. To find all the prisoners who were ever charged with a crime, we can join against this table on the charged event_type. We will have no problem representing the hapless prisoner who was charged, then cleared, then charged again, since we can use three event records to represent that. To figure out the current status of any inmate, we can simply look at the most recent event in their timeline. To store additional metadata about specific events, we could either save arbitrary JSON metadata as a text field (if we do not need to search any of it in the database) or use single table inheritance. Using a separate events table would also simplify our main prisoners table by eliminating the need for multiple redundant columns like release_date, charged_date, conviction_date, etc. 
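In SQL, those two lookups might sketch out like this (illustrative queries against the hypothetical tables above, not pulled from a real project):\n\n-- every prisoner who was ever charged, no matter what happened afterwards\nSELECT DISTINCT p.*\nFROM prisoners p\nJOIN events e ON e.prisoner_id = p.prisoner_id\nWHERE e.event_type = 'charged';\n\n-- one prisoner's current status (42 is an illustrative id):\n-- the most recent event in their timeline\nSELECT event_type\nFROM events\nWHERE prisoner_id = 42\nORDER BY event_date DESC\nLIMIT 1;\n\n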
This approach is not effortless; for instance, it’s a lot more work to create a web admin for editing an arbitrary number of events than it is to just add a bunch of checkboxes to a single detainee record. And to keep our SQL queries simple, we might still want to define Booleans that express the current status of a prisoner. We should just be clear in naming those Booleans something like currently_charged. And we should only compute them from the prisoner’s timeline, rather than allowing them to be edited manually in an admin, to make sure they are never out of sync with what the events record.\n\nTo Boolean or Not to Boolean\nFor such a simple datatype, Booleans enable a lot of complicated confusion. These types of errors are not hypothetical; I’ve changed the specifics, but every single one of the problematic schema designs I’ve presented in this piece was taken from real databases I have worked with and sometimes created. The problem usually lies not in Boolean values, but in us, for implicitly assuming their strict true-false definitions are enough to depict an often ambiguous reality. For instance, we might expect to know definitively if a prisoner has been charged with a crime or not, but we’ve seen there are sometimes compelling reasons why we might need to record that as a null instead. Nor are these problems specific to just Booleans; imagine trying to record the birthday of a person in a full date field when you only know his birth year. Do you pick an arbitrary day of that year, and if so, how do you distinguish this limited information from the dates you know exactly? Do you not record it at all? Or do you redesign your database so every date is represented with three separate day, month, and year fields?\n\nUnfortunately, the revision histories of many news-related database schemas reveal a similar unraveling of the ideal view of the data in the face of the murky reality that data is trying to describe. Worse still, this process is usually unavoidable; in many cases, you need to build the database first to realize all of the assumptions you built it on. Hindsight is a harsh data architect. Still, it seems that we should be able to design better in advance for the problems we expect to see in our data. First, before we start creating database tables, we should stop and take a moment to contemplate what might not be as simple to represent as we think it should be. And we should start sharing the best ways to represent a complicated world.\n\nThere are already sites that cover design patterns in code and basic DB schema designs, but it would be interesting to document the various techniques we might use to represent murky data succinctly and resiliently against errors. Once we truly consider the Boolean (or the datetime or the string), we can design our databases to be a little more adaptable to all the weirdness this world can throw at them."
        },
        {
          "id": "published-connecting-with-dots",
          "title": "Connecting with the Dots",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/connecting-with-dots",
          "content": "One of my favorite movies is the classic 1949 thriller “The Third Man.” The story is about a writer who arrives in gloomy post-war Vienna on the promise of a job only to instead unravel a criminal conspiracy to peddle diluted-and thus ineffective-antibiotics. In a pivotal scene during a clandestine meeting on the top of a Ferris wheel, the hero confronts a duplicitous friend about his lack of conscience and angrily asks if he has ever seen one of the victims of the tainted medicine he sells. Mr. Duplicity offers this cynical reply while looking down on the amusement park below:\n\n\n  You know, I never feel comfortable on these sort of things. Victims? Don’t be melodramatic. Look down there. Tell me. Would you really feel any pity if one of those dots stopped moving forever? If I offered you twenty thousand pounds for every dot that stopped, would you really, old man, tell me to keep my money, or would you calculate how many dots you could afford to spare?\n\n\nFrom a distance, it’s easy to forget the dots are people. That’s the dark lesson of the movie and also of course the war that wrecked all of Europe. But what does this have to do with journalism? After all, unlike the villain in this film, we are not amoral monsters. I’ll answer that by telling a story of my own. When the New York Times reported the Wikileaks war logs, it seemed like we finally had a chance to better map and quantify the massive sectarian cleansing that swept across Baghdad in the wake of U.S. occupation. Times reporters had been witnesses to the violence raging around them, but the only quantitative analysis available was done by a few international aid organizations trying their best to put numbers to the scale of the slaughter. The war logs included every report that individual units filed on civilian deaths, with locations and the number of the dead. The data was far from perfect-there were many duplicates and omissions and obviously we had no information about the methodology by which these cases were reported-but it could at least provide a rough overview of the neighborhoods most affected by the violence and the trends of accelerating violence from year to year. Below is the final version of the graphic we produced, representing both the deaths recorded on a single day and the trends from year to year.\n\n\nA section of the New York Times interactive A Deadly Day in Baghdad\n\nUltimately, I think the graphic produced by the Times did an excellent job of reminding readers about the human costs of the violence. By making the focus of the chart a single day of violence, we could cross-check the data and provide some context for where the worst violence occurred. It also served to anchor the swelling violence shown in the smaller annual charts below into a neighborhood context. In a similar vein, the Guardian produced their own visualization of the violence that humanized the data by profiling in detail the violence of a single day.\n\nBefore it was a final graphic though, it was a demo piece I hastily hacked into Google Earth using its KML format. I remember feeling pretty proud of myself at how cool even a crude rendering like this looked, and the detailed work I had done to pull out all the data within reports to see these dots surge and wane as I dragged the slider. Then I remembered that each of those data points was a life snuffed out, and I suddenly felt ashamed of my pride in my programming chops. 
As data journalists, we often prefer the “20,000 foot view,” placing points on a map or trends on a chart. And so we grapple with the problems such a perspective creates for us and our readers-and from a distance, it’s easy to forget the dots are people. If I lose sight of that while I am making the map, how can I expect my readers to see it in the final product?\n\nAll of this has made me wonder what other approaches people have used to anchor their graphics in empathy. I investigated a few techniques that data journalists have used to connect readers with the dots. These aren’t just specific to tragedies like war and disaster; they matter for any dataset that reports on people or affects people (i.e., pretty much every dataset).\n\nNear and Far\nThese graphics illustrate a common and successful technique for bringing the reader back down to earth by focusing on a smaller range of data. Scott Klein of ProPublica took inspiration from Sesame Street and declared that many of the best news applications contain both the near and the far. For instance, a look at school test scores should both show system-wide trends and let readers look at how their local schools are doing. Or, in the case of Baghdad’s dead, we focused on a single day to show the near of what years of violence looked like day after day.\n\nUnfortunately, many data journalism examples focus exclusively on the far distance and leave out the near view. Usually, we cede the foreground to traditional narrative treatments crafted by “traditional” journalists. I’m tempted to blame this on false newsroom dichotomies treating stories and interactives as unrelated forms of content. But too often, it’s likely that our own laziness is at fault. As tools have improved, it has become phenomenally easy to put a bunch of dots on a map or in a chart, yet the legwork of understanding the “near” of that data remains just as time-consuming. Under deadline pressure, it’s easier to just plot the map and call it a day. But we lose something in the process.\n\nTo illustrate what I mean, here is a similar chart posted to a clickbait Twitter account called @BrilliantMaps of all the car bombings in Baghdad since 2003. It wasn’t originally clear who made this map-lack of attribution is common for these kinds of accounts-but the contrast between this map and the Times and Guardian interactives mentioned above is glaring. The problem is that this map is not only wrong, it’s also terrible. Gawker figured out the origins of this map and discovered that it was actually derived from Guardian data of all fatalities in Baghdad from 2003 to 2009, including accidents, so it exaggerated the data. Brilliant Maps later issued a correction, but still got it wrong. I’m beginning to think clickbait Twitter accounts aren’t entirely reliable.\n\nWrong as it is, the map doesn’t fail to startle; [the replies and retweets are filled with tweets shocked at the overall picture of violence](https://xcancel.com/BrilliantMaps/status/523779391021527040). The problem is that once you get past the original shock of the image, there is nothing else to learn. Are the clusters random or significant to the underlying geography of Baghdad’s neighborhoods? Did the violence surge and wane or has it maintained a constant level of carnage from one year to the next? Neither of these questions is hard to explore, and the lack of such context means the reader can only gasp in doge-like awe (“Wow! Such dots. 
Very violence.”) and walk away with the general impression that Baghdad is a horribly dangerous place, a conclusion that is definitely nowhere near as true today as it was during the heights of bloodshed in 2006. It’s not a brilliant map. When the map is barely distinguishable from a Clickhole parody, that’s a clear sign it’s actually a terrible map. What would make it better? Finding the near is one approach. What are some others?\n\nPutting People First\nOne possibility: if your data is about people, make it extremely clear who they are or were. Here is a crude example of that approach from the New York Times in an interactive reporting military deaths in Afghanistan and Iraq called Faces of the Dead. The pixels of the chart default to being a canvas for showing the picture of a fallen service member, emphasizing the human cost of war more forcefully than the chart view of this data does. This is effective, yet it feels a bit clumsy, possibly because we can only see one face in the crowd at a time.\n\nIn such circumstances, it often makes more sense to abandon the chart entirely and just report the details of each person that matters. Please read the excellent Source article What If the Data Visualization is Actually People? for one such example where it was better to report on the people than the data in the story. The best example of this approach is the site Homicide Watch, which tracks every homicide in Washington, D.C. This is a dataset for which it’s very easy to lead off with a big map, which is why it’s notable that Homicide Watch chooses not to. Instead, their homepage is filled with pictures of the latest homicide victims, because remembering every victim means not first presenting each one as another dot on a map.\n\nWee People and When to Use Them\nOften, it is enough to just suggest the human form as a reminder. Take this lovely sports graphic from the Guardian on the heights of college basketball players. Rather than showing photos of all the players, it uses scaled silhouettes, which is more effective. A quick scan of the 6’0 tall players on several teams confirms these are not the actual shadows of the players named-that information would be difficult to get and would not really add much to the graphic. Of course, this chart could be implemented using standard bar chart boxes instead, but the use of little figures adds something quirky and human to the data presentation.\n\n\nCaption: A section of the Guardian’s interactive “March Madness: do the tallest teams always win the NCAA championship?”\n\nIn another example, this Washington Post graphic on the racial demographics of death penalty punishments and the victims of their crimes uses a lot of little people to show the growth and ebb of capital punishment in the U.S. since it was legalized again in 1976. Given the high-profile nature of these cases, it would not be difficult for the newspaper to get more information on each case, but the point of this graphic is just to illustrate the sheer number of executions that have happened in each state and the ways in which they have been skewed towards particular races, states, and genders. Any additional detail beyond that would obscure the forest for the trees. And yet, I find it a little frustrating when I want to compare two years against each other. 
Even when the rows don’t wrap around to a second line, the staggered nature of the wee people shapes makes it harder to compare rows against each other than a vertical bar chart would.\n\n\nCaption: Section of the Washington Post interactive “An Eye for an Eye?”\n\nIs there a point where using wee people in your graphics is overkill? Well, yes, but what’s a good rule for when to revert to more traditional means of representing people in graphics? I would argue that once you get above a certain threshold of data points, or you want to make it easier to visually compare two amounts over time, it makes more sense to use dots or blocks. For instance, this Washington Post interactive that compares the infectiousness of Ebola to other diseases works well because it’s easy to compare the simulated outcomes of each disease with each other. In another example, the moving timeline at the top of the Guantanamo Docket works better for comparing totals between countries because it uses blocks instead of people (notice you can hover over any block to see who that person is). Furthermore, I would argue that it is not effective to use wee people in any circumstance where a single depicted person does not equal a single actual person. For instance, here is a chart from the New York Times where each figure on the chart represents a million people; at that point it makes more sense to just use abstract blocks. Or a combined format like this later piece.\n\nAnd of course, sometimes it’s necessary to remove the dots entirely when they interfere with the story you need to tell. Here, a more recent Times graphic of the sectarian purging in Baghdad does not overlay violent incidents because the dots would obscure the sorting effects of the violence. The supporting text makes it clear though that these changes did not happen without coercion and violence.\n\nEmpathic Design\nThese anecdotal examples illustrate a few of the various means by which interactives can evoke empathy in their readers. But I’m curious if anybody has attempted to systematically explore when certain approaches make sense and when they are a distraction. Google suggests no one has, but of course this is not a very SEO-friendly concept. Still, we have guides to inform us when to use a bar chart vs. a scatter plot vs. a pie chart. Would a similar guided approach work for interactives we feel are too emotionless? Is there a collection of graphics design patterns for empathy we can draw from or is this something we can assemble on our own?\n\nUltimately though, the main question is this: should we even try with our graphics to make readers care? The Devil’s Advocate would argue that it’s not the responsibility of our interactives to make people feel something about a topic-that is usually handled by a narrative piece paired with them-but I feel that in these days where charts may be tweeted, reblogged, and aggregated out of context, you must assume your graphic will stand alone. Neither of these arguments considers what the reader actually expects. What does the reader expect to feel from journalism and how can we learn from their experiences?"
        },
        {
          "id": "published-wave-pr-data",
          "title": "Prediction for 2015 - A Wave of P.R. Data",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/wave-pr-data",
          "content": "Nobody can say exactly when the trend first started, but in 2014 we saw the first major outbreaks of bogus data distributed by private companies just so it would go viral online. Among the many exciting thing we’ve learned this year are:&lt;\n\n\n  Democrats watch more pornography than Republicans, according to Pornhub.\n  Mexicans and Nigerians are the best at sex, as polled by condom manufacturer Durex.\n  The nation’s most stressed zipcodes include one near you, as reported by real estate blog Movoto.\n  Washington residents complain about rats more than New Yorkers, as reported by Orkin.\n  People sometimes use car-share services after hooking up, thanks to some creepy oversharing from Uber.\n\n\nTo be blunt, all of these stories were unredeemably awful, riddled with errors and faulty assumptions. But accuracy wasn’t the point. All of these examples of “data journalism” were generated by companies looking for coverage from online news organizations. The goal is a viral feedback loop, where the story is reaggregated by others, the site surges in its organic search rankings, and the study is tweeted for days even by haters like myself. For these purposes, they were perfectly designed to exploit the nature of modern news distribution online.\n\nThe old adage of “fast, cheap, good - pick two” often used about software development also applies to news, where good is not just a function of your current work but your established reputation. So many news organizations on the web start at a disadvantage, with their only option to put as many fast and cheap stories out there to hit their monthly traffic targets. And everybody knows that posts that feature several key charts or 40 maps that explain something tend to do pretty well in traffic.\n\nBut it takes time to gather the data yourself - so it’s much better if someone provides it to you. Which is how we got here. It’s not unusual for news organizations to source data from private companies much like they would from government agencies or scientific agencies. For instance, The New York Times sources data from a company that tracks executive compensation to report on trends in CEO pay every year.\n\nBut the PR-driven data stories I listed above come from an opposite direction to traditional data journalism. This is not data that is collected and analyzed in response to specific questions and whose quality is checked before publication, but prebuilt charts pushed to news organizations like press releases and targeted against specific topics like sex, anxiety, and shame that are more likely to elicit clicks. If you’re a company looking for press, why not use those fancy data scientists you hired to also generate some free publicity outside the company? And if you’re a reporter at a news startup who needs to constantly fill the news hole with new material, why wouldn’t you run one of these? Everybody’s happy, even if the data isn’t right.\n\nAnd in 2015, it will only get worse - because I’d bet the big PR firms have noticed the success of some of these smaller efforts and will try their hands at this new form of marketing. Don’t be surprised when Kraft creates a map of which states consume the most macaroni and cheese, or Starbucks releases charts showing how pumpkin spice-related products lift the American economy each fall. The wave of bullshit data is rising, and now it’s our turn to figure out how not to get swept away. Maybe Snopes sells life rafts?"
        },
        {
          "id": "published-distrust-your-data",
          "title": "Distrust Your Data",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/distrust-your-data",
          "content": "With the launching of 538, Vox and the New York Times’ Upshot, it seems like the age of data journalism is finally here, greeted with both acclaim and concern by media critics. But data journalism is not a new thing. These new sites are just the latest iteration of news applications, which were an iteration of computer-assisted reporting, which was an iteration of precision journalism, all of which are just names for specific techniques and approaches used in the service of reporting the truth and finding the story. In other words, it’s journalism that starts from interrogating the data-and applies the same skepticism and rigor that we apply to the testimony of an expert contacted by traditional phone-assisted reporting.\n\n\n\nAll of which is to say that data journalism inherits a long tradition of journalists working with data, and that comes with the heavy responsibility to get it right. Specifically, to paraphase something I heard at a NICAR conference once: fear and paranoia are the best friends a data journalist can have. I think about this often when I work with data, because I am terrified about making a dumb mistake. The public has only a limited tolerance for fast-and-loose data journalism and we can’t keep fucking it up.\n\nCritique is always annoying when it’s expressed in indefinite terms. So, I’m going to do something I don’t normally like to do and pick a recent example of a data journalism story gone wrong. This is not to scold those who reported it-indeed, I’m well aware of how easy it is for me to make similar mistakes-but because a specific example provides an explicit illustration of how reporting on data can go wrong and what we can learn from it. And so, let’s begin by talking about porn.\n\nSpecifically, a story about online pornography consumption in “red” vs. “blue” states that exploded onto social media a few weeks back. I first noticed it because of a story on Vox that reaggregated an Andrew Sullivan post which in turn reposted a chart made by Christopher Ingraham of the data provided by Pornhub for their study. That chain of links reflects how news spreads online these days, and yet none of those professional eyes caught some glaring flaws in the data.\n\nBefore I continue, here’s a brief summary of the findings presented by Pornhub’s data scientists. Pornhub (which is apparently the third most-popular pornography site on the Internet) was approached by Buzzfeed (which is probably the most-popular animated GIF distributor on the Internet) to analyze its traffic and determine whether “blue” states that voted for Obama in the last election consumed more pornography than “red” states that voted for Romney. And so, that’s what the statisticians at Pornhub did, pulling IP addresses from their website’s traffic logs, geocoding their likely locations and deriving a figure of total traffic for each state. They then divided the total hits from each state by that state’s population to derive a hits-per-capita number for each state. 
As a result, they were able to report per-capita averages for each state and found that blue states averaged slightly more hits per capita than red states.\n\nHow To Confuse Yourself With Statistics\nUnfortunately, the study and the subsequent reporting derived from the Pornhub data serve as a vivid example of six ways to make mistakes with statistics:\n\n\n  Sloppy proxies\n  Dichotomizing\n  Correlation does not equal causation\n  Ecological inference\n  Geocoding\n  Data naivete\n\n\nThe first issues begin with the selection of the proxy. In statistics, a proxy is a variable that is used when it’s impossible to measure something directly - for instance, using per-capita GDP as a measure of standard of living. Buzzfeed titled its article about the Pornhub study “Who Watches More Porn: Republicans Or Democrats?”. Let’s assume that’s the question that Buzzfeed wanted to ask. How would they do it? In an ideal world, they could ask every single Democrat and Republican in the country about their porn-watching preferences, but this is obviously infeasible. So, the next best thing after that would be to conduct a survey of a randomly selected group of individuals that shares similar characteristics with the national population. But that takes time and money and math, so instead Buzzfeed turned to their friends at Pornhub to derive an answer using the data they had on hand.\n\nIn this case, they used page requests to the third most-popular online porn site as a proxy for all pornography consumption and the percentage of the people who voted for Obama or Romney as proxies for registered Democrats and Republicans. These proxies are not the same as the things they stand for, so distortion is inevitable. For instance, maybe in some states, people widely prefer to get their pornography via on-demand cable or sketchy video stores, so they would be undercounted in the Pornhub figures. Similarly, this study uses total pageviews as a proxy for site users; the two are not necessarily the same and it’s unclear whether increased pageviews mean a corresponding linear increase in users. In addition, given that a large number of Americans identify themselves as independents, is it accurate to classify those voters as red or blue depending on a single election? Proxies give us a means to derive answers, but they may not always be appropriate for the questions being asked.\n\nThe problems continue from there. For their analysis, Pornhub sorted states into red and blue ones. This seems like it makes sense, but they’ve flattened a continuous variable (the percentage of the state population that voted for Obama) into a binary condition (Romney wins / Obama wins). It’s likely this dichotomizing had a palpable effect, since it makes a battleground state like Virginia seem closer to a Democratic stalwart like Vermont than to its ideological “red state” neighbors in the South. Fortunately, some statisticians identified and corrected for this issue, producing a more accurate scatter plot of the states vs. their vote share for Obama. The result: a correlation in which porn consumption accounted for about 16% of the variance in a state’s vote percentage for Obama. Success!\n\nBut wait. Here we stumble into two of the most classic mistakes people make with statistics. First, correlation does not equal causation. You’ve probably heard that a hundred times before, but here is an actual illustration of why it matters. It’s entirely possible that the suggested relationship between the two variables is a total coincidence.
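\n\nTo get a feel for how weak a 16%-of-variance relationship is, and for what dichotomizing throws away, here is a toy simulation in Python; the numbers are invented for illustration, not Pornhub’s:\n\n\n  import numpy as np\n  from scipy import stats\n  rng = np.random.default_rng(538)\n  # 50 imaginary states: a continuous vote share, weakly linked to porn traffic\n  obama_share = rng.uniform(0.35, 0.65, size=50)\n  hits_per_capita = 100 + 40 * obama_share + rng.normal(0, 8, size=50)\n  r, _ = stats.pearsonr(obama_share, hits_per_capita)\n  print('r =', round(r, 2), 'r-squared =', round(r * r, 2))  # roughly 0.16 by construction\n  # dichotomizing flattens that spread: a 50.1% state now counts like a 65% one\n  blue = hits_per_capita[obama_share > 0.5].mean()\n  red = hits_per_capita[obama_share <= 0.5].mean()\n  print('blue mean =', round(blue, 1), 'red mean =', round(red, 1))\n\n\nA correlation that faint leaves plenty of room for the boring explanations.\n\n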
Far more likely though is that the variables are related but only through a confounding variable that connects the two. For instance, blue states might have greater broadband penetration, which would favor Internet porn. Or it could be that people in urban areas consume more Internet porn and states with more urban areas also trend Democratic. Confounding variables are common, and this piece by Jonathan Stray contains a solid overview of them and other spurious correlations. Or if you’d prefer a sarcastic look, here are correlations of voting to herpes infection or Nickelback listening. Putting it bluntly, these red state-blue state comparisons are statistical fluff, often reflecting the whimsy of the reporter more than anything real.\n\nBut what is the second mistake? For the sake of argument, let’s assume that we’ve avoided all these other problems above. Let’s decide that Internet porn is a valid proxy for all pornography, that voting for a specific candidate in the last presidential election is a valid measure of party affiliation, and that the correlation is not due to any hidden variables. Then we can definitively say that Democrats consume more porn than Republicans, right? Wrong. Meet the ecological inference fallacy. In short, just because you’ve derived some average measure about a group that contains more of a subpopulation, that doesn’t necessarily mean it’s true for individuals in that group, especially when the difference is so slight. It’s possible that Democrats really do consume more porn and that’s what makes for the higher numbers per capita in blue states. But it could also be that Republicans in Democrat-dominated states consume more porn than those in Republican-dominated ones do, and that is what is pushing up the average. Or it could be that urban areas often consume more pornography and also tend to contain more Democrats but the two aren’t directly connected. We simply don’t have enough insight into the individual population to say.\n\nAnd we definitely don’t have any insight into specific people based on these broad statistics. Knowing that your neighbor is a Republican or a Democrat tells you nothing about their porn consumption, regardless of the averages derived for each population.\n\nWe’re Not in Kansas Anymore\nUnfortunately, the worst error was yet to come. A lot of the early reporting on this study noticed a bizarre anomaly in the data: Kansas, a very red state, consumed an extremely high amount of porn per capita compared to the average for all other states. This is readily apparent when the numbers are graphed in a simple bar chart, but it really jumps out when the states are plotted on a scatterplot of Obama vote share vs. page hits. If you assumed, as Pornhub did, that average porn consumption was normally distributed across all states, Kansas’ average was highly unlikely. At more than 2.95 standard deviations above the average, there would be a 0.16% chance of that occurring if it were truly random. An extreme outlier like this should make you sit up and take notice as a data journalist, because it can only mean one of two things. Either you’ve really found an extreme case that reveals something bizarre and newsworthy. 
Or-as one reader of Andrew Sullivan’s website figured out while all the journalists shrugged their shoulders-the data is flawed.\n\nPornhub’s writeup omitted any explicit description of their methodology - this is never a good sign - but it seems to have involved mapping the IP addresses from which users visited the site to physical addresses and reverse geocoding those to get states. The statisticians at Pornhub (and the journalists who confidently reported their findings) assumed this was a clean process, but any programmer with experience can tell you the bitter truth: geocoding is often rubbish. What happened here was that a large percentage of IP addresses could not be resolved to an address any more specific than “USA.” When that address was geocoded, it returned a point in the centroid of the continental United States, which placed it in the state of - you guessed it - Kansas! Sadly, IP geocoding is prone to other distortions from networking architecture; for instance, at one time every user of AOL’s nationwide dialup service looked like they were connecting to the Internet from Reston, Virginia. Right now, my corporate VPN makes me look like I’m surfing the web from New Jersey even though I live in Maryland.\n\nOf course, if we shift Kansas’ average downwards, that doesn’t change Pornhub’s hypothesis that blue states consume more porn per capita than red states. I’ve already sufficiently argued my concerns with that, but I bring up this specific error because of the central failure it illuminated. If you want to call yourself a data journalist, there is one shortcut you can never take: you must validate your data. Even the cleanest looking data might contain flaws and omissions stemming from its methodology. It’s not enough to run checks on the data itself. You must also lift your nose out of the database, ask serious questions about how the data was collected and even use the well-honed tools of a traditional reporter to call experts when - never if - you find questions about the data.\n\nDoing It Better\nI know I promised I wouldn’t be a scold. But this is important. You might ask why I should care so much about a bit of viral silliness from Buzzfeed. First, I would argue it’s never just “all in fun” when you’re declaring half of the electorate more perverted than the other half. But more importantly, I don’t think the errors illustrated here are an aberration. Here’s another example of blindly trusting data to reach wrong conclusions. And another. By the hand-waving measures of traditional journalism, that’s three, making this a bona fide trend! I fear it will only get worse as publishing cycles become faster and the data analysis is done by single reporters harried by deadline pressure, with nobody to cross-check their work before publication. I don’t think we can slow this trend down, but what can data journalists do to avoid slamming into these sorts of problems at full speed?\n\nDistrust the Data\nFirst, remember that skepticism is your truest friend if you want to call yourself a journalist. It’s not hard to see the flaws in a flimsy study if you are predisposed to contemplate all the ways in which the data is probably bad rather than tacitly accepting it as good and tested just because someone else reported on it too. If you need further inspiration, I’d suggest looking at two excellent pieces from related fields on the value of skepticism. 
The first of these - On Being A Data Skeptic - is a free ebook from O’Reilly that describes a similar problem gripping data scientists: the belief that quantifying a model is the same as accurately describing it. It’s where I learned to think critically about proxies. The second of these - A Rough Guide To Spotting Bad Science - is an excellent run-down of all the bad ways statistics are applied in the worst scientific studies.\n\nDistrust the Motives\nAs journalists, it’s not enough to be skeptical of the data; you also need to be wary of the agenda that provided the data. What angered me the most about this study is that it was clearly framed from the start to go viral. You’d have to be willfully naive about the motivations of Pornhub and Buzzfeed to assume they wanted anything else here. And yet many sites acted as willing accomplices for a porn site that certainly didn’t mind seeing its name printed far and wide on the web. We mock publications that uncritically republish press releases, but how was this any different? Data usually comes with an agenda; few people collect data for no reason at all. This doesn’t mean that you must avoid all data completely for fear of contamination. For instance, if you were reporting on water quality, it would make sense to partner with a nonprofit advocating on this issue if their data seems objective enough. Sources have agendas too, after all, and that doesn’t prevent reporters from interviewing them. It would make less sense to uncritically use data freely provided from an industry you were reporting on. Most reporters can decide how much they want to trust their sources; similar reasoning should apply to data.\n\nSniff Out The Problems\nThere is a concept from programming I’d also like to see applied to data analysis. As programmers add features to a system, they write more code and add complexity, and both usually mean that more bugs are added as well. Refactoring is the name for a toolkit of approaches to clean up ugly code and reverse the bloat added to programs over time. Simply put, it’s a listing of bad practices you might observe in code with suggested remedies on how to fix them. These have been called “code smells” because to an experienced coder, recognizing these problems becomes as innate as smelling something that has gone moldy in the fridge. Similarly, everyone who reports on data can name a few of their favorite “data smells” - e.g., Benford’s Law violations, large standard deviations, double-counted or omitted records, category fields that are manually entered - but there is no central repository for this information.\n\nLearn Statistics\nI know it sounds terrifying, but I’d also recommend learning statistics. I don’t know why I didn’t take that step in college, but I’m glad to have the option of learning with a MOOC now. Both Coursera and EdX seem to have great options. Learn statistics if you can. I don’t mean you need to learn about advanced topics like ANOVA or Monte Carlo simulations, but no journalist should report on data if they don’t understand the difference between a mean and a median and what common measures of variance and spread are. If that is still too terrifying to contemplate, at least learn to think like a statistician and see how it changes your attitudes towards data.\n\nLook Back to Go Forward\nUltimately I suspect that many mistake-riddled pieces of data journalism run aground in the same shallow seas-things like shoddy data, misapplied proxies, and botched statistics. 
But I don’t actually have any data to confirm that suspicion. Greg Linch makes the important point that we should do the unpleasant job of cataloguing where the process went wrong in pieces of bad data journalism. Post-mortems are a common practice in computer programming to identify ways in which the best-laid plans go awry. That approach gives organizations insight into their own particular programming mistakes; maybe it would work for data too? As practitioners, we could start assembling a comprehensive list of data smells-of specific common problems-and gradually create a checklist of high-level classes of errors as a resource for data journalists and their editors."
        },
        {
          "id": "category-bots-with-thoughts",
          "title": "Bots With Thoughts",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/category/bots-with-thoughts",
          "content": "In the winter of 2012, I wrote a bot because I was sad.\n\nThe aftermath of presidential elections can be hard on journalists - although they are even worse for the loser of course - because after months of planning and breathing election coverage, all of our work is over in a single night. So, I will confess to a minor amount of despondency as I shut down the loaders and baked out the final version of the election pages. Is that all there is? Colleagues decamped to tropical beaches to restore themselves, one paper cocktail-umbrella at a time, but I stayed at home to look for something new to correct my drift. New projects loomed on the horizon, but it was November and who wants to start something huge just before the rush of the holidays?\n\nWhich is how I began writing haiku. More precisely, it’s how I came to write a computer bot inspired by haikuleaks that looked for haiku hidden within the New York Times. The process is pretty simple:\n\n\n  Load the New York Times homepage and look for articles you haven’t seen before.\n  Pull down the text of the article and separate it into sentences.\n  Tokenize each sentence into words and look up the number of syllables in the word\n  If the number of syllables in the word match the Haiku pattern, write it out as a possible haiku.\n\n\nThere are some additional details. For instance, knowing how many syllables are in a word was not trivial, but helped immensely by the CMU Pronouncing Dictionary. Tokenization can sometimes get confused, and there are sometimes articles we don’t want to find haikus from (i.e., stories about terrorism or plane crashes). The system is also not fully automatic: a human moderates haikus found by the bot before they are published to the site. But the entire process of discovery is fully bot-driven, finding haiku in the most unexpected places and making me smile to this day.\n\nThe definition of a bot is amorphous. We generally think of a bots as autonomous programs that do discrete-and often silly-tasks often for an indefinite period. For instance, there are the bots that pull and scramble text in unexpected ways like @TwoHeadlines, @4myrealfriends and @pentametron. Other bots might regularly pull from a remote data source like the English dictionary or ships crossing under the Tower Bridge. I’ve even seen web crawlers called bots these days, although this is often because of media confusion with spam botnets or other malicious actors. Generally though, when we say “bot,” we usually mean a generally harmless agent that acts on the behalf its creator in some fashion but has limited ambition and basic operations.\n\nBut, is that all there is? This definition is largely exact, but it misses the point: bots are also magic. I am aware this sounds wildly hokey, but I have a deep emotional regard for my favorite bots that doesn’t make much logical sense. It’s not because they perform some Turing-test trickery; there’s no confusing the best bots with humans. Rather, it’s because they occasionally by pure chance output something of unexepected beauty. You know how the bot’s rules work-most are only a few hundred lines of code-and yet it can still surprise you. Magic and whimsy are frustratingly elusive in computer science, often encountered in university courses or hackathon before being muscled aside by the needs for practical programming introductions. 
My first serious university programming textbook has a freaking wizard on the front cover, while the last few introductions to programming frameworks I’ve read have each taught me how to code to-do apps. Is it any wonder we find the incoherent ramblings of a bot so appealing? They remind us of who we were when we first started to program.\n\nBut this is Source and not the New York Review of Bots, so what about bots and the news? As part of Botweek, we’ve seen a few serious cases of bots being used for news, reporting on earthquakes or nailbiters in sports. Furthermore, Ben Welsh has been ranting about writing journalistic bots for years and Derek Willis has written reporting triggers on Congressional bills and campaign-finance filings. But, it feels like we could do so much more. The current generation of bots look for easily-defined news triggers like an earthquake alert or a bill being passed. Humans write the rules and the bots execute on them. But what about things that aren’t so obvious? For instance, if an incumbent is not able to raise funds from an industry he used to rely on for backing, that’s a great sign of future election troubles. Political talking points could be traced back to concerted pushes by party officials and think tanks. Outliers and aberrations in crime statistics could reflect either policing success or tampering with the numbers. These are all further rules we could potentially enumerate in our code were we exhaustive enough, but it feels like the next step beyond is to make the bots more intelligent by adding learning to them. We need a bot with a sense of aesthetics.\n\nThis is easier said than done though. Training a bot would be a tedious process that could go wrong in various ways. And training the bot to flag certain things doesn’t mean it wouldn’t miss others. But once our bots can accept feedback and adjust their behavior, they become more than mere shadows of ourselves. All of which makes for some exciting hypothetical discussions: must your bot work under the same code of ethics that covers your journalism? Yes. Are you legally responsible if your bot causes some damage? Maybe. Is there any established journalism created by learning machines? Depends on if you count humans as learning machines. Obviously, there are a lot of details that would need to be worked out-and I can’t imagine trying to run the phrase Hidden Markov Model by a copy editor-but someone will clearly do it someday. That someone could be you.\n\nOn the flip side, we must also be savvier about how bots are intruding into the areas we report on. For starters, everywhere online is polluted by bots pretending to be people. Weeding out the bots and figuring out their motives will be a priority for any project you do with social media or other online content. As more and more organizations and agencies produce terrible generated content, there will always be utility in figuring out how to reverse-engineer the raw data embedded within bot-written content. What better way to counter the bots out there than to write some bots of your own?\n\nBut what if you just want to write silly bots too? Excellent. Even a “useless” bot can serve a worthy purpose. It can challenge you to try some new programming techniques. It can add some welcome silliness to your Twitter timeline. Or it can just bring you a little warmth when the days are getting dark and cold around you. That’s not much in the grand scheme of things, but it will suffice."
        },
        {
          "id": "published-times-regrets-error",
          "title": "The Times Regrets the Error",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/times-regrets-error",
          "content": "It hasn’t happened yet, but there will be a day when the New York Times or another major newspaper will run a correction because of an error in a piece of data journalism. Corrections are an inherent part of journalism and we have standards for issuing them in traditional narrative reporting. It’s part of the process. Because we attempt to report on events as they are happening, we will make mistakes. As long as you have done your best to avoid errors, there is no shame in admitting when they have occurred, and corrections are an honorable and effective means to indicate problems with a story that arise after publication. In their pedantic specificity, corrections can sometimes be comical, but the rules for corrections are pretty clear: if you mess up a fact anywhere in the story, you issue a correction. But what about data?\n\nA typical narrative news story probably contains a few hundred facts or so: names, quotations, locations, factoids, or numbers derived from official government sources. But what about a data journalism piece? Imagine an interactive for a large dataset like school test scores or campaign donations with millions of rows of data? Isn’t each of those records made of many different facts? Now, what if one of those fields is wrong? Do we mark it on the page for that school? Do we issue a correction on the front page? Do we just fix it and say nothing? What if the data has just been distilled into a high-level analysis and a story. The right answer depends on both circumstances and attitudes. In fact, I’m not sure there is a right answer at all. But before we discuss what’s right, let’s look at all the ways things can go wrong.\n\nA Carnival of Errors\nData journalism does not fall perfect from the sky. It’s painstakingly built. As far as I can determine, nobody has formally explicated the process, but in my own experience it generally involves the following steps:\n\n\n  Acquisition\n  Cleaning\n  Loading\n  Verification\n  Analysis\n  Cherry-picking\n  Presentation\n  Maintenance\n\n\nThe fun of data journalism is that each of these steps can introduce errors that can affect the final story. Yes, such fun contemplating all the ways you can go wrong. Let the hilarity commence!\n\nAcquisition, Cleaning, and Verification Errors\nFirst, in order to report the data, you have to collect it. This is often not as straightforward as it sounds. They will try to stop you, although the line between malice and incompetence is hard to determine. I have seen data tables released in PDF to make them harder to work with and I have seen them released in PDF because the provider thought it was actually a useful format. Even friendlier formats like CSV or SQL dumps may be incorrectly generated. Files may be missing or truncated. And there is nothing that government agencies love more than making their own convoluted data formats.\n\nData often has to be cleaned and verified. Duplicate rows are common. Missing data is much, much harder to identify. Columns you think are fixed categories might actually be freeform text with typos and spelling errors. Typecasting can truncate numbers. Dates and times can be misinterpreted in so many ways. Without context, your computer can only guess “11/1/2013” could be November 1st or January 11th, and times will generally be assumed to be in your own timezone. Even more subtle bugs are possible. The code you write to clean data may introduce new errors into it. Once you have cleaned the data, you must verify it. 
Once you have cleaned the data, you must verify it. Is it an accurate model to tell the story you wish to tell? What does it include? What does it leave out? If you are joining it against other data sources, is that combination accurate or does it make assumptions?\n\nAnalysis Errors\nScott Klein of ProPublica often describes a good data journalism project as containing both near and far components. By this, he means that it allows the reader to explore overall trends (e.g., school ratings across the country) and examples specific to them (e.g., how is my local school doing?) Unfortunately, errors can creep into both of these analyses. The near picture can be plagued by missing data, shoddy geocoding, or other simple woes. Crazy outliers and duplicate records can distort many crude statistics like sums or averages in the far picture.\n\nPublication and Update Errors\nPublication sometimes has its own issues. Transforming a database into web pages often involves loading data into web frameworks and rendering them out. Errors can occur. Data can be typecast into formats that truncate or distort it. Table columns can run into each other, making new numbers out of two separate values. New programming errors may return wrong results. Maps might put locations in weird places. Sometimes, this is a result of poor geocoding. Sometimes, it’s another manifestation of my own personal bugbear; I hate it when maps geocode an area like “Chicago” or the “United States” by placing a single point in the centroid of the area. Together, we can stop this practice from happening.\n\nAnd often, when you are done with this process, you will repeat it in the future. The most compelling data projects are not static; they are updated with new data, whether it’s the next year’s school test scores or the next minute’s election results. What do you do with the old data? If it supplements the prior data, do you still provide the old values somewhere? If it replaces the old data, do you show where values were changed? What if it fixes noticeable errors in the original data? Do you give people a chance to replay prior revisions even if some of them are wrong?\n\nAn Example\nReporting election results can be a remarkably fraught process. Like almost every media organization, the New York Times retrieves election data from the Associated Press, which employs a number of stringers to report results from local election officials. Vote counts are updated frequently during the course of a night until unofficial tabulation is concluded (official counts usually come out weeks or months later). Obviously, there are many opportunities for errors to occur, but let me highlight two scenarios from the September 10th primaries in New York City.\n\nFor this election, we were also provided with precinct-level reporting, which makes for some pretty awesome maps, but it also led to transient reporting errors. For instance, for six or so hours, the data reported a surprisingly large victory of 1,000 votes for longshot candidate Neil Grimaldi.\n\nThis was obviously a data entry error, since most precincts averaged approximately 128 votes and those 1,000 votes did not appear in Grimaldi’s citywide vote tally. Eventually the AP feed corrected the problem and our results map updated with a more accurate tally. Some readers did notice it, but we felt it was enough to replace the data without issuing a correction, because it’s accepted that early returns are likely to be corrected.
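\n\nA loader can flag this kind of aberration before readers do. Here is a sketch of the sort of sanity check one might bolt onto a results feed; the field names and thresholds are invented for illustration:\n\n\n  import statistics\n  def flag_aberrant_precincts(precincts, max_sigma=3.0):\n      # yield precincts whose vote counts are wild outliers next to the rest\n      votes = [p['votes'] for p in precincts]\n      mean = statistics.mean(votes)\n      sigma = statistics.stdev(votes)\n      for p in precincts:\n          z = (p['votes'] - mean) / sigma if sigma else 0\n          if abs(z) > max_sigma:\n              yield p, z\n  # e.g., one precinct reporting 1,000 votes while the rest average about 128\n  precincts = [{'id': i, 'votes': 128} for i in range(60)]\n  precincts.append({'id': 99, 'votes': 1000})\n  for precinct, z in flag_aberrant_precincts(precincts):\n      print(precinct['id'], round(z, 1), 'standard deviations from the mean')\n\n\nA second check, comparing the sum of precinct votes against the feed’s own citywide tally, would have caught this particular error as well.\n\n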
In contrast, the AP data also erroneously reported the name of a city council candidate as Paul Garland (his middle name) instead of David Garland. Unlike the aberrant votes, this error did result in correction text being appended to the results.\n\nA Corrections Policy for Data?\nIn both of these examples, errors appeared in the election data and were corrected in the interactive. But one of them merited a text correction and one of them didn’t. I think this makes sense in both circumstances, but it strikes me how difficult it was to find a data corrections policy enumerated anywhere publicly. Indeed, some fitful Googling revealed only this corrections policy from ProPublica:\n\n\n  News apps and graphics should follow your newsroom’s standard corrections policy. Observe the following additions:\n\n  When data is incorrect, place the correction language on every page that once showed the incorrect data point. That may mean that a correction will appear on thousands of pages.\n\n  When an app’s data is refreshed and the corrected data has been removed or superseded, remove the correction language to avoid confusion.\n\n\nThis is an admirable policy to follow, but it works best for only specific types of interactives that are published once using verified data and infrequently updated. For instance, in an election we are updating pages every few minutes, meaning a correction would quickly vanish. Should it? Or should we accumulate more and more corrections as the night progresses and transient reports are clarified? And if we computed citywide totals by summing up precinct votes, that erroneous precinct would lead to some distortion in the unofficial numbers. Should we report a correction for those derived figures too? I’m not criticizing the ProPublica policy here. The point is that thinking through ProPublica’s policy highlights the fact that the world of data interactives is varied enough that it’s hard to find one policy that fits all.\n\nWhat’s Next?\nMaybe a unified policy is too much to expect. We could consider several different factors in specifying a correction policy for any specific interactive:\n\n\n  Is the data updated rapidly or infrequently/never?\n  Is the data official or provisional?\n  Has the data been processed or is it direct from the source?\n  Are derived measures like totals or averages included in the data or computed afterwards?\n  Does the data include geocoded values or were those derived later?\n  Does the interactive imply a specific level of accuracy that may be misleading?\n\n\nAnd so on. What I’m suggesting here is that there is one optimal correction policy for the single-page static graphic, another possibly different one for a browsable site of school ratings, and another different still for election or Olympics results, where the upstream provider only promises the data will eventually be correct, not that errors will never occur.\n\nIt might be possible for an individual news organization to codify a data correction policy as a decision tree, although that might be a more arduous process than it sounds.
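\n\nTo make that concrete, here is a deliberately crude sketch of what the first branches of such a tree might look like; every category and rule here is invented for illustration, not an actual policy:\n\n\n  def correction_policy(interactive):\n      # a toy decision tree; a real policy would need many more branches\n      if interactive['updates'] == 'rapid':  # e.g., live election results\n          if interactive['status'] == 'provisional':\n              return 'replace silently; the feed promises eventual accuracy'\n          return 'append correction text until the next data refresh'\n      if interactive['derived_measures']:  # totals or averages we computed\n          return 'correct every page that shows a derived figure'\n      return 'append correction text to every page that showed the bad value'\n  print(correction_policy({'updates': 'rapid', 'status': 'provisional', 'derived_measures': False}))\n\n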
We could then imagine data corrections being automatically applied in some cases, although we’d likely need to aggregate error reports to prevent the reader from being drowned in very specific corrections for each data point.\n\nIt would be a hard-fought achievement merely to catalogue all the types of errors that might occur in these steps of data journalism (I imagine them organized like HTTP status codes but far more numerous and with developer errors included).\n\nI also wonder if we should always try reporting our data in such a way that replay is possible. One can imagine something like a single data table being committed into a source repository like GitHub to see the possibilities and pitfalls of this. The problem is, most interactives are not single tables. They often comprise entire databases. It might be difficult to let users replay the night. There may be reasons why the raw data can’t be shared. And many changes might simply be the result of new data coming in rather than errors in the original table (it would be nice to be able to annotate specific errors with corrections later). Still, the idea might be interesting to explore as a proof of concept on how to track changes to data the way NewsDiffs does for text. Similarly, would a machine-readable format make sense for presenting data corrections? How might that look?\n\nAnother approach is to get even more granular. What if we could easily generate an audit trail for changes to every bit of data in an interactive? A large number of errors can occur at the data-cleaning stage, and an automated system that complements all data overwrites with an audit trail might be useful for verifying that no mistakes were made along the way. This seems pretty cumbersome for any of us to support, especially on deadline, so all most of us do is make backups and do our best. Is there a way to store data that would track all changes nondestructively and produce its own audit trail? Would this approach be useful for identifying errors or would it make us unable to see the forest for the trees?\n\nNo matter how we approach the problem, it’s clear our glorious data journalism future will be riddled with errors. But I’m not mistaken in hoping that we can figure out better ways to handle those errors when they arise.\n\nCorrect me if I’m wrong?"
        },
        {
          "id": "published-remember-posterity",
          "title": "And Remember, This Is for Posterity",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/remember-posterity",
          "content": "The Web celebrates the ephemeral. It’s a hoary cliché that the Internet annihilates geography, but it also doesn’t care much for history. We laugh about the days when we used to have Friendster accounts and use flip phones, but that was only 10 years ago. All of that is gone now. That’s the Internet. We focus on the next big thing, launch, disrupt and then expire our once-beloved projects when they’re no longer worth maintaining. Thus, it’s hardly surprising that it’s easier for me to read an issue of the New York Times from 1851 than the election results from 2000. So what? Websites expire all the time, but we’re journalists. We like to think our work is for the ages.\n\nI’m not the first or the only one to notice this. Indeed, Matt Waite has already written an excellent piece called “Kill Your Darlings” on the important need to think about how we end our projects before we begin them. If you haven’t read it, go ahead now. I’ll wait.\n\nOkay, welcome back. Matt’s article highlights a key point: as developers, we are often only thinking to the next milestone and slightly beyond. Definitely not into the next year. Or 20 years from now. That’s also true of traditional narrative journalists. The good ones are often only writing for their next deadline. And yet, their work is perfectly designed for posterity. English changes but at a far slower rate than programming languages. Paper will crumble eventually, but any pile of Zip Disks lurking in old desk drawers testifies that print is more durable than many digital formats. While individual sections and design specifications may change, the newspaper format has been largely consistent for years. Thus, it’s for the most part possible to read a newspaper from 50 or 100 years ago. Some of the context may not make sense, but you can read it.\n\nDeath Is Not the End\nSo, what are the steps to take when it’s time to mothball an application? The key is to remember that nothing is more resilient for the future than a static page. If your application was in a dynamic framework, the first step is to crawl it and save static version of all pages. In some cases, this may be as simple as running wget, but for sites that are not readily indexed, that might not be possible. An alternative approach would be to figure out the routes in your application and to iterate over all possible objects. In either case, it’s important to not forget to also save javascript files, stylesheets and such. You might be tempted to use a third-party CDN like Google’s hosted libraries or CDN JS, but that also makes your site vulnerable if that service ever shuts down. Some web applications may also load elements via JSONP callbacks, and it’s easy to forget to save those.\n\nSearch is another question. If you have a small site, it might be simple enough to just disable search and use index pages (ie, select a state or schools that start with A), but large sites are unusable without search. So, your best options are to either rely on a third-party service like Google CSE or to perform the search in the client via javascript (that approach would likely involve generating an index that would have to be loaded on any page). The key to mothballing is that the final product should never connect back to a server to do things like search, pagination. 
You have to assume it will be hosted on a web server that can’t run any scripts or connect to databases.\n\nUnmarked Graves\nDeath is often worse for news applications, precisely because our work often stands apart from the main sites of the organizations that employ us. Almost every news programmer loathes their organization’s Content Management System; its codified formats and rigid workflows often feel more like strictures on our projects. And so, we do our work outside the CMS, skinning our pages so they look like the main news site while remaining architecturally apart. For instance, look at how we reported election results in 2012. It’s actually hosted on Amazon S3 and skinned to look like New York Times content. Why go through this extra work just to make it look like articles produced via the CMS in the end? In our case, controlling our own technology stack enabled us to do dynamic projects like election results that wouldn’t be possible within the CMS. Also, the CMS model for stories is a poor fit for data projects that may include many thousands of browsable pages; you just can’t and shouldn’t represent a relational database in a CMS. So, we do our work outside the bounds of the CMS, but it has a cost.\n\nThe New York Times has an advanced and bespoke CMS called Scoop that is used for composing all aspects of the New York Times website. Currently, Scoop imports articles from the print CMS that governs the physical print newspaper, but the plan is to soon invert that into a “web first” workflow where all articles are composed in Scoop before being laid out for print. Scoop is tightly integrated with the website and the newspaper. It is what web editors use to classify documents against the proper taxonomies and to rank articles on the homepage and section fronts. When stories are published, they are automatically syndicated to partners, published into the appropriate RSS feeds and added to site search. Stories also flow quickly into web search engines like Google and products like Lexis-Nexis. Of course, print articles also are distributed in a reasonably durable form to subscribers, some of which are libraries that also get the newspaper in microfilm format. Other news organizations have different CMSes, but the general components of each infrastructure are similar: importing, syndication, indexing and archiving.\n\nNarrative journalists rarely think about this infrastructure. It’s just there for everything they write, because everything they write goes through the CMS and there are strong archival and financial reasons to syndicate, index and archive that content for posterity. But, then there’s us data journalists. Remember, we decided to pitch our tents outside the CMS so we can build exciting and new types of interactive website experiences. Which often means that our work is invisible in this greater world. It doesn’t show up in site search. It doesn’t show up in Google News. It isn’t rankable on the homepage. Our projects look like they belong to the website, but they are also fundamentally apart and often invisible when running. When they are mothballed, they can vanish almost completely.\n\nSo, what is to be done? You need to make some friends and leave your little fiefdom:\n\n  Find the developers on the CMS team and talk to them.\n  If your company has indexers and archivists, talk to them too.\n  Target important aspects of the website ecosystem.\n  Figure out where to bury your projects when they’re dead.\n\n\nYou will likely have to tackle integration in fits and starts.
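\n\nOne small, CMS-independent win is simply telling crawlers and your own site search that a project’s pages exist at all. Here is a sketch that bakes a standard sitemap.xml next to a mothballed project; the URLs are placeholders:\n\n\n  from xml.etree import ElementTree as ET\n  # the baked-out pages of one hypothetical mothballed project\n  states = ['iowa', 'ohio', 'florida']\n  pages = ['https://www.example.com/elections/2012/results/%s/' % s for s in states]\n  urlset = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')\n  for page in pages:\n      url = ET.SubElement(urlset, 'url')\n      ET.SubElement(url, 'loc').text = page\n      ET.SubElement(url, 'lastmod').text = '2012-11-07'  # the last real update\n  ET.ElementTree(urlset).write('sitemap.xml', encoding='utf-8', xml_declaration=True)\n\n\nSubmitting that file is the easy part; the deeper hooks are where the politics start.\n\n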
Most CMSes are not monolithic, and this is actually an advantage. You may be able to add your content directly to the site search index or syndication workflows without having to interact with the core CMS software. There will likely be some strange workarounds in your future; it’d be nice if the CMS team gave you a direct API to call, but if your code breaks the CMS at 3 AM, you’re not the person who will get the wakeup call, after all. Finally, see if you can bring your content into the organization as pages. We often build our sites on separate servers like Amazon S3 or EC2, but if someone forgets to pay the bills for hosting, those sites will vanish. And we want them to stick around for a long time, even if they are only static versions of their earlier glory.\n\nRage Against the Dying of the Light\nMy discussion of posterity may seem ludicrous when applied to web projects. Do I really think it’s possible to preserve a web interactive for a hundred or more years? Yes, I do think it’s possible - eventually. But for now, I would simply like to see interactives last more than five years. It’s surprisingly hard to plan for the future of the web. For instance, this election site from 2008 seems to have held up well, but opening it on an iPad reveals that much of the site disappears if Flash is not installed. Sites based on Java fare even worse. I would like to think that modern websites based on web standards like HTML5 will better survive time’s bending sickle, but this confidence is likely misplaced. For instance, what if our page relies on a javascript function that’s deprecated in the future and removed soon after? What if Javascript itself falls out of favor, supplanted by a new technology and eventually dumped by all browsers on the Galactic Hyperweb?\n\nMore likely, though, our sites will fail through dissolution rather than incompatibility. Modern web pages are built from many requests: pulling HTML from the web servers, javascript libraries and stylesheets from content-delivery networks (CDN), data files from other API endpoints or edge networks. All it takes is for a few of those dependencies to break and a clever example of interactivity can become unusable. Link rot is inevitable. I sometimes like to look at the New York Times’ Hyperwocky website to reminisce about the goofiness of the Web in the late 90s. The site is still readable, but its context is a shadow of itself - most of the links are dead. Serious links also die. A recent study found that 49% of links within Supreme Court opinions no longer work. In our own projects, link rot might have similarly bad effects. It might cause scripts or stylesheets to no longer load; it might make contextual links like “for more detailed analysis, click here” fail; it might even make us think our old sites work only to find out we are so very wrong.\n\nThe difficulty of this exercise is that the future remains stubbornly unpredictable. Static mothballed versions of our sites will work best for the short term. But should we be thinking further down the road and creating more basic versions of our content in the hope that they’ll last longer? Should we be archiving our content in a more durable way? The Internet Archive already has been, to some extent. In addition, the Library of Congress has been heavily involved through its National Digital Information Infrastructure and Preservation Program. 
Much of this material is only riveting to digital archivists, but they’ve put some thought into what digital formats will age the best for posterity. Among these is a specification for encapsulating projects into a single web archive (or WARC file) that would at least ensure internal link consistency. The Internet Archive is already using these to represent crawls of sites, but it might be useful to consider this format for manually snapshotting our own sites at various newsworthy moments or for ensuring a complete archive of large interactive sites with thousands of pages. And we might consider producing “legacy” versions of our projects with minimal to no javascript, maps baked out into static images, and so on, if we were really serious about longevity.\n\nThis may seem absurd and it probably is. If it were part of the requirements for a site that it had to be functional for 20 years after it was decommissioned, we probably wouldn’t bother. And for many light things we do like “send in your dog photos,” it would be overkill. And yet, we do also cover hard news like elections or the Olympics or serious investigative pieces. Shouldn’t we do more to ensure our work is there for future historians rather than just ceding that to whatever appeared in a newspaper the following day?\n\nWhat’s Next\nWhile I was writing this piece, I realized quickly that I was in over my head. As a developer, I simply do not have the mental framing to think like an archivist does, and I doubt I’m alone in that regard. Looking into those websites and standards, I was confused by all the jargon. As a developer who regularly quotes technical acronyms and the Hacker Dictionary, I am aware of the irony in this. Several organizations already have defined programming style guides; maybe we should consider some archiving style guides too? This is something we could work with archival organizations to develop, and as Matt Waite’s piece shows (if you haven’t read it by now, please do so), it’s a lot easier to plan for posterity in advance than when the project has ended and people have moved on to other things.\n\nDue to the varying capabilities of different web browsers, web designers early on learned to code their sites to support graceful degradation, where the app regresses to a more limited but still usable state if certain functionality is not available. This has since been supplanted by the concept of progressive enhancement, where sites are designed to work for a baseline first and functionality is added for more advanced browsers that can support it. These concepts may seem similar in execution, but they are derived from different philosophies and assumptions on how users will upgrade their browsers or what they support. For instance, the rise of mobile devices negated the assumption in graceful degradation that browsers will get faster and more advanced with time. Thinking about our sites in terms of degradation or enhancement seems like an excellent basis for future compatibility. Will there be a time when we can assume that browsers are much faster, but they lack compatibility for some of the standards we take for granted today?\n\nWe will also need to build tools. Django already has the excellent Django Bakery plugin for baking out dynamic sites into static pages, but there is no equivalent solution for Ruby on Rails or some other web frameworks. We also need better tools for verifying that web archives are internally consistent and not missing any files, including stylesheets or JSON loaded by scripts. 
It’s not glamorous work, but it’s specific and well-suited for well-organized minds who have the methodical skills I personally lack."
        },
        {
          "id": "published-data-sausage",
          "title": "How the Data Sausage Gets Made",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/data-sausage",
          "content": "Let me start with a caution. This subject-both the food issues and the code issues-might make you queasy. Food safety is an issue that’s of critical importance. In the U.S., food safety is long on data and short on ways to make the data usable. Every few months, we get another multi-state outbreak that reminds us of the safety problems in our food supply and how significant they are. Sadly, these problems are largely inevitable; to keep food costs low as we expect them to be, companies cut corners or import more food from other countries with laxer food-safety laws. Meanwhile, federal regulatory agencies are unable to adequately police an increasingly complex food supply chain. Many people think about food poisoning in terms of meat. There is a reason for this; in 1993, there was a severe outbreak of food poisoning at 173 Jack-in-the-Box restaurants, caused by a relatively novel strain of the E. Coli bacteria (O157:H7). It hospitalized 171 victims and killed 4 people, 3 of whom were small children. Since then, we’ve come to expect regular problems with ground beef. But, meat accounts just only 22% of food poisoning outbreaks; in the past few years, there have been several major outbreaks stemming from cantaloupes, spinach, sprouts, and peanut butter.\n\nHowever, when it comes to shockingly-large outbreaks, meat is still king. And E. Coli is a persistent problem. Many victims of E. Coli food poisoning will recover, but children and the elderly can develop Hemolytic-Uremic Syndrome (HUS), which can lead to kidney failure, paralysis, and death. In 2010, New York Times reporter Michael Moss won a Pulitzer Prize for his reporting on the beef processing industry. In one article, he told the story of a dancer who became paralyzed from HUS after she ate a single tainted hamburger in 2007. I’ve always been interested in food safety, and I felt like there were many safety issues besides those affecting meat that were begging to be explored and many questions to be answered. I looked for the government data to answer them. The most obvious data to start with was food recalls.\n\n\n  What are the common causes of food recalls?\n  How frequent are food recalls? And how many of those are because of  E. Coli?\n  What is the typical volume and distribution of a food recall?\n  What data-driven picture could I build up about the food supply?\n\n\nI am a software developer who works within the newsroom of the New York Times. I work on news-driven projects like our Olympics or U.S. Elections websites. What we do is called data journalism - it is also known by the quaintly dated moniker of computer-assisted reporting - because we often do similar things with data as journalists do with sources, such as:\n\n\n  Gathering the data we need to tell a story\n  “Interviewing” the data to find its strengths and limitations\n  Finding the specific narratives in the data we want to share and can support with data\n\n\nAs a computer science major, I’m far more experienced with data than the journalism aspect of my career, but food safety was an area I could get experience gathering the data and working with it in a narrative way. What do I mean by narrative? Narrative is what makes it data journalism. We could just put a large PDF or SQL dump online, but that’s not very informative to anyone but experts. The art is finding the stories in the data the way a sculptor finds a statue in the marble. 
Since we are turning data into a story, we also need to keep the data-and thus the story we make from it-accurate and objective. I wanted more practice working with data. So, I started by scraping food recalls.\n\nIn this case study, I’m going to just discuss a single type of data associated with food safety: food recalls. Along the way, I’ll illustrate some techniques I use for wresting data out of raw text and the limitations of the results. Finally, I’ll throw down the gauntlet and suggest ways in which you could explore making all of this better.\n\nFood Recalls\nThere are two agencies that regulate recalls in the USA: the U.S. Department of Agriculture (USDA), which inspects meat and poultry; and the Food and Drug Administration (FDA), which oversees seafood, processed food, and everything else - they also inspect medical devices and pharmaceuticals. Neither the FDA nor the USDA is allowed to forcibly mandate a recall, but they help to find the sources of problems and publish the press releases from companies once recalls are issued. I have been parsing recalls from both the FDA and the USDA, but for the sake of brevity, I’m just going to talk about the USDA here. The USDA Food Safety and Inspection Service website is where the USDA’s food recalls are posted. Recalls are posted as freeform text releases. Whatever data I needed, I would have to pull out of the text myself. The FSIS includes current recalls and an archive back to 1994 (although the format changes for recalls before 2003).\n\nAll recalls are posted as press releases, but they follow the same general format, as is apparent from looking at a few of them. At a glance, the following information seems to be present in all recalls:\n\n\n  The title of the recall\n  The reason for the recall\n  The recalling company\n  The category of food being recalled\n  The date the recall was issued\n  The volume of the recall (often but not always in pounds)\n  The geographic range of the recall\n\n\nThis looks like a good start to a data schema. There is some other fascinating information in there too (product labels, UPC codes, retailer lists), but I needed to start somewhere, so I collected only the data types in the list above.\n\nModeling the Recalls\nThe first step was to create a place to store the data about recalls. I use the Ruby on Rails web framework, so I created a new Rails project. The next step was to define the appropriate models. Each recall has an associated reason and a category of food (more on that later). In the ActiveRecord framework for Object-Relational Mapping (ORM), a recall is described like this:\n\nclass Recall &lt; ActiveRecord::Base\n  belongs_to :reason\n  belongs_to :food_category\n  belongs_to :company\nend\n\nThere will likely be many recalls associated with a particular reason (e.g., “E. Coli”) or in a particular food category (e.g., “Ground Beef”), and creating separate tables for them is a common data approach. 
Here is the schema for creating the recalls table:\n\ncreate_table \"recalls\", :force =&gt; true do |t|\n  t.string   \"title\"\n  t.string   \"url\"\n  t.string   \"type\"              # USDA or FDA in my database\n  t.text     \"html_content\",      :limit =&gt; 2147483647\n  t.string   \"parse_state\",       :limit =&gt; 12\n  t.string   \"source_id\",         :limit =&gt; 64\n  t.integer  \"reason_id\"\n  t.date     \"recall_date\"\n  t.integer  \"volume\"\n  t.string   \"volume_unit\",       :limit =&gt; 16\n  t.string   \"summary\",           :limit =&gt; 512\n  t.integer  \"company_id\"\n  t.integer  \"food_category_id\"\nend\n\nIn addition, I decided to create specific categories for the reasons and the type of food. This way, I could use a controlled vocabulary of keywords for those categories, making it easier to find all the recalls of a specific type. For instance, “undeclared allergen” is a single entry on my list of reasons, regardless of whether it’s sulfites, eggs, nuts, or other unlisted allergens that triggered the recall. This approach requires me to devise the categories and reasons I want to tag with, but it makes searching for matches much easier than freeform text fields. I also decided to create a separate companies table in case I wanted to associate multiple recalls with a single company.\n\nScraping the Pages\nMy first goal in this project was to grab the recalls from the USDA website. This sounds simple enough, but it’s actually a process with several steps.\n\nFetching from Unreliable Sources\n\nModern programming languages make it painfully easy to read content from remote web pages. In Ruby, for instance, the open-uri library allows programmers to open remote web pages as easily as they would load files on their local filesystem. Which is great, except the Web is like an extremely unreliable hard disk. It seems like almost every government web server will crash under even moderate load, and it can take several runs to fetch all the pages of an archive. Even when the remote server is working, it can be painfully slow. I knew I would likely be tweaking my code to process the USDA recalls iteratively. Thus, it was important to cache the HTML of recalls locally. Scraping and analyzing the pages thus involves the following distinct steps:\n\n\n  Populate the Recalls table with one record for each recall URL we want to fetch.\n  For each initial recall, fetch the HTML and save it to the Recall record. Mark the recall as retrieved.\n  For each recall marked retrieved, run our analysis to extract the data fields. Mark the recall as analyzed.\n  Sometimes I just need to hand-edit the fields for a record. To keep it from being overwritten if I reparse all the records marked analyzed, I’ll mark those as verified.\n  Sometimes, I have records that aren’t actually recalls but have URLs that look like recalls. I could delete them, but they might get added again when I look for recalls. Instead, I’ll mark them rejected and ignore them in all other steps.\n\n\nIf you’ve been following along here, you’ll notice that I’ve described each Recall object as a Finite State Machine. And the parse_state field above is where I track the state of each recall. This might seem like an overly mechanistic way of looking at things, but it works well for scraping websites. This way, when the retrieval script bombs out halfway through fetching recalls from the USDA, it’ll resume without fetching pages it already gathered. Similarly, analysis can be rerun in small batches and can recover from its own crashes.
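\n\nHere is a hedged sketch of how those state transitions might look in the model; the method names are my own invention for illustration, not the project’s actual code:\n\nclass Recall &lt; ActiveRecord::Base\n  # parse_state drives a simple finite state machine:\n  # initial -&gt; retrieved -&gt; analyzed -&gt; verified (or rejected)\n  def fetch!\n    return unless parse_state == 'initial'\n    update!(html_content: URI.open(url).read, parse_state: 'retrieved')\n  end\n\n  def analyze!\n    return unless parse_state == 'retrieved'\n    extract_fields   # hypothetical method wrapping the regexp parsing described below\n    update!(parse_state: 'analyzed')\n  end\nend\n\nBecause each step only advances records that are in the right state, a crashed run can simply be started again.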
\n\nPopulating the Recalls Table\nOkay, I now had a plan. The first challenge was to figure out how to find the recalls on the USDA website. After spending a bit of time clicking around the FSIS site, I found there were clearly three types of pages where USDA recalls might be listed:\n\n\n  The current recalls page: http://www.fsis.usda.gov/fsis_recalls/Open_Federal_Cases/index.asp\n  The current year’s archived recalls: http://www.fsis.usda.gov/FSIS_RECALLS/Recall_Case_Archive/index.asp\n  An archive page for an earlier year: http://www.fsis.usda.gov/fsis_recalls/Recall_Case_Archive_2011/index.asp\n\n\nSo, we can figure out the URLs of all recalls by loading these “index pages” and gathering the URLs of the recalls linked from those pages. Looking further at the recall index pages reveals that a Recall URL looks like one of two types:\n\n\n  http://www.fsis.usda.gov/News_&amp;_Events/Recall_010_2013_Release/index.asp (2007 onwards)\n  http://www.fsis.usda.gov/fsis_recalls/RNR_053_2005/index.asp (before 2007)\n\n\nTo find the recalls linked off one of these index pages, I simply had to check if a URL matched either of these two regular expressions:\n\n\n  /http:\\/\\/www\\.fsis\\.usda\\.gov\\/News_&amp;_Events\\/Recall_\\d+_\\d{4}_Release\\/index\\.asp/\n  /http:\\/\\/www\\.fsis\\.usda\\.gov\\/FSIS_Recalls\\/RNR_\\d+-\\d{4}\\/index\\.asp/\n\n\nMy first step was thus to create pending Recalls: I fetch the HTML of each of the USDA’s archive pages, find all the URLs inside each page, and create a new Recall record for each URL matching one of these regular expressions that I haven’t seen before. This code does not fetch the HTML for the recalls. That is done by a separate method that looks for all Recalls in the initial state and retrieves the HTML for each of them individually.\n\nIt would be okay if this process ran in batches or crashed halfway through. It’ll just continue from where it left off. Actually, it did crash a few times and it was really slow. Crawling all the recalls back to 2004 took 3-4 days. Good thing I’m saving the HTML locally. But I found 492 recalls in the process, which is a decent data set (if you are curious, I also collected 2,811 recalls from the FDA).\n\nAnalyzing the HTML\nNow that I had the raw HTML saved locally, I could extract the data I wanted from it. Were I a great programmer, I would have devised an elegant method for algorithmically understanding the data in the recall; natural-language processing or some machine-learning mechanism seem like promising approaches. I’m not a great programmer, however. Faced with a problem like this, my only tool is a rough sledgehammer: more regular expressions. This approach is crude but effective, as long as you remember the most important rule: Never use regular expressions to parse an HTML document. Such is the path to madness.\n\nA far better approach is to parse the HTML into the Document Object Model (if you are using Ruby, try the Nokogiri library; Python coders can use BeautifulSoup). These libraries will allow you to select specific sub-elements of the document using XPath or CSS notation. Most modern websites use semantic HTML. This means they define their layout using descriptive classes named things like class=\"summary\" rather than simple stylistic CSS classes like class=\"bold larger-font justified\". This approach makes it easier for designers to redesign a page later, but it conveniently also makes it much simpler for us to find the elements we want to scrape data from.
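\n\nA quick illustration of that selection style with Nokogiri; the URL and the summary class here are hypothetical:\n\nrequire 'nokogiri'\nrequire 'open-uri'\n\n# On a semantically marked-up page, the selector is the whole trick.\ndoc = Nokogiri::HTML(URI.open('http://example.gov/recalls/123.html'))\nsummary = doc.css('td.summary').first\nputs summary.text.strip if summary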
\n\nUnfortunately, the USDA FSIS recall site is not a modern website. The entire page is formatted using nested tables, and the only use of CSS classes is for basic text formatting; when you see a CSS class named BodyTextBlack, you know you are screwed. The following excerpt provides a taste of what awaited me:\n\n&lt;!-- BEGIN PAGE CONTENTS UNDER BANNER IMAGE --&gt;\n&lt;tr&gt;\n  &lt;td&gt;\n    &lt;table width=\"368\" border=\"0\" cellpadding=\"6\" cellspacing=\"0\"&gt;\n      &lt;tr&gt;\n        &lt;td class=\"BodyTextBlack\"&gt;\n          &lt;table border=\"0\" cellspacing=\"0\" style=\"border-collapse: collapse\" bordercolor=\"#111111\" width=\"356\"&gt;\n            &lt;tr&gt;\n              &lt;td class=\"BodyTextBlack\" width=\"213\"&gt;Recall Release&lt;/td&gt;\n              &lt;td class=\"BodyTextBlack\" width=\"155\"&gt;&lt;strong&gt;CLASS I RECALL&lt;/strong&gt;&lt;/td&gt;\n            &lt;/tr&gt;\n            &lt;tr&gt;\n              &lt;td class=\"BodyTextBlack\" width=\"213\"&gt;FSIS-RC-068-2012&lt;/td&gt;\n              &lt;td class=\"BodyTextBlack\" width=\"155\"&gt;&lt;strong&gt;HEALTH RISK: HIGH&lt;/strong&gt;&lt;/td&gt;\n            &lt;/tr&gt;\n          &lt;/table&gt;\n\nOoof. I think I’m going to be sick. Sadly, if you are planning to scrape government data, you should expect to be horrified on a regular basis.\n\nGoing Meta\nWhen faced with unpleasant HTML, there is often one other escape we can try before we’re plunged into the muck of nested tables. Many auto-generated pages have meta tags defined, and it can be helpful to look at them to extract the information we need. Sure enough, in a USDA recall, there are the following meta tags:\n\n&lt;meta name=\"description\" content=\"Main Street Quality Meats, a Salt Lake City, UT, is recalling approximately 2,310 pounds of ground beef products that may be contaminated with E. coli O157:H7.\"&gt;\n&lt;meta name=\"keywords\" content=\"food recall, FSIS, beef, 068-2012, ground beef products, E. coli\"&gt;\n\nHere is some code in Nokogiri to pull the summary from the document by using the meta tag:\n\nmeta = @html.xpath(\"//meta[@name = 'description']\")\nsummary_text = meta.first.attributes[\"content\"].to_s unless meta.nil? || meta.first.nil?\n\nunless summary_text.blank?\n  summary_text.squish!\nend\n\nThis code grabs the meta tag summary and uses the String#squish method from ActiveSupport to remove extraneous whitespace in the summary.\n\nBrute Force and REGEXPs\nAlthough each recall is hand-written, they follow enough of a general format that I could build regular expressions to extract what I needed from the document. The document summary is a great source for a lot of the information I need from a recall. Here is that description again that I plucked from the page’s meta tags:\n\n\n  Main Street Quality Meats, a Salt Lake City, UT, is recalling approximately 2,310 pounds of ground beef products that may be contaminated with E. coli O157:H7.\n\n\nOnce you look at several recalls, it’s generally apparent they follow a particular form even though the exact phrasing may vary:\n\n\n  Advance Pierre Foods, an Enid, Okla. 
establishment, is recalling approximately 1,200 pounds of chicken fried chicken breasts that may contain small pieces of metal, the U.S. Department of Agriculture’s Food Safety and Inspection Service (FSIS) announced today.\n\n  Pinnacle Foods Group LLC, a Fort Madison, Iowa, establishment, is recalling approximately 91,125 pounds of a canned chili with beans product because it was inadvertently packaged with an incorrect flag on the plastic over-wrap and may contain an undeclared allergen, wheat, the U.S. Department of Agriculture’s Food Safety and Inspection Service announced today.\n\n  United Food Group, LLC, a Vernon, Calif., establishment, is voluntarily expanding its June 3 and 6 recalls to include a total of approximately 5.7 million pounds of both fresh and frozen ground beef products produced between April 6 and April 20 because they may be contaminated with E. coli O157:H7, the U.S. Department of Agriculture’s Food Safety and Inspection Service announced today.\n\n\nThere are some variations, but it’s clear they generally follow the same format:\n\n\n  COMPANY NAME, from LOCATION, is recalling N VOLUME of PRODUCT TYPE something something REASON something.\n\n\nKnowing this, I devised some regular expressions to extract the fields I needed from the summary.\n\nCompany Name\nThis is pretty simple to figure out.\n\nif !summary.blank? &amp;&amp; summary =~ /^(([A-Z0-9][0-9[:alpha:]\.]+\s*)+)/\n  company_name = $1\nend\n\nThe recall summary always begins with the company name. This regular expression looks for one or more capitalized words at the beginning of the summary. It assumes that is the company name.\n\nProduct Type and Reasons\nAlthough the summaries generally put the reason at roughly the same place, the phrasing is often varied enough that it’s not simple to extract the reason from the text. Given that most of the reasons have their own specific terminology like “E. Coli” or “Salmonellosis”, it’s easier to invert the process and iterate through a list of possible reasons trying their regexps individually against the summary until one matches (otherwise, tag the recall’s reason as “Other”). Here are some typical reasons for a food recall and regexps that might be used:\n\n  Reason: E. Coli\n  Regexp: /\bcoli/\n\n  Reason: Salmonella\n  Regexp: /\bsalmonell/\n  Notes:  Sometimes it’s salmonella or salmonellosis\n\n  Reason: Undeclared Allergen\n  Regexp: /\b(undeclared|allerg)/\n  Notes:  Sometimes the summary may not specify allergies specifically\n\n  Reason: Listeria\n  Regexp: /\b(listeria|listeriosis)\b/\n\n  Reason: Foreign Materials\n  Regexp: /(foreign material)|(may contain (\w+\s)?(pieces of|fragments of)?\s?(glass|metal|plastic))/\n\nAnd so on.
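\n\nIn code, that inverted lookup can be a simple loop over an ordered list of reasons and their patterns. A sketch of my own, with only a few reasons filled in:\n\nREASON_PATTERNS = [\n  ['E. Coli',             /\bcoli/i],\n  ['Salmonella',          /\bsalmonell/i],\n  ['Undeclared Allergen', /undeclared|allerg/i],\n  ['Listeria',            /listeri/i]\n]\n\n# Return the first reason whose pattern matches, falling back to 'Other'.\ndef reason_for(summary)\n  reason, _pattern = REASON_PATTERNS.find { |_r, pattern| summary =~ pattern }\n  reason || 'Other'\nend\n\nreason_for('... may be contaminated with E. coli O157:H7 ...')   # =&gt; 'E. Coli'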
\n\nSimilarly, I also look for specific phrasings to figure out the product type being recalled:\n\n  Product: Ground Beef\n  Regexp:  /ground beef|hamburger/\n\n  Product: Chicken\n  Regexp:  /chicken|wings|poultry/\n\n  Product: Beef\n  Regexp:  /beef/\n  Notes:   This regexp needs to be run after the ground beef one\n\n  Product: Sausage\n  Regexp:  /sausage|chorizo|salami|mortadella/\n  Notes:   Add more sausage types here\n\nGenerally, we will want to be careful of several things when devising regular expressions to fish for matches within text:\n\n\n  Make sure the text you are checking against is small. Running multiple regular expressions against the entire document is slow and also prone to false matches.\n  You will want to make sure your regexps are case-insensitive and can run across line breaks.\n  Sometimes it might be useful to evaluate regexps in order of decreasing specificity. For instance, if you were curious about recalls of ground beef specifically as opposed to recalls of all beef, you’d want to run the more specific regexp first.\n\n\nVolume\nOne particularly fun thing about the USDA data is that many recalls are provided with an estimate of how much meat was affected. This could lead to some stomach-churning statistics, so let’s pull it out too:\n\nINDIVIDUAL_UNITS = %w(unit package packet can jar pint box)\nUNITS = %w(pound case lot carton crate) + INDIVIDUAL_UNITS\nunit_regex = /#{UNITS.join('|')}/\n\nunless self.summary.blank?\n  if self.summary =~ /([\d,\.]+)\smillion\s(#{unit_regex})s?/\n    self.volume_unit = $2\n    self.volume = $1.gsub(',','').to_f * 1_000_000\n  elsif self.summary =~ /([\d,]+)\s(#{unit_regex})s?/\n    self.volume_unit = $2\n    self.volume = $1.gsub(',','').to_i\n  end\nend\n\nHand-Correcting the Data\n\nSo, I was able to take a collection of text recalls and turn them into a database. Time for a victory coffee while the computer parses all of the recalls (easy to do when I’ve saved the HTML locally). And voila! Here are the 10 most recent USDA recalls once they’ve been run through the processor:\n\n  Company                      Product Type    Reason                       Volume\n  Advance Pierre Foods         Poultry         Foreign Materials            1200 pounds\n  Gab Halal Foods              Ground Beef     Salmonella                   550 pounds\n  Stallings Head Cheese Co.    Fish            Salmonella                   4700 pounds\n  Jouni Meats                  Ground Beef     Salmonella                   500 pounds\n  Annie                        Prepared Meals  Other\n  Global Culinary Investments  Poultry         Monosodium glutamate (MSG)   1331 pounds\n  LJD Holdings                 Beef            Listeria                     33500 pounds\n  Glenn                        Ground Beef     E. Coli                      2532 pounds\n                               Prepared Meals  Undeclared Allergen          2764 pounds\n  Stehouwer’s Frozen Foods     Sausage         Undeclared Allergen          6039 pounds\n\nThis is promising, but you might have noticed some gaps in the data here. And other cases where it looks like the regexp fell short. For instance, here is the summary for the “Annie” recall:\n\n\n  Annie’s Homegrown Inc., a Berkeley, Calif. 
establishment, is recalling an undetermined amount of frozen pizzas that may be contaminated with extraneous materials.\n\n\nHere, our regexp for the company name ran headlong into the apostrophe. Time to fix that bug and run the parsing again. I’ve probably gone through 20 different tweaks to some of these regular expressions. Even after that, I found that it was sometimes necessary to just hand-edit the data I extracted from a recall instead of continually tweaking my parsers. To do this, I built an admin to search for recalls and edit them (screenshots attached). This is really easy to do in Rails, which is why I wrote the project in it. It’s important though that hand edits are not overwritten later if I rerun all my regexp data extractors again. This is why I defined an additional parse_state called verified. Once I manually edit a recall, its state is set to verified, and I make sure to only rerun my regexps against Recalls that are just in the analyzed state.
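\n\nThe guard itself is a one-liner in ActiveRecord; reparse! is a hypothetical name for rerunning the extractors:\n\n# Reparse only machine-tagged records; hand-verified ones are left alone.\nRecall.where(parse_state: 'analyzed').find_each do |recall|\n  recall.reparse!   # hypothetical: rerun the regexp extractors\nend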
\n\nInterviewing the Data\nSo, now that I had the data, it was time to ask it questions. In data journalism, we often refer to a process of “interviewing the data.” Let’s take this data out for a spin. While we often approach a data set looking for specific stories, sometimes there are other stories revealed by drilling down in the data.\n\nWhat Are the Biggest USDA Recalls?\nI’m curious, so the first thing I checked was which recalls were the biggest:\n\nSELECT recall_date, reasons.title, food_categories.name, volume, companies.name\nFROM recalls\nINNER JOIN reasons ON reasons.id = recalls.reason_id\nINNER JOIN companies ON companies.id = recalls.company_id\nINNER JOIN food_categories ON food_categories.id = recalls.food_category_id\nWHERE parse_state &lt;&gt; 'rejected'\n  AND volume_unit = 'pound'\n  AND type = 'UsdaRecall'\nORDER BY volume DESC\nLIMIT 15\n\n  Date        Reason               Type            Volume      Company\n  2011-08-03  Salmonella           Poultry         36,000,000  Cargill Meat Solutions Corporation\n  2010-06-17  Underprocessing      Prepared Meals  15,000,000  Campbell Soup Supply Company\n  2012-09-05  E. Coli              Beef             2,500,000  XL Foods, Inc.\n  2012-10-22  Undeclared Allergen  Sausage          1,768,600  BEF Foods Inc.\n  2010-02-04  Salmonella           Sausage          1,240,000  Daniele International Inc.\n  2009-02-04  Salmonella           Poultry            983,700  Chester\n  2010-01-18  E. Coli              Beef               864,000  Huntington Meat Packing Inc.\n  2007-10-06  E. Coli              Ground Beef        845,000  Cargill Meat Solutions Corporation\n  2009-08-06  Salmonella           Ground Beef        825,769  Beef Packers\n  2009-01-30  Foreign Materials    Beef               676,560  Windsor Quality Food Co.\n  2009-06-10  Undeclared Allergen  Poultry            608,188  Pilgrim’s Pride Corp.\n  2009-10-31  E. Coli              Ground Beef        545,699  Fairbank Farms\n  2005-04-12  Undeclared Allergen  Prepared Meals     473,500  Campbell Soup Supply Company\n  2009-07-22  Salmonella           Ground Beef        466,236  King Soopers\n  2004-08-20  E. Coli              Beef               406,000  Quantum Foods\n\nThere are a few repeat offenders in there. Let’s look and see how much Cargill product has been recalled:\n\nSELECT recalls.recall_date, food_categories.name, reasons.title, volume, recalls.title\nFROM recalls\nINNER JOIN reasons ON reasons.id = recalls.reason_id\nINNER JOIN food_categories ON food_categories.id = recalls.food_category_id\nINNER JOIN companies ON companies.id = recalls.company_id\nWHERE parse_state &lt;&gt; 'rejected'\n  AND companies.name LIKE 'cargill%'\n  AND type = 'UsdaRecall'\nORDER BY recalls.recall_date DESC\n\n  Date        Category     Reason      Volume      Title\n  2012-07-22  Ground Beef  Salmonella      29,339  Pennsylvania Firm Recalls Ground Beef Products Due To Possible Salmonella Contamination\n  2011-10-01  Poultry      Salmonella     185,000  Ohio Firm Recalls Chef Salads Containing Meat and Poultry Due to Possible Salmonella Contamination Of Tomatoes\n  2011-09-27  Poultry      Salmonella     185,000  Arkansas Firm Recalls Ground Turkey Products Due to Possible Salmonella Contamination\n  2011-08-03  Poultry      Salmonella  36,000,000  Arkansas Firm Recalls Ground Turkey Products Due to Possible Salmonella Contamination\n  2010-08-28  Ground Beef  E. Coli          8,500  Pennsylvania Firm Recalls Ground Beef Products Due to Possible E. coli O26 Contamination\n  2007-10-06  Ground Beef  E. Coli        845,000  Wisconsin Firm Recalls Ground Beef Products Due to Possible E. coli O157:H7 Contamination\n\nDrilling down within the data reveals that the Cargill problems have been at different locations. But the Arkansas plant has been the largest offender. Reading the text of the 8/3/2011 recall reveals it was triggered by an outbreak that hospitalized 22 people and killed 1 person. The 36 million pounds of turkey were produced over a six-month period. Does this mean that all the turkey being recalled was tainted? It’s hard to say. How much of Cargill’s total output from that plant was affected by that recall? It’s hard to say. How did the USDA narrow down the outbreak to that source? The data doesn’t tell us. These are intriguing details that tell part of a story, but often we’ll have to look at other datasets, documents, or sources to figure out the full story. Even if the data seems enough to tell the story, we’d want to verify against data and sources outside of it.\n\nHow Many Pounds of Beef Get Recalled Each Year?\nSo, how have efforts to fight E. Coli in the food supply been going? 
We can look at the data and see.\n\nSELECT YEAR(recalls.recall_date), count(*), sum(volume) AS pounds\nFROM recalls\nINNER JOIN reasons ON reasons.id = recalls.reason_id\nINNER JOIN food_categories ON food_categories.id = recalls.food_category_id\nWHERE parse_state &lt;&gt; 'rejected'\n  AND food_categories.slug IN ('ground-beef', 'beef')\n  AND reasons.slug = 'ecoli'\n  AND type = 'UsdaRecall'\n  AND volume_unit = 'pound'\nGROUP BY YEAR(recalls.recall_date)\nORDER BY YEAR(recalls.recall_date) DESC\n\n  Year  Recalls  Volume (lbs)\n  2013        2         3,792\n  2012        6     2,563,467\n  2011       13       773,799\n  2010        9     1,150,647\n  2009        9       804,804\n  2008        8     2,157,497\n  2007       11     1,247,385\n  2006        6        21,328\n  2005        1        63,850\n  2004        5       668,335\n\nIt’s very easy to group columns and derive tables like this from data using SQL. You could easily envision this as a source for a story on food safety arguing the problem has not gotten better. But, it’s painfully easy to jump to the wrong conclusions when using this data for reporting. There are several big ways just presenting this table as journalism can go wrong:\n\n\n  Why are all the years before 2007 so sparse? Was that a golden age of food safety or is there something wrong with our data? (There was something wrong with the data parsing, actually.)\n  Precision can be deceptive. It looks like we can say down to the pound how much meat was recalled in each year. But that number is bogus, since it’s a sum of fuzzy numbers from large recalls (e.g., “approximately 1.4 million”) and precise numbers from small recalls. When presenting totals like this, it’s better to forcibly round to fuzzier volumes, since higher precision suggests our data is more exact than it actually is (see the sketch after this list).\n  Double-counting is a problem. Companies will sometimes issue revised recalls with expanded product lists and new volume estimates. That this happens with very large recalls makes the possibility for major error even worse. If I wanted to report these trends, I’d have to double-check for duplicates.\n  Averages are even more deceptive. We might be tempted to view the recall trend each year by averaging the volume over the number of recalls in a given year. This is an even fuzzier number though. The problem is that recall volume doesn’t necessarily follow a random distribution. There is a power law in effect where a few single recalls are responsible for the bulk of the recall volume, making a measure of the average case pretty ludicrous.\n  Volumes for a single recall can be dizzying. But without knowing the total production volume from a facility, it’s hard to say how endemic the problems are. Similarly, not all meat recalled is necessarily tainted; it just might be.\n  The volumes for some years are dizzying too, but trends based on absolute values can be problematic as well. For instance, 2012 might not be considered a worse year than 2010 if there was twice as much ground beef produced in 2012.\n  “Worse” is a loaded term. 2012 has a much larger recall volume than 2011. Does this mean that 2011 was a safer year than 2012? Or does it just mean that food outbreaks in 2011 were not traced back to their sources? It’s important to note what the data doesn’t include. Recalls are issued for food sold to the general public in stores. Fast food restaurants and public school cafeterias have their own supply networks, and they will not issue recalls if they notice problems from a supplier.
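\n\nAs promised in the list above, here is one way to forcibly round a total to a fuzzier volume. This is a sketch of my own, not code from the project:\n\n# Round a summed volume to two significant figures so the total\n# doesn't pretend to be more precise than its fuzziest inputs.\ndef fuzzy_volume(pounds)\n  return pounds if pounds.zero?\n  scale = 10 ** (Math.log10(pounds.abs).floor - 1)\n  (pounds.to_f / scale).round * scale\nend\n\nfuzzy_volume(2_563_467)   # =&gt; 2600000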
\n\nThat’s a pretty big list of caveats there. I’m not trying to discourage you from working with data. We just have to be careful and remember that we are trying to use the data to report the truth. This means we have to be skeptical of the data and never promise more than it can deliver.\n\nWhat We Can’t Learn from the Data\nUnfortunately, food recalls reveal only so much about food safety. It’s always important to investigate outside the dataset to find what it lacks. For instance, the Centers for Disease Control (CDC) estimates that norovirus is responsible for 34% of recorded outbreaks, but there is only a single food recall that mentions norovirus. Food recalls can be triggered by food poisoning outbreaks, but they are also often triggered by random inspections unrelated to reported illnesses, by state statutes, or by manufacturing problems - a large number of “undeclared allergen” recalls happen because a single batch of a product is put in the wrong box.\n\nWe could look at the CDC’s data on food outbreaks, but that has its own limitations we’d also have to check. Ultimately, some aspects of the problem might be unknowable. It’s hard enough to get a total view of a subject by collecting datasets; for instance, campaign finance data and TV ad spending give us additional insight into presidential elections, but imagine if they were the only way to report the story? Food safety is especially murky. Unless they result in hospitalizations or deaths, most outbreaks are not reported, because it’s often hard to say whether that queasy stomach is from the takeout you got last night or the “stomach flu” that’s going around. And only a small amount of food is preemptively inspected by food and health agencies.\n\nThis doesn’t mean we should give up. Indeed, there are still plenty of interesting things to explore in the food recalls data. But it’s an easy trap when working with any dataset to think it’s all you need to understand the story when the data itself reflects external limitations and assumptions you aren’t necessarily aware of. Always make time to figure out what you can’t figure out with the data.\n\nPlease Solve Me\nSo, there you have it. A simple account of how I wrested some data on food safety from the raw text of food recalls. It wasn’t pretty, but it worked. There are things that could be done better, for this and similar problems where we have to find data in freeform text. That’s where you come in. I want to inspire you to get excited about solving these problems journalists have in working with large bodies of data to get important stories out of them. You can start here.\n\nA Good Consumer Tool for Food Recalls\nThe majority of food recalls involve food sold at grocery stores. Many stores will be attentive about pulling recalled products and putting signs in the store, but they can’t contact you at your home to let you know that box of ravioli in your freezer was recalled a few months ago. Recalls do provide some interesting information for consumers. USDA recalls provide package labels and retail locations; FDA recalls often provide labels and UPC codes. It seems like it could be possible to create a helpful app for consumers who want to be informed about recalls. 
It wouldn’t be necessary to scan barcodes and track inventory; just letting me know the recalls that might affect me as a Trader Joe’s and Safeway shopper in Maryland would be enough.\n\nBeyond Regular Expressions\nIt should be obvious by now how crude regular expressions can be for understanding the contents of recalls. The approach only works as well as it does because recalls tend to follow standardized patterns and the vocabulary of reasons and food categories is specialized enough. The regular expressions look for matching words, but they don’t understand what the text says. An approach using natural-language processing might work better, especially since the opening sentence for most recalls involves the same clauses in the same order. Natural language approaches might also be more robust than using meta tags to find some data; in older recalls, these were often blank or sometimes even for the wrong recall and had to be manually corrected.\n\nCompany names are the biggest source of confusion and duplication, however. There are at least four variations on “Cargill” in the recalls database that reflect different divisions and locations of meat processing plants. Some sort of mechanism for normalizing corporate names might help to better identify repeat offenders with issues at multiple locations. The OpenCorporates API seemed like a strong possibility, but using corporate registrations doesn’t seem to help with duplication and obfuscation. We’d also want to scope to only companies working in the food sector (maybe using NAICS codes?). Whatever we used would have to work for everything from large multinational conglomerates down to small delis recalling a single batch of premade meals. So, it would probably be bespoke, but this problem of normalizing company names happens a lot in datasets, so it would be great to improve what is out there.\n\nModeling the Supply Chain\nOne interesting aspect of food recalls is that they inadvertently reveal hidden connections in the global food supply chain. A recall is usually issued by a single company, but there are often many companies involved playing various roles:\n\n\n  The company issuing the recall\n  The recalling company may be an importer, with the manufacturer being a separate company\n  Distributors and institutional suppliers\n  Suppliers. When a major producer of a component like roasted peanuts or processed meat issues a recall, there can be many dozens of recalls made by companies downstream.\n  Grocery stores. Sometimes they are just selling the recalled product. Sometimes the recalled product is a store brand produced by another company.\n\n\nA web of related recalls wouldn’t by itself comprehensively illustrate the modern food supply chain, but there are potentially interesting stories to be told in there about how food is produced these days. We just need tools to find those stories and visualizations to show them to journalists and the general public.\n\nGet to work!"
        },
        {
          "id": "published-election-loader",
          "title": "The New York Times' Election Loader",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/election-loader",
          "content": "This is an article about a very specific but important part-the election results loader-of the New York Times’ elections coverage online. The initial election results loader was written mostly by Ben Koski and me in the run-up to the general election of 2008. It also served up the election results for the midterms of 2010. This past year, I modified it further to support multiple election dates and some of the unique quirks of each state’s primaries. In the homestretch, I also got some much-needed help from Brian Hamman, Jacqui Maher, Michael Strickland, and Derek Willis. Of course, there are so many more people who worked on the Times’ election site, but their work deserves a better chronicler than I. (Besides, I’m sure all of you will find election loading as intensely riveting as I do.)\n\n\n    \n        \n            \n            The race-monitoring and calling interface for Ohio\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nLike every other news organization on the planet, we get our election results data from the Associated Press. Google may have had some success providing results from some early caucuses, but nobody matches the depth of AP’s operation. There are roughly two tiers of customers for the AP Election Results: the TV networks who pay lavishly for the timeliest results, and the other news organizations like newspapers that get a slightly time-delayed version via FTP.\n\nFor us, the AP provides an FTP server where they update election files on a regular basis. There are three files that specify the election results in a state:\n\n  The Races file specifies basic race metadata (i.e., this is the Ohio House 7th District Republican Primary) and records how many precincts are reporting at any given point in the night. This file includes both statewide and county-level results (i.e., the Presidential results for Ohio and the Presidential results for Cuyahoga County, Ohio).\n  The Candidates file provides a list of AP candidates and their identifiers.\n  The Results file provides the vote totals for each candidate in a race.\n\n\nThe Candidates file is largely fixed by the time an election occurs in a given state, so it only needs to be loaded once at the beginning of a cycle. Still, this means loading 102 files on a general election night (two for every state and the District of Columbia) containing some 50,000 state and county races. We wanted to run a load roughly once a minute, and just grabbing all those files from the FTP server takes approximately 30 seconds… so clearly, we needed to be fast with everything else. This constraint shaped a lot of our design, but the resulting code enabled several powerful features down the road.\n\nIt all starts with race changes.\n\nTrack What Changes\n\n\n    \n        \n            \n            A view of terminal output of the loader’s command-line execution. I liked to color the production terminal in a special color to distinguish it.\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nOnce loaded, the AP files provide a complete representation of the election at that particular moment in  time. But usually 10% or fewer of those races change between one load and the next. We were concerned that it would unduly stress the database to reload rows of unchanged data within transaction during each load. So, we decided to figure out which races change on any given load, and then only update those. We do this by creating parallel staging and production tables for races and results. 
\n\nWhat marks a race change? Generally, there are not a lot of situations we need to check for. A race or result that is not found in the production table is an obvious example of a race change (i.e., the initial load). But when the loader has been running for a while, a change is usually triggered by one of the following situations:\n\n  The number of precincts reporting has changed\n  The number of votes for a candidate has changed\n  A candidate has been declared the winner\n  A call for a candidate has been retracted\n\n\nThere are a few other special cases in there (for instance, when we manually call races at the Times, we pick that up as a race change even if none of the other change conditions were triggered), but these change conditions are pretty simple to specify and really fast to check for in SQL.\n\nIt’s a little more complex than just loading the AP race data directly into production tables. But it gives us a real speed boost when not much has changed. And it turned out that it made some other cool features simple to implement.\n\nTrade Abstractions For Speed\nObject-Relational Mappers are a standard component of most web frameworks these days. They simplify the process of working with databases by mapping SQL records into objects that can be manipulated directly in code. Our team at the New York Times uses Rails, which includes the ActiveRecord ORM layer. This makes it simple for us to load state races when it’s time to render a page or respond to an API request. But that abstraction adds performance costs that we sometimes had to bypass in the name of speed. For instance, we know a race has been called by the AP when one of its corresponding results has its winner field set to ‘X’. While this is easier to code in Rails, the performance costs of marshaling objects make an ActiveRecord implementation glacial compared to a reasonably complex SQL UPDATE statement. We let SQL do what it does best whenever it makes sense to use it.\n\nThus, almost every subroutine of our loading process is written in SQL. However, SQL is notoriously obtuse once you add a few JOINs to a query. How do you avoid the almost-certain path to madness? You write a lot of unit tests you can run regularly to ensure the code is correct. We actually wrote the tests first as we developed parts of the loader and added tests with any new features. There are more than 500 tests for the loading system. There could probably be more.
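\n\nFor a taste of what that looks like, a race call might be propagated in a single statement along these lines (again a sketch under assumed table and column names, not the real code):\n\n# Propagate AP race calls in one pass, letting the database do the work\n# instead of marshaling thousands of ActiveRecord objects.\nActiveRecord::Base.connection.execute(\n  \"UPDATE races r\n   INNER JOIN results x ON x.race_id = r.race_id\n   SET r.called = 1\n   WHERE x.winner = 'X'\"\n)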
\n\nThe Joy of Exaptations\nExaptation is a term from evolutionary biology that describes how a trait evolved for one purpose can be co-opted for something else. The feather is a classic example of this; dinosaurs likely first evolved them for insulation only to later use them for stability and eventually flight. In a similar fashion, we found there were several delightful features that were developed as exaptations of our change-driven loading cycle.\n\nLet’s start with the staging tables. Loading the AP data into staging first gives us an easy place to do basic sanity checking of the data and easy error recovery if a single state fails to load. More importantly, it allows us to forcibly zero out races before copying them to production if we need to. One of the quirks of the AP service is that they run tests and live data on the same servers, often waiting until the morning of an Election Day to start “moving zeroes.” We often want to set up race result pages before the AP is ready, but we most definitely do not want to run the risk of posting test data anywhere public. We do this by loading the data in staging and updating a few things like vote counts and precincts reporting to be zero (remember, SQL is really good at mass assignments).\n\nHaving a table that records which races have changed on each load turns out to be very useful indeed. We use it in our internal race-calling interface; AJAX requests check for changed races and update the results without reloading the page. We use it to optimize our publisher to bake out new static versions of only the races that have changed. We used it to record a detailed log of the race changes for key races, so we could chart the changes in vote margins during the night. We used it during the primaries to send emails whenever delegate allocations changed.\n\nThe Importance of Naming Things\nFairy tales have it right: names are power. In order to find a race in the database, you have to know what to call it. The AP does assign numeric IDs for each race, but these are generally not guaranteed in advance and may even be reused over the course of a year (which is why we append a YYYYMMDD timestamp to IDs in our database). For instance, to find the New York Presidential Republican Primary, you could look for the race with the following conditions: {state_id: 'NY', office_id: 'P', race_type_id: 'R'} (republican primary). Change the office_id to ‘H’, the race_type to ‘G’ and add a seat_number of 2 and you find the general election for the NY-2 house district. Generally, elections are consistent enough that you can easily figure out how to find a race you are looking for. Except when they aren’t.\n\nSpecial elections usually muddle things up. This year, there were two elections-one to fill out the remainder of this term, one for the next term-with primaries and a general election for Gabby Giffords’ AZ-8 house seat; both would show up on a search for {office_id: 'H', race_type: 'G'}. Scott Walker’s recall election in Wisconsin was coded as a general election by the AP, but it shouldn’t show up in the governor race results on November 6th. Even regular elections have their complications. For instance, California switched to open primaries this year, where everybody runs and the top two vote-getters advance to the general election regardless of their party (that’s race_type_id: 'X' of course). A week before their primaries, we learned about Ohio’s interesting presidential primary process: every voter votes for a delegate from their congressional district and a statewide at-large delegate. The AP reports this as 17 distinct races for the Republican presidential primary in Ohio. We only want one. These are just a few of many examples. Every state has its own edge cases. 
This once meant that logic for handling those edge cases wound up duplicated in all the front-end applications that used the AP election results. What a mess.\n\nThis year, I decided to try a different approach and added a layer of abstraction: a mechanism for mapping our own NYT race slugs onto AP races. Thus, we map ny-house-district-2-2012-general to the AP fields {race_type_id: 'G', office_id: 'H', seat_number: '2', state_id: 'NY'}. If these conditions match a single race, then we have a successful mapping and can store the AP race ID in the table, binding our slug to that race. Unlike an AP identifier, it’s easy to derive the NYT slug for a race. For cases where the mapping fails because it matches too many races, we can add additional fields to constrain to a single race or manually fix the AP race ID in the database if worse comes to worst.\n\nThat is what happened with Ohio. Before the first caucus in Iowa, I had autopopulated NYT race mappings for all the presidential primaries (Derek Willis maintains the NYT politicians/races API and was my steadfast slugmaster for the entire election year). When we started testing for the Ohio primaries, I noticed it was mapping to 17 races. Instead of having to alert all the other developers to patch their code, I just added an additional constraint, mapping oh-president-2012-primary-rep to {state_id: 'OH', office_id: 'P', race_type_id: 'R', seat_name: 'Delegate-at-Large'} and thus binding the NYT concept of the Ohio Republican primary to the statewide delegate-at-large race in the AP. Brief panic averted, I could sip my victory coffee.\n\nThis approach is a specific example of a general strategy when working with third-party APIs: place an abstraction layer between your code and theirs. Downstream users of the AP election data should not use any of the AP’s codes or identifiers, letting our intermediary mapping layer do the translation. This approach also provides a nice place to anchor other exaptations where we need to enhance or override AP data. For instance, we have some copy-editing differences on candidate names and ballot initiative titles. We also need to track the incumbent parties of major races so we can calculate gains for each party. The AP does not provide this, but it was trivial to add to the nyt_races table. This even solved a general problem that bothered us on prior elections: how do we mark races that we are interested in showing on the site? The AP election data contains everything from presidential races to town aldermen; we want to only present a subset on the site. Just having a NYT race mapping is a mark that it’s a race we care about.
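Here is a minimal sketch of how such a mapping check might look in Rails. The model names (ApRace, NytRace) and exact fields are illustrative, not our production code:\n\n# Hypothetical sketch; model and field names are illustrative.\nconditions = { :race_type_id =&gt; 'G', :office_id =&gt; 'H',\n               :seat_number =&gt; '2', :state_id =&gt; 'NY' }\n\nmatches = ApRace.where(conditions)\nif matches.count == 1\n  # Exactly one AP race fits: bind our slug to its AP ID.\n  NytRace.create!(:slug =&gt; 'ny-house-district-2-2012-general',\n                  :ap_race_id =&gt; matches.first.id)\nelse\n  # Zero or several matches: add more constraints (as we did for Ohio)\n  # or fix the AP race ID by hand in the database.\n  raise \"ambiguous mapping: #{matches.count} AP races matched\"\nend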
Election Results As a Service\nElection results are just one part of the complete New York Times election site, but they require a large amount of logic and models to support. In the past, we had tried some awkward approaches where two applications would share the same database and we would copy over models for working with the AP results to the election application. That approach created a lot of organizational headaches, however. For this election, we decided to take the bolder step of keeping the election_results application completely separated from the application powering the election site, providing data only through a JSON API. In other words, we built Election-Results-as-a-Service (ERaaS).\n\nThis approach worked much better. But it requires some additional effort and good communication between the creator and consumers of the API to work. You want to avoid situations where developers who will be using the API are forced to wait for you to implement it before they can start working. Before we built a major API endpoint, we would often manually write the JSON we expected it to produce for a single call. This could then be loaded on the client side by a simple stub, letting the API users wire up their pages against real data while the service was being built. In addition, we set up the election_results application on a staging server. When developers worked on the election site on their laptops, the development code would default to making API calls against the staging server (API calls could be made against localhost or even production API servers by setting environment variables locally). This allowed new developers to get working on the election site without also having to set up election_results locally on their machines.\n\nElection results are highly nested entities. Each state race usually has multiple results for each candidate and some other associated data. When you want to render something complex like a results page for a state, you have a choice: do you make many small queries or one large query that encompasses the data you need? The REST approach for APIs jibes well with the resource approach in Rails and argues for many smaller API requests, both for architectural coherence and execution speed. But that puts a lot of work on the clients to figure out what they need and fetch it efficiently in parallel, so we decided instead to go with large API responses that encompass everything a page needs. So, when you look at a page like the Senate Big Board, it’s built from two API requests: one to render the tally at the top and one that provides details of every race. In many cases, these API requests could take a long time to execute. Downstream web caches like Varnish can smooth traffic a bit, but you still need cache misses to execute efficiently. I spent some time fretting about JSON generation performance, and then I cheated. In most requests like “give me all major races for California,” the API response is essentially an array of JSON representations of individual races (although it may have some metadata up top). I can already use the nyt_races table to get the list of races to render, and I could use that to pull the race data from the AP tables every time I need to render the JSON. Instead, I use the NYT race ID as part of a key for storing a cached representation of the race JSON. And since I know when a race changes, I can keep the cached JSON indefinitely and regenerate it as part of the loading cycle. Through further exaptations I was able to cut down a 10-second API request to 10 milliseconds. Cheating never felt so good.
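A minimal sketch of that caching scheme - the cache key scheme, models and associations here are hypothetical stand-ins:\n\n# Hypothetical sketch; key scheme and model names are illustrative.\n# Rendering: serve each race's JSON straight from the cache.\ndef race_json(nyt_race)\n  Rails.cache.fetch(\"race-json/#{nyt_race.id}\") do\n    nyt_race.ap_race.to_json   # only runs on a cache miss\n  end\nend\n\n# Loading cycle: regenerate cached JSON only for the races that changed.\nRaceChange.where(:load_id =&gt; current_load.id).each do |change|\n  race = NytRace.find_by_ap_race_id(change.race_id)\n  Rails.cache.write(\"race-json/#{race.id}\", race.ap_race.to_json)\nend\n\nAn API response for a state then becomes mostly concatenation of already-rendered races, which is where the drop from 10 seconds to 10 milliseconds comes from.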
Admins for Each Audience\n\nA more refined view of the loading process as shown in the admin for the Election Loader. Gray meant the state file was unchanged, black meant the file changed but it didn’t have changes for any races we tracked, green meant there were changes for that state and red meant there were errors.\n\nSo far, I’ve discussed the election_results loader and how we share results data with the election site. There’s one other component of the election_results application that bears mentioning: the internal admin screens used by the newsroom. The most important of these was the calling interface. The AP makes its own race calls, but we prefer to manually call major races (while happily autocalling minor races by following the AP). While our operation will never approach the scale of the networks’ heavily staffed calling desks, we were still able to effectively call all the primaries and the general election through a view of the election data used only by the two editors who made all of the calls. Similarly, on election night, another team used a custom admin tool to record the race calls made by the networks (so we could show them as part of our coverage). Another admin existed for editing race mappings and copy-editing names.\n\nAnd of course, there were other applications with their own admins. At various points during the election cycle, I tweeted screenshots of our admin screens. This is not because they were always beautiful (though I am proud of the calling interface), but because there is something right about showing the workings of the mechanism, even when it’s not working perfectly. It’s a bit like revealing how the magic trick worked or giving a tour of the tunnels under Disneyland. Which is why I’m excited about Source and honored to have contributed a piece to it."
        },
        {
          "id": "published-word-clouds",
          "title": "Word Clouds Considered Harmful",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/word-clouds",
          "content": "In his 2003 novel Pattern Recognition, William Gibson created a character named Cayce Pollard with an unusual psychosomatic affliction: She was allergic to brands. Even the logos on clothing were enough to make her skin crawl, but her worst reactions were triggered by the Michelin Tire mascot, Bibendum.\n\nAlthough it’s mildly satirical, I can relate to this condition, since I have a similar visceral reaction to word clouds, especially those produced as data visualization for stories.\n\nIf you are fortunate enough to have no idea what a word cloud is, here is some background. A word cloud represents word usage in a document by resizing individual words in said document proportionally to how frequently they are used, and then jumbling them into some vaguely artistic arrangement. This technique first originated online in the 1990s as tag clouds (famously described as “the mullets of the Internet”), which were used to display the popularity of keywords in bookmarks.\n\nMore recently, a site named Wordle has made it radically simpler to generate such word clouds, ensuring their accelerated use as filler visualization, much to my personal pain.\n\nSo what’s so wrong with word clouds, anyway? To understand that, it helps to understand the principles we strive for in data journalism. At The New York Times, we strongly believe that visualization is reporting, with many of the same elements that would make a traditional story effective: a narrative that pares away extraneous information to find a story in the data; context to help the reader understand the basics of the subject; interviewing the data to find its flaws and be sure of our conclusions. Prettiness is a bonus; if it obliterates the ability to read the story of the visualization, it’s not worth adding some wild new visualization style or strange interface.\n\nOf course, word clouds throw all these principles out the window. Here’s an example to illustrate. About six months ago, I had the privilege of giving a talk about how we visualized civilian deaths in the WikiLeaks War Logs at a meeting of the New York City Hacks/Hackers. I wanted my talk to be more than “look what I did!” but also to touch on some key principles of good data journalism. What better way to illustrate these principles than with a foil, a Goofus to my Gallant?\n\nAnd I found one: the word cloud. Please compare these two visualizations - derived from the same data set - and the differences should be apparent:\n\n\n  Mapping a Deadly Day in Baghdad from The New York Times\n  word cloud of titles in the Iraq war logs from Fast Company\n\n\nI’m sorry to harp on Fast Company in particular here, since I’ve seen this pattern across many news organizations: reporters sidestepping their limited knowledge of the subject material by peering for patterns in a word cloud - like reading tea leaves at the bottom of a cup. What you’re left with is a shoddy visualization that fails all the principles I hold dear.\n\nFor starters, word clouds support only the crudest sorts of textual analysis, much like figuring out a protein by getting a count only of its amino acids. This can be wildly misleading; I created a word cloud of Tea Party feelings about Obama, and the two largest words were implausibly “like” and “policy,” mainly because the importuned word “don’t” was automatically excluded. (Fair enough: Such stopwords would otherwise dominate the word clouds.) A phrase or thematic analysis would reach more accurate conclusions. 
When looking at the word cloud of the War Logs, does the equal sizing of the words “car” and “blast” indicate a large number of reports about car bombs or just many reports about cars or explosions? How do I compare the relative frequency of lesser-used words? Also, doesn’t focusing on the occurrence of specific words instead of concepts or themes miss the fact that different reports about truck bombs might use the words “truck,” “vehicle,” or even “bongo” (since the Kia Bongo is very popular in Iraq)?\n\nOf course, the biggest problem with word clouds is that they are often applied to situations where textual analysis is not appropriate. One could argue that word clouds make sense when the point is to specifically analyze word usage (though I’d still suggest alternatives), but it’s ludicrous to make sense of a complex topic like the Iraq War by looking only at the words used to describe the events. Don’t confuse signifiers with what they signify.\n\nAnd what about the readers? Word clouds leave them to figure out the context of the data by themselves. How is the reader to know from this word cloud that LN is a “Local National” or COP is “Combat Outpost” (and not a police officer)? Most interesting data requires some form of translation or explanation to bring the reader quickly up to speed; word clouds provide nothing in that regard.\n\nFurthermore, where is the narrative? For our visualization, we chose to focus on one narrative out of the many within the Iraq War Logs, and we displayed the data to make that clear. Word clouds, on the other hand, require the reader to squint at them like stereograms until a narrative pops into place. In this case, you can figure out that the Iraq occupation involved a lot of IEDs and explosions. Which is likely news to nobody.\n\nAs an example of how this might lead the reader astray, we initially thought we saw a surprising and dramatic rise in sectarian violence after the Surge, because the word “sect” was appearing in many more reports. We soon figured out that what we were seeing had less to do with violence levels and more to do with bureaucracy: the adoption of new Army rules requiring the reporting of the sect of detainees. Of course, the horrific violence we visualized in Baghdad was sectarian, but this was not something indicated in the text of the reports at the time. If we had visualized the violence in Baghdad as a series of word clouds for each year, we might have thought that the violence was not sectarian at all.\n\nIn conclusion: Every time I see a word cloud presented as insight, I die a little inside. Hopefully, by now, you can understand why. But if you are still sadistically inclined enough to make a word cloud of this piece, don’t worry. I’ve got you covered."
        },
        {
          "id": "published-using-varnish",
          "title": "Using Varnish So News Doesn't Break Your Server",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/using-varnish",
          "content": "A month ago, the City Room blog ran a call for photos, asking readers to submit their photos of New York City’s waterfront. Thanks to the Stuffy photo submission system, The Times can run these requests for user photos on a regular basis, and I generally remain unaware of them. In this case, though, my colleague and I were testing out a general traffic monitoring screen we’re calling “the big board.” At that moment, a web producer placed a Flash promo in the center well of the NYTimes.com home page. The promo pulled a data file directly from our application servers instead of using a static version saved to the Content Delivery Network.\n\n\nA screenshot of the “big board” in action. The blue represents cached web traffic served by Varnish, red is uncached requests going directly to our server and yellow indicates HTTP 500 server errors. On election nights, the blue part would be 100s of requests a second.\n\nThis mix-up resulted in a sudden burst of more than 300 requests per second to our machines - and we saw this burst all too clearly on our screens, because every home page load was hitting our machines (thankfully, it was a Friday afternoon and not election night). Three months earlier, I would have been swearing profusely at this point, trying to spin up new servers in time to avoid watching all my application servers groan and die. But that day I watched as all of the application servers remained unperturbed. The difference? Varnish.\n\nVarnish: Cache Power\nVarnish is an HTTP cache. Simply put, it sits between your web servers and the outside world (we also have a few load balancers in the mix) and looks for HTTP Cache-Control headers in the responses returned from your applications. If Varnish sees something like this Cache-control: public, max-age=300 - it knows it can cache that page for 5 minutes. When any other requests come in for that page in that window, it serves them directly from the cache. That means your web servers see less traffic and your scalability goes through the roof.\n\nFurthermore, when the cached entry expires, Varnish is smart enough to condense multiple simultaneous requests for the pager into a single back-end request, avoiding the dog-pile effect of stampeding requests on a cache bust. Finally, Varnish allows us to also delete cached pages by regexp patterns, meaning we can explicitly clear part of the cache when deploying a new version of an application.\n\nWe used to cache pages by using Rails’ page caching or by “baking” pages on a regular schedule.The problem with such approaches is that cache clearing requires file-system commands; Varnish does it instantaneously. Add in support for [“saint modes”[(https://varnish-cache.org/wiki/VCLExampleSaintMode) that can tolerate back-end downtime, edge-side includes and a dynamic configuration language, and you have a pretty powerful piece of middleware.\n\nWho Can Use Varnish?\nWeb caching is not for everybody. If your site serves unique pages to each user, Varnish is not the best fit for you (although you could use edge-side includes to cache most of a page). Varnish is a natural fit for us because our content is well suited for high cache hit rates. If you are serving wildly dynamic content, Varnish’s HTTP caching layer is not for you. But most of our content is the same for all readers, and some of it never changes once it is published (we can cache document viewer pages for weeks in Varnish). 
The Varnish Configuration Language\nThe Varnish Configuration Language (VCL) is both powerful and maddening, allowing you to control the behavior of Varnish for web requests in very fine ways, but also forcing you to work with an obscure syntax with limited capabilities (the truly adventurous can add C extensions). But it does let us do some pretty neat things on top of a base Varnish configuration.\n\nVCL models a web request through the cache with a series of callbacks, of which two are the most important. The first of these is vcl_recv, which is invoked to process incoming web requests to Varnish.\n\nHere’s an example of how we use it:\nsub vcl_recv {\n    # Use HAproxy as back end for all requests\n    set req.backend = backend_director;\n\n    # Pass any requests that Varnish does not understand straight to the back end.\n    if (req.request != \"GET\" &amp;&amp; req.request != \"HEAD\" &amp;&amp;\n        req.request != \"PUT\" &amp;&amp; req.request != \"POST\" &amp;&amp;\n        req.request != \"TRACE\" &amp;&amp; req.request != \"OPTIONS\" &amp;&amp;\n        req.request != \"DELETE\") {\n      return(pipe);\n    }     /* Non-RFC2616 or CONNECT which is weird. */\n\n    # Pass anything other than GET and HEAD directly.\n    if (req.request != \"GET\" &amp;&amp; req.request != \"HEAD\") {\n      return(pass);      /* We deal only with GET and HEAD by default */\n    }\n\n    # Allow expired objects to be served for 10m\n    set req.grace = 10m;\n\n    # Stripping certain params\n    # x - from clicking on a submit image\n    # y - from clicking on a submit image\n    if (req.url ~ \"\\?\") {\n       set req.url = regsub(req.url, \"\\?(api\\-key|ref|scp|sq|st|src|x|y)(\\=[^&amp;]*)?\", \"?\");\n       set req.url = regsuball(req.url, \"&amp;(api\\-key|ref|scp|sq|st|src|x|y)(\\=[^&amp;]*)?(?=&amp;|$)\", \"\");\n       set req.url = regsub(req.url, \"\\?&amp;\", \"?\");\n       set req.url = regsub(req.url, \"\\?$\", \"\");\n    }\n\n    # Override default behavior and allow caching for requests w/ cookies\n    if ( req.http.Cookie ) {\n       return (lookup);\n    }\n}\n\nThe other important method is vcl_fetch, which is triggered on responses from the back end (in the case of cache misses). The example below also illustrates using C extensions in a VCL. By default, Varnish just follows the same timeouts specified in the Cache-Control directive for downstream browsers. However, we have many cases where we want to keep something in Varnish for a long time, but still tell the downstream browser to cache for a short period. So, our VCL looks for a special X-VARNISH-TTL header in responses from our web applications. If it finds that, it uses that for the TTL; otherwise, it falls back to the Cache-Control header.
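Producing that header on the application side is a one-line affair in Rails; a hypothetical sketch:\n\n# Hypothetical sketch: tell Varnish (and only Varnish) to cache for a day.\n# Our vcl_fetch below reads this header for its TTL and then strips it, so\n# browsers still see whatever Cache-Control the action already set.\nresponse.headers['X-VARNISH-TTL'] = '86400'   # TTL in seconds (24 hours)\n\nAnd here is the vcl_fetch that consumes it: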
sub vcl_fetch {\n    set beresp.grace = 2m;\n\n    # Process ESIs if X-RUN-ESI is set. This will be stripped before being sent down to client.\n    if ( beresp.http.X-RUN-ESI ) {\n        esi;\n        remove beresp.http.X-RUN-ESI;\n    }\n\n    # cache 404s, 301s and 500s for 1 minute\n    if (beresp.status == 404 || beresp.status == 301 || beresp.status == 500) {\n       set beresp.ttl = 1m;\n       return (deliver);\n    }\n\n    # If X-VARNISH-TTL is set, use this header's value as the TTL for the varnish cache.\n    # Expires, cache-control, etc. will be passed directly through to the client\n    # Cribbed from //www.lovelysystems.com/configuring-varnish-to-use-custom-http-headers/\n    if (beresp.http.X-VARNISH-TTL) {\n      C{\n        char *ttl;\n        /* first char in third param is length of header plus colon in octal */\n        ttl = VRT_GetHdr(sp, HDR_BERESP, \"\\016X-VARNISH-TTL:\");\n        VRT_l_beresp_ttl(sp, atoi(ttl));\n      }C\n      remove beresp.http.X-VARNISH-TTL;\n      return (deliver);\n    }\n\n    # If response has no Cache-Control/Expires headers, Cache-Control: no-cache, or Cache-Control: private, don't cache\n    if ( (!beresp.http.Cache-Control &amp;&amp; !beresp.http.Expires) || beresp.http.Cache-Control ~ \"no-cache\" || beresp.http.Cache-Control ~ \"private\" ) {\n      return (pass);\n    }\n}\n\nEdge-Side Includes\nFinally, a word about edge-side includes (ESIs). Long a feature of content-delivery networks like Akamai, edge-side includes allow your web apps to specify parts of a page to be stitched in by the cache and delivered downstream to the user. This allows you to break down complex pages into simpler actions. For instance, the Congress votes overview page has these ESI directives for the sidebars on the right:\n\n&lt;div id=\"party\"&gt;&lt;esi:include src=\"//politics.nytimes.com/congress/superlatives/vsparty/111\" /&gt;&lt;/div&gt;\n&lt;div id=\"missers\"&gt;&lt;esi:include src=\"//politics.nytimes.com/congress/superlatives/missers/111\" /&gt;&lt;/div&gt;\n&lt;div id=\"loneno\"&gt;&lt;esi:include src=\"//politics.nytimes.com/congress/superlatives/loneno/111\" /&gt;&lt;/div&gt;\n\nWhen Varnish receives this page (or serves it from the cache), it’ll insert the ESI content into the page by finding the content of those URLs in its cache (or by calling the back-end server). The user sees only the final page (unlike with JS callbacks). Not only can this help break down complicated pages into simple modules, but it can also help serve mostly static pages with some private dynamic content. Of course, using ESI does impose some performance costs, so as our VCL above illustrates, we execute ESI only if the response includes an X-RUN-ESI header.\n\nThat’s a look at how Varnish helps us (and Rails) scale. If you’re intrigued, explore the Varnish source, documentation and community - and let us know whether it helps you too."
        },
        {
          "id": "published-how-often-times-tweeted",
          "title": "How Often Is the Times Tweeted?",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/how-often-times-tweeted",
          "content": "I recently had the honor of speaking at the Chirp conference, where I got to stammer nervously about @anywhere and share a fun statistic I figured out a few days earlier: Someone tweets a link to a New York Times story once every 4 seconds. That is the sound-bite reduction of an interesting process, so this post explains how I figured that out using the Twitter streaming API\n\nThe difficulty with link tracking on Twitter is URL shorteners. To accurately count links to your site on Twitter, you theoretically have to expand any URL you see and then select the links to your site. This could be a daunting task, since it seems like a link is posted to Twitter every millisecond. Luckily, I can cheat.\n\nThe New York Times is a customer of the bit.ly Pro service. That means NYT-shortened links are assigned under the nyti.ms domain - and this behavior holds true for anybody shortening a link to NYTimes.com. Now, it is also true that bit.ly is not the only URL-shortening service out there, but it remains the predominant shortener. At the very worst, my choice to limit this to bit.ly means I’m underestimating and thus erring a bit on the conservative side (although I do not know how to measure how big of a margin of error that is).\n\nAs a result, measuring how much people tweet New York Times links was as simple as using the Twitter streaming API to look for the keywords nyti and ms. I used the TweetStream Ruby gem to search Twitter and added some logic to ensure I was only collecting tweets where those keywords appear in that order. Tweets were logged to a MongoDB database for several hours, with a minute time stamp attached to each one. Then it was a simple matter of grouping the tweets per each minute, sorting the values, and taking the median to see the average number of tweets per minute - which could then be inverted to get seconds per tweet.\n\nThe data ranged from a minimum of 4 to a maximum of 57 (that is, from once every 15 seconds to almost once a second), as the following chart of minute-by-minute counts demonstrates.\n\n\n\nFrom there, it is simple to calculate the median number of tweets per minute: 17, or roughly once every 4 seconds.\n\nOf course, a few hours’ worth of tweets on a Monday afternoon is perhaps not the most representative dataset, and I encourage further study by anybody who wants to explore it further. But this was a surprising look at how people share (and share and share) the news online."
        },
        {
          "id": "published-right-kind-of-stupid",
          "title": "The Right Kind of Stupid",
          "collection": {
            "label": "posts",
            "name": "Articles"
          },
          "categories": "published",
          "tags": "",
          "url": "/published/right-kind-of-stupid",
          "content": "Utter the seemingly innocuous phrase “mobile messaging platform,” and you quickly descend into a world of increasing complexity, filled with issues like carrier agreements, access controls, message size restrictions, subscription support, making the multimedia people happy, and the surprisingly hard problem of making sure the right messages get to the right people all of the time. All of which means that it’s months before you have even a basic system in place that you can use, much less one you can really enjoy enough to hack around with. Which is why I find myself posting today to sing the praise of silly hacks, or\nwhat would be called kludges in the classical programmer jargon.\n\n\n  n. A clever programming trick intended to solve a particular nasty case in an expedient, if not clear, manner. Often used to repair bugs. Often involves ad-hockery and verges on being a crock. In fact, the TMRC Dictionary defined ‘kludge’ as “a crock that works”.\n\n\nSpecifically, this ability to do clever if stupid hacks in the name of expediency is why I came to love twitter, one of the fun new Web 2.0 tools for the coding tool bag that’s taken a fair share of knocks from the “grown-up” crowd. On the face of it, twitter is pretty easy to mock. For starters, it’s not much of a messaging platform: users are limited to text messages of 140 characters or less. This medium tends to influence the message; because messages are so short, most tend more towards the pedestrian (eg, “Sitting down to breakfast, eating toast” or “looking at code while watching a movie”) rather than the profound. Messages are also usually broadcast to all your friends (and often the general public), making it seem like a cavalcade of inanity to the skeptical.\nAnd until very recently the site itself has been plagued by slowness and outages. Small wonder, then, that some columnists see the very ruin of civilization in it:\n\n\n  Why do we think we’re so important that we believe other people want to know about what we’re having for lunch, how bored we are at work or the state of inebriation we happen to be at this very moment\nin time? How did society get to the point that we are constantly improving technology so that this non-news can reach others even faster than a cell phone, a text message, a blog, our Facebook profiles? - Helen Popkin, Twitter Nation\n\n\nBut twitter has some magical properties that make it wonderful for hacking around: a simple, well-documented API, support for a variety of data formats, and the ability of any user to designate that tweets are sent to his or her cell phone as SMS messages. Put them together and you suddenly have a rather kludgey mobile message platform on the cheap. Or as I instead chose to phrase it to Mallary Tenore in her excellent article on news organizations using twitter: the right kind of stupid.\n\nFar from dismissive, “the right kind of stupid” is high praise. Using twitter’s APIs, I was able to get headlines from the New York Times feeds to my cell phone with only an idle afternoon and a few lines of Ruby. 
For instance, here is the basic code for posting a new message to twitter (from the Twitterize gem):\n\npost_args = {\n  'status' =&gt; status\n}\n\nurl = URI.parse('http://twitter.com/statuses/update.xml')\nurl.user = user\nurl.password = password\n\nresponse = Net::HTTP.post_form(url, post_args)\n\nAll you need to add is some code to parse feeds, a database to keep track of posted items, and a crontab to schedule it (there’s a sketch of this glue further below), and you have the makings of a truly Simple Messaging Service (although not always a reliable one). For newcomers, a complete service like Twitterfeed makes the process even simpler. I put up the main New York Times feed in early March 2007; today, it has 625 followers, although we had a surge of 100+ subscriptions in the last few days due to Mallary’s article and Twitter featuring us on the front page. In addition, I added other specific New York Times feeds a month or so later. The most popular of them has only 40 or so subscribers, however, so it’s clear that the general mix of stories in the front-page feed is the most appealing to readers. More interesting still, the official New York Times twitter feed is not the only New York Times account on Twitter. RSS and blogging guru Dave Winer set up his own independent NYT River of News account a week or so after my first one; it aggregates all of our major public feeds into one place. Far from being displeased, we here at Open are openly thrilled at these sorts of third-party projects, especially since we still have only begun to scratch the surface of the public feeds we have here at the New York Times.\n\nSimple is powerful. Feeds and twitter are a natural fit, but with Twitter’s simple API and cron you can turn any sort of data API into a twitter event stream (with event listeners, you could even stream irregular events like subversion checkins or server failures into twitter). To give another example, I wrote a simple weatherbot that calls the New York Times weather API and posts the current conditions for New York City to the nyt_weather twitter account. At the time I wrote it, I was working in a cubicle that only had a view of a dimly lit ventilation shaft, so it was very important to get the weather before I stumbled out to lunch without an umbrella. As an added twist, the program updates the avatar image for the twitter user, so I can see the weather at a glance as well. Doing this is surprisingly easy with the Mechanize gem in Ruby; we can script the actions to upload a new photo in Twitter’s web forms:\n\ndef upload_img(icon, user, password)\n  agent = WWW::Mechanize.new\n\n  # Login\n  page  = agent.get('http://twitter.com/account/create')\n\n  # Fill out the login form\n  form  = page.forms.action('/login').first\n  form.username_or_email = user\n  form.password = password\n  page  = agent.submit(form)\n\n  # Go to the upload page\n  page  = agent.click page.links.text('Settings')\n  page  = agent.click page.links.text('Picture')\n\n  # Fill out the form\n  form  = page.forms.action('/account/picture').first\n  form.file_uploads.name('user[profile_image]').first.file_name = \"/data/weather_imgs/#{icon}.gif\"\n  agent.submit(form)\nend
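As promised, here is a rough sketch of that feed-to-twitter glue, suitable for running from cron. The feed URL is the Times’ public home page feed; twitterize is a hypothetical wrapper around the posting code above, and the flat-file bookkeeping is purely illustrative:\n\n# Hypothetical sketch of the feedbot glue - run from cron every few minutes.\nrequire 'rss'\nrequire 'open-uri'\n\nSEEN_FILE = 'posted_items.txt'\nseen = File.exist?(SEEN_FILE) ? File.readlines(SEEN_FILE).map { |l| l.chomp } : []\n\nxml  = open('http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml').read\nfeed = RSS::Parser.parse(xml, false)   # false skips strict validation\n\nfeed.items.each do |item|\n  next if seen.include?(item.link)\n  twitterize(\"#{item.title} #{item.link}\")   # hypothetical wrapper around the posting code above\n  File.open(SEEN_FILE, 'a') { |f| f.puts(item.link) }\nend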
So, there you have two simple examples of using a seemingly “stupid” technology to do really interesting things. And there are possibilities far beyond this even (e.g., interactive twitter bots, visualizations like Twittervision or Twitter Blocks, or even just new content streams). But that’s the beauty of the right kind of stupid - it can lead to some pretty smart ideas."
        },
        {
          "id": "projects-wikileaks-war-logs",
          "title": "Wikileaks War Logs",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/wikileaks-war-logs",
          "content": "In 2010, the Wikileaks organization approached three news outlets - The New York Times, the British newspaper The Guardian and the German Der Spiegel - with a trove of military dispatches that were later revealed to be leaked by the whistleblower soldier Chelsea Manning. This was the second of the big revelations from Wikileaks, following the earlier Collateral Murder release in April 2010 of cockpit footage from a US military helicopter that strafed a crowd in Iraq including Reuters journalists. It was later followed by a leak of US Diplomatic cables sent within the US State Department.\n\nAfter a few months of reporting, all three organizations released their stories on the same day. The New York Times package for these stories was called The War Logs, and it was updated in two batches: first for Afghanistan and then for Iraq. As a data journalist, I worked with a few other developers (primarily, the great Alan McLean) to make an internal tool that would allow the team of Times journalists to search, map and visualize the data. We joked it was an admin for very few people, but its impact was large.\n\nWorking with the Data\nThe War Logs were a collection of dispatches filed by American and allied forces all throughout Afghanistan and Iraq. Although they contained some metadata, the bulk of the contents were the messages, which could often be very dense with technical military jargon. The NYT published several redacted examples of these dispatches from both Afganistan and Iraq if you want to get a taste for them, but many of them were like this report of an interpreter being attacked and killed by allied forces:\n\n\nBLUE ON WHITE BY 1ST RECON S OF NASSER WA AL SALEM: 1 CIV KILLED, 0 CF INJ/DAMA\nAT 200100C FEB 06, A 1ST RECON SNIPER TEAM WHILE CONDUCTING CLANDESTINE SNIPER OPERATIONS\nIVO HAJI RD IN THE ZAIDON ENGAGED (1) MAM WITH (4) 5.56MM ROUNDS IVO (38S MB 09971 79804)\n4KM S OF NASSER WA AL SALEM. THE MAM WAS PID W/ AK-47 CREEPING UP BEHIND THEIR SNIPER\nPOSITION AND WAS SHOT IN THE CHEST W/ (2) 5.56MM ROUNDS AT 15M. QRF WAS LAUNCHED TO EXTRACT\nTHE SNIPER TEAM. THE MAM WAS SEARCHED BY TEAM AND RECOVERED (1) AK-47, (2) MAGAZINES OF\n7.62MM, DOUBLE TAPED, (1) LARGE KNIFE, (1) ID CARD WITH \"----- -----\" WRITTEN ON CARD.\nMAM WAS ALSO NOTED TO BE WEARING A TRACKSUIT AND SEVERAL WARMING LAYERS TO INCLUDE\n(2) PAIRS OF SOCKS. THE BODY WAS LEFT BEHIND AT (38S MB 09971 79804) UPON EXTRACT OF\nTHE SST. PIONEER OBSERVING ON SITE W/ NSTR.\n\nUPDATE: UPON FURTHER INVESTIGATION THE KIA TURNED OUT TO BE THE PLATOON'S INTERPRETER THAT\nWAS SEPARATED FROM UNIT. THE BODY WAS RECOVERED AND IS CURRENTLY LOCATED AT FALLUJAH SURGICAL.\nTHIS ACTION IS NOW CONSIDERED A BLUE ON GREEN. IT RESULTED IN (1) IZ KIA (IRAQI INTERPRETER\nEMPLOYED BY TITAN.\n\n\nThere was some metadata attached to each of these dispatches, but a fair amount of it (beyond the date and time) was hand-entered and often unreliable. Much of my work was looking for metadata and details within the content of the messages themselves. This involved a few different threads of implementation:\n\n  Building an API to support an web admin that was provided to journalists for searching the dispatches and making sense of their contents to relate it to their reporting and lines of investigation.\n  Unpacking the jargon and acronyms: the Times does have a few reporters on staff (like the C.J. Chivers) who came from military backgrounds, but most of them were not familiar with it like me. 
This data also helped me to pitch an interactive graphic that would illustrate the horrors of post-invasion Iraq.\n\nA Deadly Day in Baghdad\nAmong the dispatches in the Iraq War Logs there were many that were just American troops documenting the aftermath of a city gripped by sectarian violence:\n\n\n(CRIMINAL EVENT) MURDER RPT BY NOT PROVIDED IVO BAGHDAD (ZONE 1)\n(ROUTE UNKNOWN): 1 ISF KIA 27 CIV KIA\n28X CORPSES WERE FOUND THROUGHOUT BAGHDAD:\n2X HANDCUFFED, BLINDFOLDED, AND SHOT IN THE HEAD IN AL JIHAD (MB393859, MAHALAH #887, 1136 HRS, HY ALAMIL PS)\n2X SHOT IN THE HEAD IN AL HURRIYA (MB367918, 1340 HRS, AL HURIYA PS)\n1X SHOT IN THE HEAD IN AL ALAMIL (MB374824, 1400 HRS, HY ALAMIL PS)\n1X SHOT IN THE HEAD IN AL JIHAD (MB332816, MAHALAH #895, 1245 HRS, HY ALAMIL PS)\n1X SHOT IN THE HEAD IN SADR CITY (MB502242, 1500 HRS, AL RAFIDIAN PS)\n6X SHOT IN THE HEAD IN SHEIKH MAARUF (MB425880, MAHALAH #212, 1620 HRS, AL JAAIFER PS)\n\n\nThis is completely horrifying to read, but it felt vitally important to document. This violence wasn’t directly inflicted by American forces, but it was a direct result of the American invasion and subsequent destabilization of the country. This blood was also on our hands. With the help of graphics editors and a reporter providing local context, I was able to present this data in an interactive graphic:\n\nA deadly day in Baghdad\n\nFor more details, check out the following resources:\n\n  MGRS Explained: a presentation I gave at a data journalism conference\n  Reporting Wikileaks: an in-depth presentation on how this graphic came together\n  A Columbia Journalism Review interview on creating the visualization\n  My coworker Alan McLean’s presentation on Telling Stories With Data\n  Connecting With the Dots: some thoughts on how we remind readers that the dots in our infographics are people"
        },
        {
          "id": "projects-civic-tech",
          "title": "Civic Technology",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/civic-tech",
          "content": "After nine years in data journalism at the New York Times, I had to admit that I was feeling pretty burned out from the media business. Journalism is where I’ve had some really amazing experiences, but it’s also where I worked myself so hard at times that I had come down with shingles and pneumonia. I wanted to try something different, but it had to be something that served the public.\n\nIn 2015, I found out about 18F. And that organization literally changed the entire course of my career.\n\n18F\n18F was a novel experiment operating within the federal government: what if we could build technical capacity within the government by offering software development services to other agencies without the need for external contracting? 18F was located within the General Services Administration – whose headquarters are at 18th and F street NW – and entirely funded by one of the GSA’s internal discretionary funds that are normally used to purchase vehicles or buildings. 18F would then otherwise operate like a consulting firm by charging agencies a rate for services that would then be reimbursed back into the fund. This clever approach allowed for 18F to be spun up without the need for explicit appropriations and for GSA to keep it operating even when its costs exceeded its income during the early years when I was there.\n\nUnfortunately, that same flexibility made it easy for DOGE to destroy 18F in a single day years later by eliminating all of its staff in a mass-firing at midnight on a Friday and then deleting the website. This followed a similar evisceration of the US Digital Service, which suffered the additional indignity of being renamed the US DOGE Service. I am not going to document how woefully destructive and inadequate DOGE has been for the task – I have a whole other website for that – but I hope these monsters will be haunted to the end of their days for their actions. And I will encourage you to visit 18f.org, where the spirit of 18F lives on.\n\nWhen I joined 18F in 2015, it was about a year after its founding, and the enthusiasm was infectious. We had engineers, developers and product managers working remotely from all over the United States and the offices in DC. Our Slack was filled with various goofy jokes and reaction emoji to cheer on each win – I particularly liked the shorthand of 💪🖥️🇺🇸 that we used as shorthand for the field of civic tech; it was hard work, but it was hard work for the American people. Other of my favorite memories of the culture include how typing robot dance party in our Slack would trigger a little line of dancing robots in the chat or the Slack channel named #hi where all we would ever post was the single word “hi.”\n\n18F is also where I first learned how to do agile software delivery. This meant learning how to work in two-week sprints with teammates, how to accurately break down work into tickets and estimate its difficulty, how to use customer research so we could build the product that our customers (and the public) needed. For my first project at 18F, I joined the MyUSA team in progress, which was developing a tool for users to identify themselves and choose what information they wanted to share with government websites instead of having to re-enter their information everywhere. The project was ultimately scuttled, but it was a precursor to login.gov. 
After that, I helped to build out an online tool for an experiment in how the government might be able to pay for small software tasks without having to issue a complex procurement – a tool that allowed contracting officers to create reverse auctions for engineers to bid on small tasks that could cost no more than the $10,000 micro-purchase limit. Once the auction closed, the winning engineer would have to deliver the work and then be reimbursed for their efforts. It was just one of the many fun experiments 18F ran in new ways to use procurement.\n\nMy last project at 18F was a modern revamp of the FBI’s Crime Data Explorer in 2016. Previously, the FBI had released annual reports of crime statistics collected from each state as a series of static HTML tables on a website. With the impending launch of\nthe more advanced NIBRS format for data collection, the FBI wanted to make a more modern site to highlight the new information available for the first time in NIBRS and allow users to interactively explore the data. 18F built the initial version of the site for the FBI as a project in Python with a database backend that could load the data. I asked to be on this project both because I’d been wanting to work on a data-related product and because it was a chance for me to learn Python properly. I particularly enjoyed building out some SQL-driven crosstab analyses with the new NIBRS data and providing pages where developers and data journalists could download the raw database tables if they wanted to explore the data further.\n\nNava PBC\nWhile the first Trump administration was not overtly hostile to 18F, it wasn’t supportive of developing technical staff within the US government. At 18F, I noticed that projects began to shift more towards defense and security work, and that wasn’t an area I was interested in working in. And so, I departed 18F in 2018 and joined up at one of the new civic tech consulting companies, Nava PBC. Traditionally, the government has either purchased software as complete products or contracted for custom software services from the large consulting companies often referred to as “the Beltway bandits” due to their pricing and licensing practices. Nava’s founders had been among the teams that were brought in to salvage the disastrous launch of healthcare.gov, and they founded the company in 2015 to build software using the same responsive practices for agile software development that were being pioneered in government by public entities like 18F and USDS as well as by other small contracting companies within the newly formed Digital Services Coalition.\n\nAfter 18F, Nava felt very much like home from the start. There was a goofy Slack; teams were made up of developers, designers, product managers and more from across the country; there was the same ambition to reshape how government digital services could work and how we could replace the failure-prone waterfall planning processes that lead so many big IT projects into failure. The only difference was that I was now outside the government again, working on contracts that Nava had competed directly to win against our frenemies and to deliver the specific work we had been engaged for. My first project at Nava was to join a team developing the submissions API used by doctors and registries in Medicare’s Quality Payment Program.
This was a new effort to essentially grade the doctors providing services to Medicare, so that better doctors would be reimbursed at slightly better rates and all the practices would have an incentive to improve the quality of care. When I joined, there were 8 engineers on the team as well as a project manager, a product manager and a business analyst. Everything was built in Node.js; I didn’t know Node, but I always love the challenge of learning a programming language on a new project!\n\nThe QPP project honed some of the delivery skills I had picked up at 18F. Nava was one of 14 different scrum teams from multiple contractors working on the QPP ecosystem. Everything was coordinated through the convoluted Scaled Agile Framework for managing multiple teams. After a few months as a senior engineer, I suddenly became the Engineering Lead for the project when the team’s lead was promoted to VP of Engineering at Nava. This meant that I was ultimately responsible for planning out the work with my team at one of the quarterly Program Increment sessions held in a ballroom outside of Baltimore. It also meant that I now had to become a people manager for the first time, responsible for performance reviews and career development for the engineers on my team. I absolutely loved it, and I continued to be a people manager over multiple projects at Nava.\n\nAfter a brief interlude on a project for improving IT operations at the Centers for Medicare and Medicaid Services (I got to learn Go on the job!), my last project at Nava was to join as the tech lead on the Medicare Payment System Modernization effort to eventually move Medicare payment processing off of the mainframes running COBOL and into the cloud. This was careful, deliberate work – any outage would affect the payment of millions of claims a day, and Medicare spending alone accounts for 4% of the US GDP – undertaken by a consortium of CMS employees, USDS staff and small agile companies that would bid on specific projects. Our project was the Replicated Data API, which would for the first time share claims with other parts of CMS within a day of their filing instead of after the mandatory 14-day delay required for the accounting system. This work involved translating data from COBOL files into a modern relational schema and then streaming millions of claims in real time down to clients. I got to learn gRPC and relearn Java (which has improved a lot over the past few decades!) on the project, so that was exciting. And it was also just a test of what we could do with a small team – we had only a product manager, a tech lead, a part-time scrum master and 3 other engineers. Our API’s data was going to be provided by a pipeline out of the mainframe itself; when that project had to be reworked, we had to adapt and load the data ourselves every day from a collection of COBOL data format files. It was a huge expansion in the scope of work for the team, but we built a solution that has run flawlessly and cheaply every day, with only a few months’ delay from our original timeframe. This work was deeply technical, and all too often we had to grapple with random existential doubts, but I consider it one of the finest projects I’ve ever worked on.\n\nConsumer Financial Protection Bureau\nIn 2022, I felt like it was time to move on from Nava after several years on the job. After a ludicrously long federal hiring process, I was hired as a Supervisory IT Specialist at the Consumer Financial Protection Bureau within the Design &amp; Development team.
We used to joke that D&amp;D was a bit like Yamaha or Kawasaki – those big conglomerates that make both motorcycles and grand pianos – except our portfolio ranged from video and photography to print materials to the website and other custom internal applications used by other teams at CFPB. As the supervisor of the App Dev team, I was the people manager for 12 engineers working on front-end and back-end web development across several different teams. There were two things that were truly remarkable about CFPB. The first was the mission – I looked forward to the daily email that shared news clippings of what CFPB was up to, whether it was introducing new policies or holding bad actors to account. The second was the people, all driven by a patriotic mission to protect the American consumer, and also some of the kindest and most thoughtful people I have ever worked with.\n\nThe engineers that I managed had been there from the start. Many joined the CFPB in its 2013 and 2015 fellowship classes and stayed on afterwards to continue working for the agency. One of the engineers had been at the CFPB for 13 years, since before Nava and even 18F existed. CFPB was the first digital services organization in any government, and the team had created and continued to support such fundamental work as the website and the design system over the past decade. I will admit that I didn’t always know what to make of this – I was more used to “greenfield” projects that started from nothing – but in time I really grew to appreciate the different sort of development work that was happening at the agency. It takes very different skills to code systems for longevity. Modern developers like to deride COBOL programs, but how many of us have created systems that span decades of service? Through its continued presence, CFPB was figuring out how to build modern software for the long haul while keeping it stable and accessible for all. I would have stayed with them on this journey until the end of my career if I could, but the Trump administration ruined that plan.\n\nMagnum Opus, LLC\nOn July 4, the GOP-controlled Congress passed H.R. 1, also known as the “One Big Beautiful Bill Act,” which contorted the budget reconciliation process to extend tax cuts for billionaires and radically reduce the government to make that happen. CFPB’s union (buy their merch!) had been successfully fighting off the Trump administration’s efforts to unilaterally shutter the agency without Congressional approval, but the new bill included a provision that cut the maximum budget for the CFPB in half, making it almost certain that the government would eventually succeed in forcing layoffs at the agency. I hated to leave and the job search was painful, but there is no way I would or should be spared in any future layoffs that might occur. And so, I wound up taking a job at MO Studio, another member of the Digital Services Coalition.\n\nIronically, my latest work is also linked to H.R. 1 through its provisions that require every state to implement a system that will validate new work requirements for Medicaid recipients by January 1st, 2026. Past efforts to enforce such requirements have universally been disasters for Medicaid recipients in their states, with most qualified people losing their benefits due either to poor communication about the new rules or to broken systems that make it difficult for beneficiaries to comply.
Some observers have noted that requiring validation in every state on an extremely short timeline like this makes it seem like Congress is purposefully setting up states to fail their citizens in order to save some money on Medicaid, especially since many states also need to modernize their Medicaid processing systems to better support automated renewals. MO is one of many organizations, both private and public, that are trying to avert failure. I myself will be working on improving Medicaid systems to handle this validation and renewal in Pennsylvania, alongside some old faces from every part of my civic tech career. At 18F, we had a saying, “hard work is hard,” for projects that required very complicated work to succeed. This is going to be true for dozens of states in the US, and I only hope that all of us can pull it off."
        },
        {
          "id": "projects-data-journalism",
          "title": "Data Journalism",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/data-journalism",
          "content": "The story of the Interactive Newsroom Technologies team started with a tragedy.\n\nDuring the afternoon rush hour on 8/1/2007, the I-35W Mississippi River Bridge in Minneapolis collapsed, killing 13 people and injuring hundreds. Within the New York Times, the Computer-Assisted Reporting team had a database of bridge inspection data for every bridge in the US available within hours of the accident. There was a suggestion that the NYT should make a searchable site that would allow users to see the inspection data and condition of bridges near them as well as the most neglected bridges in the US comparable to the collapsed span in Minneapolis. But, the only options to do would be to create either a Flash app and some sort of custom data backend in the Times’ antiquated custom programming language or to individually publish tens of thousands of pages for each bridge in the Times’ CMS - in short, there was no good way to build an interactive application for news in the Times’ technology stack. It was time to make one.\n\nAt the time, I was working for the digital side of the newsroom as a Senior Software Architect, primarily on a project to port over some functionality for things like parsing session cookies from the Times’ custom programming language used on its website to more modern languages like PHP or Ruby. I also was running the @nytimes twitter account, but I was also interested in applying technology towards journalism. The newsroom, in turn, had some staff who were interested in applying technology to the Times. There already were strong teams inside of Graphics, Multimedia and Computer-Assisted Reporting, and I had met the journalist Aron Pilhofer and Matt Ericson a few times already - once when we were at a Google Dev Day event in Mountain View and also once when I crashed an election planning meeting to argue for greater digital integration into how we would report the 2008 elections. We’d already been talking about next steps, so when the moment finally arrived, we were ready to go. As the absurdly-titled coverage “The New Journalism: Goosing the Gray Lady” described it, things came together rather quickly from there:\n\n\n  ‘It was surprisingly easy to make the case,’ says Pilhofer, describing what he calls the ‘pinch-me meeting’ that occurred in August 2007, when Pilhofer and Ericson sat down with deputy managing editor Jonathan Landman and Marc Frons, the CTO of Times Digital, to lobby for intervention into the Times’ online operation-swift investment in experimental online journalism before it was too late.\n\n  ‘The proposal was to create a newsroom: a group of developers-slash-journalists, or journalists-slash-developers, who would work on long-term, medium-term, short-term journalism-everything from elections to NFL penalties to kind of the stuff you see in the Word Train.’ This team would “cut across all the desks,” providing a corrective to the maddening old system, in which each innovation required months for permissions and design. The new system elevated coders into full-fledged members of the Times-deputized to collaborate with reporters and editors, not merely to serve their needs.\n\n  To Pilhofer’s astonishment, Landman said yes on the spot. 
A month later, Pilhofer had his team: the Interactive Newsroom Technologies group, ten developers overseen by Frons and expected to collaborate with multimedia (run by Andrew DeVigal) and graphics.\n\n\nI was one of the staff who came over from the digital side, alongside a few transfers within the newsroom and some new hires as well. And that’s where I remained until I left the Times in 2015 (I did move down to DC in 2012, so I did eventually change my desk). I somehow avoided being in any media profiles myself (note from future me: stop being so modest and shy!), but I did play my part in many projects and in the culture of the team.\n\nOur Technology Stack\n\nAs I mentioned above, the Times website was running a custom but limited programming environment (it didn’t support nested hash tables, for example, because the two developers who built it felt nobody needed them) with similarly constrained connectivity to backend data sources. To be fair, given the high volume of traffic hitting the Times even then, it made sense to be careful about technology choices, especially when you had to customize responses for each user (in the form of personalization and ad-targeting). But it also profoundly limited what developers could build, both in what the language itself supported and in the lack of third-party libraries. And this was the middle of the Web 2.0 era; there were now a lot of choices for building web applications!\n\nWe resolved early on to use Ruby on Rails to build code for Interactive Newsroom Technologies. At the time, many of the other newsrooms with interactive components were using the Django framework, originally developed for use at the Lawrence Journal-World newspaper in 2003-2004, so why did we pick Rails? For a few reasons:\n\n\n  It was also a framework for rapid web development (extracted from the Basecamp product), but slightly less opinionated than Django and with remarkable metaprogramming support\n  NYC had a very strong Ruby developer community in the NYC.rb group\n  Rails’ ActiveRecord ORM library was different in that it looked at the database schema to figure out the fields and types of Ruby objects loaded from the database. This was very useful when you were handed a complete database by the Computer-Assisted Reporting team and needed to make an interactive for it (see the sketch after this list)\n  A bunch of the developers had been using it for personal projects and just thought it was neat
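\n\n\nTo illustrate that third point, here is a minimal sketch of ActiveRecord’s schema inference at work, assuming a hypothetical bridges table handed over by the Computer-Assisted Reporting team (the table and column names are made up):\n\n  # No field declarations needed: ActiveRecord discovers the columns and their\n  # types by inspecting the existing table at runtime.\n  class Bridge &lt; ActiveRecord::Base\n  end\n\n  Bridge.columns.map { |c| c.name }                    # the columns found in the table\n  Bridge.where(state: 'MN').order(:condition_rating)   # query immediately, no schema declarations\n\n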
We were also the first team within the Times to use Amazon Web Services. Since we were organized out of the technology side of the newspaper, we were told that we would need to deploy applications using their approved mechanism. In this case, that meant serving the application from an older version of Solaris x86. To deploy, we would need to create patch files and then give them to the sysadmins to push to servers on a regular patch cycle. Or, we could put AWS on a purchase card, spin up our own EC2 instances and then deploy to them whenever we wanted with Capistrano. You can guess what we did, but I did spend a few days trying to make the Solaris x86 approach work first.\n\nOne of the most important things we did early on was getting approval to run our own subdomain, projects.nytimes.com. This allowed us to avoid having to embed or route our traffic through the main site, which was a relief both to the CMS/regular site team and to us. This is also what let us pick our own infrastructure, but it meant that we were responsible for uptime and on call to make sure it didn’t go down. It also meant we had to write our own code to make our pages look like part of the regular site. My coworker Alan McLean made a frame in CSS/HTML to put around our content to make it look like a Times page. Luckily, I had already ported some other special code like session cookie parsing to PHP, so I could make versions in Ruby too. We launched with a few EC2 servers and a load balancer.\n\nLater, we built out our infrastructure to be more scalable. We added Varnish as a reverse proxy cache to make sure we could handle huge spikes in traffic. We switched our web server over to nginx and then built on top of a custom infrastructure-as-code system that we called Red October. We also had a few on-premises Linux boxes under some desks for stuff that was too sensitive to put in the cloud. The first one was a random PC with a cheap and shiny black plastic exterior that we all called Jeff Vader because of the Eddie Izzard “Death Star Canteen” bit. Its brother was Chad Vader.\n\nNotable Projects\n\nThere were dozens of us in Interactive News at some points, so we often had different people working on different projects at the same time, and individual people juggling multiple projects at once (some with urgent deadlines, some slow-burning). There were also regular maintenance tasks to upgrade old apps, improve our infrastructure and patch servers. We didn’t really do agile, but agile isn’t really designed for large portfolios of many small applications.\n\nHere’s a roundup of some notable projects I enjoyed working on. These are just the ones that I worked on in some way, and there are many great projects that Interactive News did that I wasn’t involved with. Many of these links are probably broken or will break at some point in the future. It still pains me that it’s easier to read newspapers from 100 years ago than to use interactives from 15 years ago. I wish we had spent more time in the early days figuring out how to properly archive our content when it was done. But we were moving fast like a startup and trying to show what we could do. And we did a lot!\n\nNFL Playoffs 2008\n\nOur first project was an interactive app we built for the 2008 NFL Playoffs that launched publicly on January 8th, 2008 and contrasted the offense-defense patterns of different playoff matchups as the playoffs progressed, with a different page for each game. It was implemented in Ruby on Rails 2, according to a tweet of mine. Unfortunately, it ran on a custom subdomain, 2008playoffs.nytimes.com, and it no longer works. Ah well, as I was saying about how hard it is to preserve digital projects…\n\nElections\n\nProbably my single most important project was the work I did on election results within the Times from 2008 to 2014. Funnily enough, despite my advocacy at that meeting, I wasn’t much involved with the coding work for handling the primaries in 2008. But in July 2008, I started pairing with Ben Koski on how to improve and revamp the loader for the general election. It was a great experience, and the resulting code was the basis of our election results collection and reporting for the next six years. For more details, see the project page for the election loader at /projects/nytimes-election-loader.\n\nOlympics\n\nThe other big data project was the Olympics. In 2009, we decided our next big undertaking would be ingesting Olympics data, both for our own purposes and as a revenue stream of licensed interactive components for other news organizations. 
The 2010 Winter Olympics were going to be in Vancouver from February 12 to 28, 2010, so we started working in the summer and fall of the year before. Honestly, this was the hardest project we ever did, and I don’t think we were entirely prepared for the amount of work in front of us before we arrived at the monumental task of ingesting Olympics data over 2+ weeks. We pulled it off for the most part, but there were a lot of frustrating moments and exhausting days. Still, this work became the basis for our coverage of the 2012 Summer Olympics (London) and 2014 Winter Olympics (Sochi), among others.\n\n[Image: The admin from the Olympics app]\n\nTo understand why this was so challenging, you need to understand the Olympics data feed, or ODF. To receive Olympics data, participating organizations set up a server which receives a firehose of XML messages for every event while it is happening, including partial and full updates, roster communications and record notifications. Furthermore, each sport has its own specific customizations to the base message schema, which you need to be able to parse and then create visualizations for. For more details on the technical operation, read this excellent overview of the technology for the London Olympics 2012 by Jacqui Lough, or this piece on all the audience considerations for the Olympics by Tiff Fehr.
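\n\nAs a rough illustration of what handling one of those messages might look like (the real ODF schema is vastly larger and customized per sport; the document types, attributes and helper functions here are simplified stand-ins):\n\n  require 'nokogiri'\n\n  # Dispatch a single ODF message based on its declared document type.\n  doc = Nokogiri::XML(message_xml)\n  case doc.root['DocumentType']\n  when 'DT_RESULT'\n    doc.xpath('//Result').each do |result|\n      update_result(result['Code'], rank: result['Rank'], value: result['Result'])\n    end\n  when 'DT_SCHEDULE'\n    reload_schedule(doc)\n  end\n\n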
[Image: Jacqui Lough holding the schema for Olympics 2012]\n\nBesides the effort, I remember a lot of the silly moments. The jokes about our favorite mascots from Vancouver (Quatchi is the best!) and the creepy cyclopean mascots of London 2012. The comedic descriptions of an unfamiliar horse in the Modern Pentathlon. The ludicrously large database schema (pictured above with Jacqui) needed to represent all the entities and their relationships to each other. I also remember that it was just physically exhausting, and I wound up with pneumonia from running myself so ragged. Elections only last one night, but an Olympics runs for several weeks.\n\nReader Input\n\nOver the years, the Interactive Newsroom Technologies team tried a variety of experiments in collecting and presenting input from our readers. The earliest example was the Election Night Word Train, which ran on Election Night 2008 and asked Obama and McCain voters to describe their mood in a single word. Another novel example was the 2009 feature Healthcare Conversations, which grouped comments into various rooms with little figures of people to show where the current discussion activity was happening. This was echoed in the feature “I Hope So Too,” which collected reader comments about the Obama administration (but is now completely nonfunctional).\n\n[Image: The Moment in Time admin during our submission peaks]\n\nIn a similar vein, we built an interactive tool for readers to submit photos related to Obama’s first inauguration in January 2009. We later generalized this into a generic tool for accepting and moderating uploaded photos from readers called PUFFY (Photo Uploader For You), which was then followed by another reworking of the application called STUFFY. That tool’s biggest moment was the “Moment in Time” project, where readers sent in more than 10,000 photos, all taken at a single moment across the globe and stitched into an interactive graphic that let readers spin the earth and see the photos from a given area. Unfortunately, it also doesn’t work anymore, but it was pretty impressive at the time!\n\nDocument Dumps\n\nAs a newsroom team, we also would participate in important work driven by the news cycle. One early example of this happened in March 2008, when Hillary Clinton’s calendar from Bill Clinton’s presidency was released by the National Archives. After receiving scans of the documents, we spent a pretty frantic 24 hours building an online document viewer that allowed Times reporters to record annotations so readers could understand the context of important pages. This technical approach later became the basis for DocumentCloud, which is still in operation today.\n\nAnother application, one that I really wish were not still in operation, is the Guantanamo Docket. Originally built out from data provided by Wikileaks, it has been supplemented and updated over the years to track the releases of the various detainees who have been held at the prison. The application itself has gone through several major revisions, with different members of Interactive News contributing over the years. We thought it would only be in use for a few years. As I write this, there are still 9 detainees left in the prison. And so the app continues to run, because it still fulfills its purpose.\n\nDespite our skills, our team didn’t participate in as many journalistic projects as I would have hoped. All too frequently, we were invited in near the very end and asked to make an interactive when all the original reporting had concluded. One exception to this outcome was the Toxic Waters series by the investigative desk, which looked at common pollutants in drinking water across the US. Interactive News contributed an interactive tool that let readers examine the pollution in their own drinking water; this data also helped to inform the reporting of the piece. Similarly, when the Times was given leaked war logs and diplomatic cables by Wikileaks, our team built out internal admins for reporters to search and read the data. I’ve put information about the Wikileaks project on its own page.\n\nSilly Stuff\n\nIt wasn’t all breaking news and serious projects. We had a lot of fun working together, and I miss the camaraderie of the team. We also sometimes would build silly things. I remember an April Fools’ project of a system for sending large files as tens of thousands of tweets. We would sometimes crank up “Highway to the Danger Zone” when it was time to do a deploy. I already mentioned Jeff Vader and his brother Chad. And we sometimes did silly projects like Times Haiku, which has its own separate page as a significant personal project to me.\n\nWhen I left the Times, the team gave me as a sendoff a special Twitter account that commemorated all my most memorable commit messages (“This time it’ll work!” “Okay, this time!” “Dammit!”) as well as a framed photo of the now-revoked SSH key that I used to connect to servers. It was both silly and incredibly thoughtful at the same time. 
In other words, the perfect synopsis of how Interactive Newsroom Technologies operated.\n\nFurther Reading\n\n\n  A talk I gave on data journalism for TimesOpen\n  The New Journalism: Goosing the Gray Lady\n  Talk to the Newsroom: Interactive News Collaborative\n  The Journalist as Programmer, International Symposium on Online Journalism, Spring 2012\n  Newsdev About-INT repo"
        },
        {
          "id": "projects-nytimes-election-loader",
          "title": "The New York Times Election Loader",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/nytimes-election-loader",
          "content": "Of all the projects that I worked on as part of the Interactive Newsroom Technologies team at the New York Times, I would count the sustained effort on the Election Results loading system as my most significant work and probably some of the best work I’ve ever done. It wasn’t so much a dazzling technical marvel (although there were some innovations I’ve especially proud of) as it was a reliable workhorse, doing its job through primaries and general elections, for federal or mayoral results, with no outages and only one serious bug in production (that it happened on election night 2012 wasn’t great). This page collects some expanded notes on the election night loader, how election nights were at the Times and various pictures I have posted to social media, etc. in the past, now collected here.\n\nBefore we begin, I wrote a basic overview of how the election loader works in 2012 that touches on some of the basics of how elections are modeled. It’s worth reading, if you haven’t already. I also touch on how it felt on election night 2008 in my essay on leaving the New York Times\n\n\n    \n        \n            \n            Special branded M&amp;Ms that we received for Election Night 2014\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nI also want to note that software development is very often a team sport rather than a solo endeavor. Although I will describe many things where I was a primary contributor, I was not the only one working on this code or other related election-related technologies, and I am indebted to my various colleagues in the newsroom who built front-tools, graphics and visualizations, as well as those who coded parts of it with me, helped with testing and helped with hosting infrastructure. This project would not have been possible without the coding help of Jacqui Lough, Michael Strickland, Erik Hinton, Tyson Evans, Derek Willis, Andrei Scheinkmann, Aron Pilhofer, Brian Hamman, Matt Ericson, Archie Tse and Matthew Block, among others I have accidentally ommitted in my list. The core of the loader was created through dedicated pair-programming sessions with Ben Koski, and he certainly helped shape it then as much as I did. I also would like to thank the journalists I worked with, especially the race-calling team of Janet Elder and Rich Meislin (sadly both RIP).\n\nHigh-level Overview\n\nThe Election Loader was a program written in Ruby on Rails that was operated over the years 2008-2014 and enhanced with new functionality during that time. It consisted of routines to fetch election results from the Associated Press (provided via FTP), load them into results tables and publish out static pages based on what changed. These were optimized to run incredibly quickly (mostly through low-level SQL operations on a database), processing 51 states within a minute before beginning the loop again. It also included an admin for editing specific configuration values about candidates and races, and which also included the “Meislomatic” interface for monitoring races and allowing New York Times editors to make manual race calls on election nights.\n\nHow the AP Provided Elections\n\nAt the time, the Associated Press was the only game in town for getting election results in realtime during elections. They have an impressive operation, with stringers located in precincts across the US calling in results to a central coordination center. 
They had two tiers of service for people who worked with election results: one with a delay of a few minutes before result changes showed up in the feed, and another (the “dollar sign and two commas” plan, as I heard it jokingly described) for the premier TV networks. More recently, I believe the AP has moved to a RESTful API with JSON, and there are competitors like Reuters and DecisionDeskHQ also providing results, but at the time we were working with it, the “API” was simply an FTP server with a collection of three files for every state with results:\n\n\n  The candidates file had the names of the candidates. This would likely not change once created for a given election date. Each candidate would have a unique candidate_id. If a candidate was in multiple races, they would have multiple rows in this table (one for each race) with a single unique politician_id (which also stayed the same between primaries and general elections).\n  The races file was a collection of high-level race information. Each race would have its own unique race_id. There actually would sometimes be several races for an election: the state-wide totals as well as county-level results. This table would also tell you how many precincts had reported.\n  The results file would show the specific results for a given candidate and race, as well as special flags to indicate if the AP had called the race or if it was going to a runoff.\n\n\nAnd that was it. Three tables were all you needed to describe the current state of an election, but there is a lot of additional context that often needs to be computed from the information provided: for instance, the delegate counts towards a presidential nomination, or how a candidate is faring against the projected totals needed for a win. We’ll get to that, but the initial challenge was just how to load this data repeatedly during an election night without going astray.
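\n\nTo make the shape of the feed concrete, here is a minimal sketch of how those three files might map onto ActiveRecord models once loaded; the class names, columns and associations are illustrative rather than the actual schema:\n\n  # Hypothetical models mirroring the three AP files.\n  class Candidate &lt; ActiveRecord::Base\n    # columns: candidate_id, politician_id, first_name, last_name, ...\n    has_many :results, primary_key: :candidate_id, foreign_key: :candidate_id\n  end\n\n  class Race &lt; ActiveRecord::Base\n    # columns: race_id, state_id, office_id, race_type_id, precincts_reporting, ...\n    has_many :results, primary_key: :race_id, foreign_key: :race_id\n  end\n\n  class Result &lt; ActiveRecord::Base\n    # columns: race_id, candidate_id, vote_count, ap_called, runoff, ...\n    belongs_to :race, primary_key: :race_id, foreign_key: :race_id\n    belongs_to :candidate, primary_key: :candidate_id, foreign_key: :candidate_id\n  end\n\n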
Key Innovations of the Loader\n\n[Image: A view of the terminal output of the loader’s command-line execution. I liked to color the production terminal in a special color to distinguish it.]\n\nWe were big users of Ruby on Rails from a very early date at the Interactive Newsroom Technologies desk. On the programming side, many of us had come from more verbose statically-typed languages, and I appreciated the cheeky conciseness and mild anarchy of Ruby compared to other languages. I’ve also been using Python for years, but it’s hard for me not to read exhortations like The Zen of Python in anything but a fusty tone compared to Ruby’s “There’s More Than One Way To Do It” ethos and the weirdness of why’s Poignant Guide to Ruby.\n\nOne of Rails’ key features for us was its Object-Relational Mapping (ORM) library, which allowed programming structures to be saved to and loaded from a database. Rails used the Active Record pattern, where it would figure out the structures and fields of programming classes by looking at database tables and their structures. This was in contrast to other models where objects and their types would be described in the programming language first and then applied to the database layer. The latter approach is far better for making sure that website form inputs can be validated before being saved into a database, but at the Times, we were almost always being given complete databases and racing to build applications on top of them; the database-driven informality of ActiveRecord was key for making that happen.\n\nChange Detection\n\n[Image: A chart of vote changes in a race in Virginia over the course of the night. This was another byproduct of change detection.]\n\nTheoretically, we could have just used database utilities to load the data files from the AP directly into our databases, and then they would be usable by our application. This seemed rather risky. Although we could use database transactions to roll back changes if an explicit error occurred, what if there were more subtle problems with the data that might not trigger loading errors? For instance, what if a race or result file were accidentally missing races? We would lose all record of them, and they would vanish from our results pages. Instead, we decided to load the data into staging tables, run validations on it, and then run SQL commands to copy the staging data into the production tables that served election results. To run safely, we would use database transactions to do the copy, so that any error could be rolled back cleanly. But that led to our second concern: how performant would it be to lock millions of rows for database updates every few minutes while we were trying to read from them? What if we could make the update transaction focus on just what had changed?\n\nThis led to one of the best decisions we made for the entire program: change detection. After we had validated the staging data, our code ran a change detection step that looked for which races and results were different from what was already in production. Then we ran code to update the production tables for only those that had changed. This meant the updates were remarkably fast and easier on system resources, but it also meant we now had the ability to see what changed during each load! This gave the loader a lot of new powers:\n\n\n  Conditionally rebuilding and rebaking results pages and widgets based on changes\n  Creating events for things like AP race calls/uncalls that are indicated by a field being set\n  Setting up alerts for other newsworthy change conditions like first votes recorded in a race, 100% of precincts reporting, or delegate allocation changes\n  Dynamically updating AJAX components in our admin to show changes\n  Efficiently recording a sequence of vote totals for any given race over the night\n\n\n[Image: A more refined view of the loading process as shown in the admin for the election loader. Gray meant the state file was unchanged, black meant the file changed but didn’t have changes for any races we tracked, green meant there were changes for that state, and red meant there were errors.]\n\nElection tabulation in reality is a continuous process, from results being reported to the AP all the way to the AP updating the posted files on its servers. From our perspective, though, we could think of loading as a quantized process (fetch, load, update, repeat), with each load cycle like a tick of the clock. Change detection is what let us see what changed in each tick so we could take action on it.
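\n\nAs an illustration, the heart of that diff can be sketched like this (a simplification with made-up model and column names; the real loader did this mostly through low-level SQL operations for speed):\n\n  # Collect the staging rows that differ from what is currently in production.\n  changed = []\n  StagingResult.find_each do |s|\n    prod = Result.find_by(race_id: s.race_id, candidate_id: s.candidate_id)\n    if prod.nil? || s.vote_count != prod.vote_count || s.ap_called != prod.ap_called\n      changed.push(s)\n    end\n  end\n\n  # Copy only the changed rows, keeping the transaction small and fast. The\n  # changed list doubles as the event stream: new calls, first votes, etc.\n  Result.transaction do\n    changed.each do |s|\n      Result.where(race_id: s.race_id, candidate_id: s.candidate_id)\n            .update_all(vote_count: s.vote_count, ap_called: s.ap_called)\n    end\n  end\n\n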
Multithreaded FTP Processing\n\n[Image: An early version of the load history screen showing its performance over time]\n\nI may have mentioned it a few times, but the AP election files in the 2000s were available via FTP, a file-transfer protocol that had its heyday in the 1980s. Furthermore, on a general election night there could be up to 51 states × 2 file types (races, results) = 102 files to fetch on every load so we could identify what had changed. Over a single serial connection, it could take minutes just to grab the files before we even started loading. We needed to optimize. Luckily, multithreading worked well for this situation, and I determined that the AP had a collection of multiple FTP servers handling requests. After some trial and error, I found that with three loading threads running at once, I could parallelize pulling down the files I needed without overwhelming the servers for everybody else. I also could use file metadata to determine which files had changed on the server rather than pulling all of them down (although I did still pull everything down at regular intervals). This meant that file retrieval took a matter of seconds, and I could run loads around once a minute, even on a general election night.
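\n\nHere is a minimal sketch of that fan-out using Ruby’s standard Net::FTP library and a shared work queue (the hostname, credentials and file layout are placeholders, and the real code had much more error handling):\n\n  require 'net/ftp'\n\n  # Fill a thread-safe queue with the files to fetch.\n  queue = Queue.new\n  files_to_fetch.each { |path| queue.push(path) }\n\n  # Three workers, each with its own FTP connection, drain the queue in parallel.\n  workers = 3.times.map do\n    Thread.new do\n      ftp = Net::FTP.new('ftp.example.com')\n      ftp.login('username', 'password')\n      loop do\n        path = queue.pop(true) rescue nil   # non-blocking pop; nil when the queue is empty\n        break unless path\n        ftp.getbinaryfile(path, File.join('incoming', File.basename(path)))\n      end\n      ftp.close\n    end\n  end\n  workers.each { |t| t.join }\n\n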
Because there were multiple FTP servers in use by the AP and clock skew was an issue, I decided not to look at timestamps in my loading scripts to determine which files to load (instead, it looked at size changes, etc.). This proved to be a huge mistake on the general election night in 2012, when one of the AP’s FTP servers was not being correctly updated like the others, making it appear to be a few minutes behind the other servers. The result was that the election results seemed to revert to an earlier time and then jump forward again depending on which FTP server they were retrieved from. Oops! I was able to quickly deploy a fix to look at the timestamps once I identified the problem, but it’s not enjoyable to write code when you have several senior political editors breathing down your neck!\n\nIncidentally, I created Times Haiku because of the minor depression I fell into after the election due to that bug.\n\nCustomization\n\nOne other thing we had to do early when building out our election loader was add the ability to customize various aspects of the data to match Times editorial standards. This could mean simple things like how we’d want to render the name of C. A. “Dutch” Ruppersberger, or other things like the order in which we’d want candidates to appear (especially important in races like the New Hampshire primaries, where dozens of candidates will run). We had customization tables of various types with an admin so that newsroom editors could edit them. These customizations would then be applied to the data after loading/change detection and before it was copied into the production tables.\n\nWhile I’m talking about Times style, it’s also important to note that there was a specific style guide that we coded our app to follow when baking out results. For instance, we computed and displayed vote percentages to one decimal place. In a related vein, we displayed a race as “100% reporting” only when every single precinct had reported. Otherwise it would be displayed as “&gt;99% reporting.” There are a few other rules in there that I’m probably forgetting, but it’s important to have a data style guide (like this one from ProPublica) if you plan to report on data to a nontechnical audience.\n\nBaking Pages\n\nOne of the nice things about election reporting is that while the traffic can be huge, you can show the same page to everybody. We already had experience using the reverse proxy cache Varnish to serve cached responses to users without hitting our back-end servers, but there was always the risk that our own caching servers could melt down under high load. Instead of caching on our own servers, we tried a “baking” model where we would bake out a set of static pages to Amazon S3, a service for hosting static content that supports massive numbers of simultaneous requests. All we had to do was extend the existing cycle to fetch, load, update, bake, repeat, and we could just bake things as part of loading. And, because we had change detection, the loader only had to rebake pages and elements that contained races that had changed during that specific load. This could mean a combination of saving out static items to S3 or telling Varnish to clear the cache for certain pages.\n\n[Image: An example of a ‘Jim’ on the homepage during Super Tuesday in 2012]\n\nWe used baking extensively. Not just to create whole pages on elections.nytimes.com, but also for components and widgets that appeared throughout the site. We called the ones on the homepage either “the Jim” or “the Mini-Jim” after Jim Roberts, a politics editor at the Times. In 2012, baking was also integrated with our system for pushing data to browsers. For one election, I even made a special file in Adobe Illustrator format that could be baked with data to regenerate a map that would appear in the print edition (allowing the print team to grab the current version exactly when they needed it at press time).
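\n\nThe baking step itself can be sketched in a few lines; this uses today’s aws-sdk-s3 gem for illustration (we used earlier S3 libraries at the time, and the bucket name and render_results_page helper here are made up):\n\n  require 'aws-sdk-s3'\n\n  s3 = Aws::S3::Client.new(region: 'us-east-1')\n\n  # Rebake only the pages whose races changed during this load.\n  changed_races.each do |race|\n    html = render_results_page(race)\n    s3.put_object(\n      bucket: 'elections-static',\n      key: 'results/' + race.slug + '.html',\n      body: html,\n      content_type: 'text/html'\n    )\n  end\n\n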
Race Slugs\n\nThis is one other thing that took a little while to figure out but then seemed blindingly obvious in hindsight. As I mentioned earlier, the AP had IDs for every race. However, these IDs would not necessarily be stable or available before every race, and there were even times when they reused the same race_id for two different elections. They were meant to be unique enough for joining tables in a database, but not durable enough to be used to find races. The alternative was to use a combination of other fields to find a race. For instance, we could use the fields {race_type_id: 'G', office_id: 'H', seat_number: '2', state_id: 'NY'} to find the general election race for the NY-02 House district. This gets a bit unwieldy though, and unfortunately, there are many weird exceptions out there for various states. For instance, Ohio’s primaries are technically 16 races, selecting a delegate for each of its congressional districts as well as a general at-large delegate for the entire state. The Republican primary can be found with the combination of {state_id: 'OH', office_id: 'P', race_type_id: 'R', seat_name: 'Delegate-at-Large'}, which is very different from the one for NY: {state_id: 'NY', office_id: 'P', race_type_id: 'R'}.\n\n[Image: The admin for race slug mapping]\n\nWhich brings us to the biggest problem with this approach: not only is it verbose, it’s inconsistent. Looking up races this way would mean that each coder or page designer making those requests would have to remember the proper parameters for each of the races they needed, and the chances of making mistakes would be unacceptably high. What if we could replace the nuances and exceptions of how elections are specified in the real world with a model that is consistent? Race slugs were the answer. Under this approach, we defined a naming convention for race slugs: strings that could be used to reference a single race, like oh-president-2012-primary-rep or ny-house-district-2-2012-general. We then implemented a mapping table with a corresponding admin to define how individual slugs mapped to values in the race table. At the beginning of an election night, when we loaded all the candidates and set up the races for the first time, we would run a process to map each slug to the specific race and its race_id for that election. If a slug matched nothing, or more than one race, it would raise an error, and we could check and adjust until it matched exactly one race. Then we could use the slug to find races, and map races back to the more readable slug when we needed to display internal messages for debugging.
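\n\nThe mapping step itself was straightforward; a minimal sketch with illustrative model and column names:\n\n  # Resolve one slug to exactly one AP race at the start of the night.\n  def map_slug!(mapping)\n    # mapping.ap_criteria is the hash of AP fields, e.g.\n    # {state_id: 'OH', office_id: 'P', race_type_id: 'R', seat_name: 'Delegate-at-Large'}\n    matches = Race.where(mapping.ap_criteria)\n    unless matches.count == 1\n      raise 'slug ' + mapping.slug + ' matched ' + matches.count.to_s + ' races'\n    end\n    mapping.update!(race_id: matches.first.race_id)\n  end\n\n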
Race Calls\n\nThe AP election result schema included their own race calls as part of their reporting. A race call is a newsworthy event, since it often triggers an alert on the homepage and possibly a recalculation of delegates won or electoral votes awarded. Calls are significant enough that we needed to record the moment in a persistent way as its own event. So, we made a separate calls table to record when calls happened. How did we figure that out? Change detection! If the call flag is set in the staging table but not in the production table, that means a call was made between this load and the one before, and we can record a new call. The inverse is also sometimes important, because that means the AP retracted a call, which can also require a news alert or recalculation.\n\nThe Times generally would follow the AP’s race calls, but they also wanted to be able to make their own calls independent of the AP. Using the calls table meant that for races we wanted to autocall, an AP call would create two different call records: one for the AP call and one for an NYT call. But, if we had a race set in our admin to be manually called, then the NYT call could be created at any time (the interface for this was called the Meislomatic in honor of Rich Meislin, its primary user). We also used this table structure for the general election in 2008 to record calls made by other news sources like CNN and the networks (the key was a panel of volunteers in the newsroom checking those sites for calls and entering them via a custom admin screen). We could even use these calls to compute the Electoral College counts reported by the AP and other news organizations in addition to the Times.\n\nElectionbot\n\nIn the early days of Interactive News, we were using the Campfire and Basecamp apps for coordination and communication. But in 2014, the Times switched to Slack, which created some exciting new possibilities for keeping track of races. In the early days, I sometimes would monitor election loading during off-hours or random primaries (like Samoa) by using ssh on my phone to connect to the election servers, where I could watch the logs. This was cool but a bit unwieldy, since it required keeping a connection open to the server, and it took a bit of time and effort to reconnect every time I disconnected.\n\n[Image: Early usage of the ElectionBot]\n\nOne of the nice things about Slack is that it supported something called “webhooks,” a programming convention for software to interact with other systems by making short connections to remote web URLs with data (these were callbacks, aka “hooks,” but on the web: webhooks). Slack supported two types of webhooks: incoming webhooks, where we could create an endpoint on Slack that would receive messages from the outside and post them into the channel, and outgoing webhooks, where a command prefixed with a / could trigger a request and then display the response in the channel. We used both extensively, with the election loader posting events like race calls or first votes to the Slack channel (using the race slugs as identifiers), and commands like /poll closings returning a nicely displayed list of poll closings for the night. For more details and examples of the code, see my 2015 article Thank You, Electionbot. These days, this kind of programmatic integration of server infrastructure with communication tools is called “ChatOps,” but back then we just called it cool.
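\n\nPosting an event to a Slack incoming webhook takes only a few lines on the loader side; a sketch (the webhook URL is a placeholder, and announce is a made-up helper name):\n\n  require 'net/http'\n  require 'json'\n  require 'uri'\n\n  # Announce a newsworthy change, identified by its race slug, in the channel.\n  def announce(message)\n    webhook = URI('https://hooks.slack.com/services/T0000/B0000/XXXXXXXX')\n    Net::HTTP.post_form(webhook, payload: { text: message }.to_json)\n  end\n\n  announce('AP call: oh-president-2012-primary-rep')\n\n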
The Flow of a Typical Night\n\n[Image: My typical vertical monitor setup]\n\nI often described election nights as hours of tedium punctuated by moments of terror. This is a bit of an exaggeration (apart from that terrible night in 2012, everything worked smoothly), but it is true that the tension would usually ratchet up a few minutes before the polls closed and during the interval where we waited for first votes and race calls to come in for a given state, before dissipating and then ratcheting up again. It was sometimes important to set alarms just to know when the best time for a bathroom break was.\n\nWe would usually start our loader in the early afternoon to set up the races and map the slugs once the AP announced that the system was all zeroes. Then, it would be figuring out where to order dinner, watching the loader and seeing the data come in. I liked to set up at least one monitor in a vertical orientation so I could better see the admin and sometimes stack two races on top of each other. I also always wore a tie, out of some superstition.\n\nGeneral elections and big primaries like the early contests and Super Tuesday were exciting in the newsroom. It usually started with a small panic attack, because I would load the zeroes and see that there were somehow votes in there already, before remembering and cursing the residents of Dixville Notch for their early-voting gimmick. Then, it was just moving our servers, figuring out seating plans and talking about our expectations for the evening. Once the results started coming in for real, I was usually in a zone for hours, watching the code do its job and the pages fill with content. I likened it to a space launch: we would finally see how various interactives and designs unfurled with the real data of elections (as opposed to our various tests), and it was always amazing to watch the traffic numbers go up, the homepage editors tweak the layouts, and the reporters consult with editors and file their stories. This pace would usually continue until the race was called and some remaining states came in, the acceptance and concession speeches wrapped, and the paper was laid out for the presses. I often would stay on for a few more hours after that (Alaska and Hawaii didn’t close their polls until 11pm and midnight on the East Coast, respectively) just to make sure things were running, before putting the loader onto a slower cadence and catching a car service home at 3am.\n\n[Image: The web traffic we saw on Super Tuesday 2012. Luckily, we caught that cache miss problem before it became a real issue.]\n\n[Image: Electionbot overview of poll closing times]\n\nThe Meislomatic\n\nI also spent a fair amount of the night running between floors of the newsroom with my laptop. The Interactive Newsroom team and the Graphics department were located on the second floor of the building, below the main floor of the newsroom (I called it the Basement of News sometimes). But I usually spent most of the night upstairs on the third floor in the middle of the politics desk, so I could be there to support the race-calling team of Rich Meislin and Janet Elder in case they had any questions or noticed problems with the special race-calling admin I had built for them, called the Meislomatic.\n\n[Image: The 2012 presidential overview for the Meislomatic at 11:57pm]\n\nAs I mentioned earlier, the Times politics desk wanted to make its own calls for certain races rather than relying on the AP. This usually meant every state in the presidential election and primaries, as well as key congressional and gubernatorial races. To help them keep track of all that, I built a series of custom admin interfaces on top of our election loader (we ran command-line scripts to do the loading, but it was also a web app in Rails) that gave the call desk the specific information it needed about the state of each race. These screens would dynamically update their counts, with yellow flashes showing the fields that had changed (another use of change detection!). The other key part of the interface was a big red CALL RACE button that the person making the call could press once they had selected who had won that race. At Rich’s insistence, I added a dialog popup that said “Are you sure?” before the race call could be processed. At the time, I thought this was overly cautious, until the night of the 2008 election when Rich let me call the result for California. I started sweating anxiously when I thought about how this call would show up on the homepage, and how, if I messed up, it would be a minor story in Gawker or the Observer titled “Times Flubs California” or something. He was right. It needed that popup.\n\nWhat Next?\n\nI left the Times in May 2015. After nine years in journalism, I was ready for something different, and I went to 18F to start my career in Civic Tech. 
By that point, the state of technology used by the AP had also progressed. The FTP servers were soon to be deprecated and replaced by a more modern API. There was increasing competition from other media organizations to provide results, and some states were improving in their ability to provide vote totals in realtime via their board of elections sites. The Times also wasn’t as interested in going it alone as it had been in the past, and it started building a new election loader as open source in collaboration with a few other online newsrooms. My system was mothballed in 2015 by its successor. That system was archived in turn in 2023. I’m not sure what they’re running now, but there is still some sort of loader pulling in election data. The code might have changed and the developers are very different from my day, but the general concepts remain the same. My loader was just the first in a sequence of technical innovations in how the Times provides election results.\n\nSome Other Images of the Meislomatic\n\n\n  [Image: 2012 Meislomatic primaries screen]\n  [Image: 2012 Meislomatic governors races screen. You can see the yellow highlights used to indicate updating data.]\n  [Image: Vertical monitor view of the Meislomatic showing data filling in for Michigan.]\n  [Image: Detail of a race on the Meislomatic for a presidential primary in 2012]\n  [Image: 2012 Meislomatic senate races screen. You can see the yellow highlights used to indicate updating data.]\n  [Image: A longer capture of the terminal output for the election loader.]\n  [Image: A longer capture of the system status screen in the admin]\n  [Image: Loading errors showing up in the admin. I told you the 2012 general election was a bad night.]\n  [Image: A Meislomatic screen during the 1/31/2012 South Carolina primary for the GOP]\n  [Image: A Meislomatic screen during a 2010 election for Ohio Senate]\n  [Image: Kicking back at the end of election night 2014 (what turned out to be the last run of the loader)]\n  [Image: The popup asking if you’re sure you want to call a race]\n\n\nEvery State is Weird\n\nFinally, here are the slides for a lightning talk I gave on all the special peculiarities of how each state conducts elections, titled “Every State is Weird”. Enjoy!"
        },
        {
          "id": "projects-nytimes-twitter",
          "title": "@nytimes on Twitter",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/nytimes-twitter",
          "content": "This whole thing started out of spite.\n\nIn 2007, I started working at the New York Times on the digital side of operations. This was before the very new skyscraper/climbing gym had been completed, so our offices were even located in a separate building from the original New York Times’ headquarters, physically reflecting our distance from the editorial side of operations. In any event, I recall reading a piece of reporting somewhere that the NYT R&amp;D team had built a system that could text a Times article to a person’s phone.\n\nWell, that’s nothing fancy, I could do that, I thought to myself. I’ve since met them multiple times - and I will emphasize they have been lovely people - but there is a large part of me that hates the concept of a distinct R&amp;D Lab in any organization because it implies there is only a specific group allowed to innovate. Yes, it’s petty and probably psychologically revealing, but this feeling made me turn to Twitter (where I already had a personal account) and register @nytimes. Then, all I had to do was write a script to do the following:\n\n  read an RSS feed from the homepage, take all the URLs and run them through a URL-shortening service1 to make tweets\n  put those into a database to keep track of what articles I’ve seen and don’t need to rescan, tweets to send (and details of those sent)\n  Look for tweets that haven’t been posted yet and post 1 or 2 of them at a time.\n\n\nThen, I just needed to edit my account preferences to receive those tweets as text messages and I had the NYT as messages on my phone. Voila! If you want even more technical details on the initial process, I also wrote it up in an early blog post of TimesOpen praising Twitter as “the right kind of stupid”.\n\nThe @nytimes bot officially launched on March 6, 2007\n\nWord Up!\nAfter a few weeks, I decided it would be fun to supplement the regular homepage feed by adding specialized accounts like @nyt_science, @nyt_books, etc. that used the RSS feeds from specific sections. This required some architectural revision to my original code to handle multiple accounts with different posting backlogs. I also marked the occasion with a famously silly tweet:\n\n\n    \n        \n            \n            Word up! It is I, the Gray Lady\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nAs I described this in an 10-year anniversary interview, the fun was “imagining what The New York Times would say if it were trying to be cool.”\n\n\n  “It was very important to me when I was writing that tweet that even though the metaphorical Gray Lady would try to use slang, it was still very proper grammar,” he said. “‘It is I’ versus ‘It’s me.’ It’s like the Queen trying to use slang. It had to be that combination of fusty and fashionable.”\n\n\nOver the years, it’s been pretty fun to see how people have stumbled upon this tweet and wondered what had happened, but at the time only a few dozen accounts would have seen it. For the first year, the @nytimes bot was a silly hack project I just kept running, with its early user base being a weird mix of news nerds and regular nerds. In October 2017, the account finally hit 1000 followers, which felt like a huge deal at the time (but is comical considering the current follower count: 55 million)\n\nBecoming a Product\nI liked to joke that I knew the twitter feed became a product the weekend a cleaning person accidentally unplugged the machine under my desk where it was running. 
Then, I just needed to edit my account preferences to receive those tweets as text messages, and I had the NYT as messages on my phone. Voila! If you want even more technical details on the initial process, I also wrote it up in an early blog post on TimesOpen praising Twitter as “the right kind of stupid.”\n\nThe @nytimes bot officially launched on March 6, 2007.\n\nWord Up!\n\nAfter a few weeks, I decided it would be fun to supplement the regular homepage feed by adding specialized accounts like @nyt_science, @nyt_books, etc. that used the RSS feeds from specific sections. This required some architectural revision of my original code to handle multiple accounts with different posting backlogs. I also marked the occasion with a famously silly tweet:\n\n[Image: Word up! It is I, the Gray Lady]\n\nAs I described it in a 10-year anniversary interview, the fun was “imagining what The New York Times would say if it were trying to be cool.”\n\n\n  “It was very important to me when I was writing that tweet that even though the metaphorical Gray Lady would try to use slang, it was still very proper grammar,” he said. “‘It is I’ versus ‘It’s me.’ It’s like the Queen trying to use slang. It had to be that combination of fusty and fashionable.”\n\n\nOver the years, it’s been pretty fun to see how people have stumbled upon this tweet and wondered what had happened, but at the time only a few dozen accounts would have seen it. For the first year, the @nytimes bot was a silly hack project I just kept running, with its early user base a weird mix of news nerds and regular nerds. In October 2007, the account finally hit 1,000 followers, which felt like a huge deal at the time (but is comical considering the current follower count: 55 million).\n\nBecoming a Product\n\nI liked to joke that I knew the Twitter feed had become a product the weekend a cleaning person accidentally unplugged the machine under my desk where it was running. That morning, I rolled into work to be greeted with an email thread of people wondering which team was supporting this product, was it even an official New York Times product, do we need to get legal involved… I had to come clean and start supporting it as an official product.\n\nI also started giving presentations to the New York Times newsroom staff explaining what Twitter was and why they should consider joining it and using it for news. I especially enjoyed how I designed them to flow. Here is the third iteration, from 2009:\n\n  Slides\n  Movie\n\n\nAround this time, Twitter also rolled out a feature called the Suggested User List, where new users could follow a group of selected accounts in one click to get an idea of what Twitter was like. They included @nytimes on the list and, as a result, the follower count for this sleepy little account started increasing exponentially. What had once been a project for dozens of subscribers was now being followed by millions. Along the way, we streamlined and improved some things. The service got an actual hand-coded web admin. I moved it off my computer and into the cloud. The NYT paid for Bitly Pro so we could have better short URLs. And the number of subscribers continued to climb.\n\nDuring the height of all this, I was invited to give a talk at Twitter’s Chirp conference in 2010. The whole conference was a showcase to unveil a bunch of cool new features for developers that the company promptly reversed course on and buried unceremoniously after Evan Williams was forced out as CEO. I gave a little talk about @nytimes and how many times a NYT story is tweeted per second. I remember that for some reason the musician will.i.am was there. I also remember looking out into the audience during my section and realizing I had bored will.i.am to sleep. Good talk!\n\nOther Bots\n\nThe @nytimes account wasn’t the only Twitter bot I wrote over the ensuing years. There was something appealing about how easy it was to create silly little scripts that would feed in some content and post it to Twitter. Early in my experimentation, I made a bot that would post weather messages and change its icon based on the current weather.\n\nAfter the rise in popularity of the @horse_ebooks[2] account, I made my own version that would try to generate similar content from New York Times articles. The outcome was generally more weird than funny, but it did get a write-up from Nieman Lab.\n\nAlso, the @nytimes_ebooks code became the basis of my most famous bot, Times Haiku, details of which can be found on its own project page.\n\nHanding Over the Keys\n\nEventually, the New York Times built out an entire social media team to handle strategy and content across multiple social media properties. Third-party tools like Buffer gave organizations the ability to post content and track metrics easily. Hand-crafted social media content was seen as both more relatable and more effective, with a higher click-through rate. Content could also be scheduled to post and repeat during the hours when most people were using Twitter, rather than the middle of the night when many news stories were still published (due to press times). It was time to move on. I handed over the passwords[3] to the accounts into the capable hands of the NYT social media editors.\n\n\n  1. Originally, I used shurl, because it was slightly shorter than tinyurl (this was in the days before bit.ly). 
Ironically, shurl went out of business and its domain was purchased by a porn site, so I think a lot of those early tweets now link to porn. Deep sigh… &#8617;\n  2. I’m still so mad that this account turned out to be two dudes pretending to be a program. I spent days trying to figure out how they got their output so perfectly random, and it turns out I was trying to reverse-engineer a Mechanical Turk. &#8617;\n  3. In the days before OAuth and MFA, I was worried about someone getting access to the account and spamming all our followers. I think the main account password was a 45-character random string generated by 1Password. I didn’t know it. Nobody did! &#8617;"
        },
        {
          "id": "projects-open-source",
          "title": "Open Source Projects",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/open-source",
          "content": "I have contributed to Open Source software in a variety of places. It’s not as substantial as I would like (sadly, some of my most best coding work has been for proprietary codebases), but it’s been enough to make me happy. This page documents a few of the more prominent projects and work.\n\nTimesOpen\nEarly on at the New York Times, I was a cofounder of TimesOpen, a blog for announcing the various open-source efforts at the the newspaper. It was a new direction for the Times, both contributing to open-source but also opening up new APIs in the hopes of better engaging with the developer community. In the beginning, the focus was largely on DBSlayer, a connection-pooling layer for databases that could be used for different languages. I also wrote up a fair number of the early posts on the blog, where among other things I talked about the @nytimes twitter account and how we used Varnish to cache our service. We also used Open to launch various new APIs from the Times. We also organized a few events for developers at the Times building, like this one on the Real-Time Web in 2010.\n\n\n    \n        \n            \n            The NYT Open logo features a finite-state-machine-like diagram\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nI was particularly proud of our logo, inspired by a finite-state machine diagram I had in the Red Dragon compiler book I kept at my desk. I also am not sure who thought of the tagline “All The Code That’s Fit to Print,” but I recall the argument about making it Printf instead before we decided that was too C-coded.\n\nToday, the Open blog continues strong, now known as NYT Open, and it’s expanded in scope to include design, interviews with staffers, and deep dives into how the Times uses modern technologies like React Testing or Kafka or how to design infrastructure for resilience on election nights\n\n18F Projects\n\nIn 2015, I left the Times to go work for 18F, a consulting entity that operated from within the US government at the GSA. Besides all the wonderful people and the compelling mission, one of the really appealing aspects of working for 18F was its widespread use of open-source. Since we were government employees, every line of code we wrote belonged in the public domain, and creating open-source repositories were the first step for any project. It was the first time I had worked on a project that was open-source from the start, and I learned a lot from the process and really appreciated working in the open.\n\nSome notable projects I worked on while at 18F:\n\n\n  MyUSA: I joined this project while it was in progress and helped build out the user interface. MyUSA was a prototype system for single-sign-on that allowed users to sign in and control what information they share with various government websites.\n  Micro-purchase The premise of the micro-purchase experiment was radical: government employees should be able to commission custom software development with the same ease as they can buy office supplies. The initial experiment was built in Google Docs; I helped create a robust web application in Ruby on Rails to successfully run all other auctions.\n  Connect_VBMS - a Ruby gem for connecting to the VBMS system within the VA. 
Funnily enough, colleagues of mine at Nava PBC wound up using this code years later and realized I had contributed to it when they looked at the commit history.\n  FBI Crime Data Explorer: I am extremely interested in open data; when I learned that 18F would be building an interface for crime data from the FBI, I asked to be part of the project, especially since it also meant learning Python, a language I did not know that well. I worked closely with another developer on the backend, building and optimizing an API used by the visual explorer website.\n  Confidential Survey: As part of my involvement with the Diversity Guild and a project to gather statistics on 18F’s efforts at diversity and inclusion, I built a prototype for conducting surveys without collecting detailed records that could compromise a user’s privacy.\n\n\nPersonal Projects\n\nI also have done a few different serious and silly projects over the past few years on my own GitHub account. Many of these are abandonware and reflect technologies and interests from over a decade ago, but I’m sharing them just for fun:\n\n\n  harrisj.github.io - the Jekyll site for this site! (2017 - present)\n  Trump Data - a collection of hand-curated datasets related to the second Trump presidency (2025)\n  Food Recalls Actions - a rework of the food recalls scraper to use make and GitHub Actions (2022 - present)\n  NYT Haiku Python - a rework of the original Ruby NYT haiku code, now in Python (and with a few more improvements), to run on Twitter (2020 - 2022)\n  Luigi Scraper Demo - using Luigi to orchestrate a scraper (2017)\n  Haiku Elm - a haiku validator written in Elm to help me learn the language (2017)\n  Food Recalls - the original food-recalls scraper, as described in this article for OpenNews Source (2015)\n  qrencoder - a Ruby gem for making QR codes before they were cool (2011 - 2012)\n  airport_scraper - some Ruby code to extract airport info from freeform text. I built it and used it in a personal art project that looked for people posting about their travels on Twitter (2009 - 2010)\n  tweetftp - an April Fools’ joke where I implemented a system for sending a file via Twitter as a series of small tweets (2010)\n  lifeline - a cron-based approach for launching and keeping daemons alive that we used on some of our servers for Interactive Newsroom Technologies (2010)"
        },
        {
          "id": "projects-sky-gradients",
          "title": "Sky Gradients",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/sky-gradients",
          "content": "One of my personal little hobbies has been to take a photo of the clear blue sky and post it to Instagram with no other context or explanation. I find it soothing, but I have fallen out of using it since I no longer use Instagram. So, I decided to instead collect all of the images here! Click on any image to view it in its original size and scale. All images are in the public domain Creative Commons Zero; it’s the sky, I don’t own it!\n\nI also wrote an article in The Atlantic explaining the whole thing with some possible artistic antecedents as well. There also is a zip archive (8.2MB) if you want to use them for your own projects. If you make something cool, please let me know!\n\n2014\n\n\n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-06-26 Quebec City, QC\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-07-05 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-07-26 Rockville, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2014-08-06 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-07 Hyannis, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-10 Hyannis, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-10 Yarmouth, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-14 Stellwagen Bank\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-19 Hyannis, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2014-08-26 Silver Spring, MD\n                  \n              \n              \n                close\n              
\n  Summer Gradient 2014-08-29 Takoma Park, MD\n  Summer Gradient 2014-09-01 Takoma Park, MD\n  Autumn Gradient 2014-09-09 Takoma Park, MD\n  Autumn Gradient 2014-09-18 Takoma Park, MD\n  Autumn Gradient 2014-09-22 Takoma Park, MD\n  Autumn Gradient 2014-09-27 Key Biscayne, FL\n  Autumn Gradient 2014-09-29 Miami, FL\n  Autumn Gradient 2014-10-04 Silver Spring, MD\n  Autumn Gradient 2014-10-19 Elk Neck State Park, MD\n  Autumn Gradient 2014-11-10 Takoma Park, MD\n  Autumn Gradient 2014-11-21 Takoma Park, MD\n  Autumn Gradient 2014-11-30 Sugarloaf Mountain, MD\n  Autumn Gradient 2014-12-10 Takoma Park, MD\n  Autumn Gradient 2014-12-15 Takoma Park, MD\n\n2015\n\n\n  Winter Gradient 2015-01-10 Chillum, MD\n  Winter Gradient 2015-01-15 Takoma Park, MD\n  Winter Gradient 2015-02-20 Takoma Park, MD\n  Winter Gradient 2015-03-15 Takoma Park, MD\n  Spring Gradient 2015-04-16 Takoma Park, MD\n  Spring Gradient 2015-04-21 Takoma Park, MD\n  Spring Gradient 2015-04-29 Manhattan, NY\n  Spring Gradient 2015-05-14 Washington, DC\n  Summer Gradient 2015-09-11 Rockville, MD\n  Autumn Gradient 2015-10-06 Takoma Park, MD\n  Autumn Gradient 2015-10-11 Wheaton, MD\n  Autumn Gradient 2015-10-18 Waldorf, MD\n  Autumn Gradient 2015-10-20 Takoma Park, MD\n  Autumn Gradient 2015-10-30 Takoma Park, MD\n  Autumn Gradient 2015-11-04 Shepherd Park, DC\n  Autumn Gradient 2015-11-24 Takoma Park, MD\n\n2016\n\n\n  Winter Gradient 2016-01-02 Hillandale, MD\n  Winter Gradient 2016-01-24 Takoma Park, MD\n  Winter Gradient 2016-02-02 Takoma Park, MD\n  Winter Gradient 2016-02-05 Rockville, MD\n  Winter Gradient 2016-02-14 Takoma Park, MD\n  Winter Gradient 2016-02-16 Takoma Park, MD\n  Winter Gradient 2016-02-28 Wheaton, MD\n  Spring Gradient 2016-03-01 Rockville, MD\n  Winter Gradient 2016-03-09 Denver, CO\n  Winter Gradient 2016-03-11 Denver, CO\n  Winter Gradient 2016-03-16 Takoma Park, MD\n  Winter Gradient 2016-03-18 Rockville, MD\n  Spring Gradient 2016-03-24 Takoma Park, MD\n  Spring Gradient 2016-03-26 Lancaster, PA\n  Spring Gradient 2016-03-29 Takoma Park, MD\n  Spring Gradient 2016-04-02 Governors Run, MD\n  Spring Gradient 2016-04-03 Governors Run, MD\n  Spring Gradient 2016-04-05 Rockville, MD\n  Spring Gradient 2016-04-11 Washington, DC\n  Spring Gradient 2016-04-12 Takoma Park, MD\n  Spring Gradient 2016-04-16 Ashburn, VA\n  Spring Gradient 2016-04-24 Silver Spring, MD\n  Spring Gradient 2016-05-02 Takoma Park, MD\n  Spring Gradient 2016-05-07 Cleveland Park, DC\n  Spring Gradient 2016-05-15 Wheaton, MD\n  Spring Gradient 2016-05-20 Takoma Park, MD\n  Spring Gradient 2016-05-24 Takoma Park, MD\n  Spring Gradient 2016-06-01 Takoma Park, MD\n  Spring Gradient 2016-06-06 Farragut Square, DC\n  Spring Gradient 2016-06-08 Takoma Park, MD\n  Spring Gradient 2016-06-17 Takoma Park, MD\n  Summer Gradient 2016-06-22 Columbia, MD\n  Summer Gradient 2016-07-09 Alexandria, VA\n  Summer Gradient 2016-07-23 National Mall, DC\n  Summer Gradient 2016-07-29 Fenwick Island, DE\n  Summer Gradient 2016-08-03 Silver Spring, MD\n  Summer Gradient 2016-08-07 Silver Spring, MD\n  Summer Gradient 2016-08-10 Ballard, WA\n  Summer Gradient 2016-08-11 Paradise, WA\n  Summer Gradient 2016-08-22 Silver Spring, MD\n  Summer Gradient 2016-08-27 Swampoodle, DC\n  Summer Gradient 2016-09-06 Takoma Park, MD\n  Summer Gradient 2016-09-12 Wheaton, MD\n  Autumn Gradient 2016-09-22 Takoma Park, MD\n  Autumn Gradient 2016-10-09 Silver Spring, MD\n  Autumn Gradient 2016-10-15 Bethesda, MD\n  Autumn Gradient 2016-10-28 Takoma Park, MD\n  Autumn Gradient 2016-11-11 Takoma Park, MD\n  Autumn Gradient 2016-12-01 Takoma Park, MD\n  Autumn Gradient 2016-12-12 Takoma Park, MD\n  Winter Gradient 2016-12-22 Takoma Park, MD\n  Winter Gradient 2016-12-27 Trumbull, CT\n\n2017\n\n\n  Winter Gradient 2017-01-01 Bennington, VT\n  Winter Gradient 2017-01-08 Takoma Park, MD\n  Winter Gradient 2017-01-30 Takoma Park, MD\n  Winter Gradient 2017-02-13 Takoma Park, MD\n  Winter Gradient 2017-02-20 McHenry, MD\n  Winter Gradient 2017-02-23 Takoma Park, MD\n  Winter Gradient 2017-03-02 Takoma Park, MD\n  Winter Gradient 2017-03-08 Kensington, MD\n  Spring Gradient 2017-03-23 Takoma Park, MD\n  Spring Gradient 2017-04-17 Takoma Park, MD\n  Spring Gradient 2017-04-18 College Park, MD\n  Spring Gradient 2017-05-03 Takoma Park, MD\n  Spring Gradient 2017-05-09 College Park, MD\n  Spring Gradient 2017-05-15 Foggy Bottom, DC\n  Spring Gradient 2017-05-19 Midtown NYC\n  Spring Gradient 2017-06-01 Silver Spring, MD\n  Spring Gradient 2017-06-03 Silver Spring, MD\n  Spring Gradient 2017-06-10 Takoma Park, MD\n  Summer Gradient 2017-06-21 Takoma Park, MD\n  Summer Gradient 2017-07-02 Brewster, MA\n  Summer Gradient 2017-07-03 Centerville, MA\n  Summer Gradient 2017-07-03 Brewster, MA\n  Summer Gradient 2017-07-05 Somerville, MA\n  Summer Gradient 2017-07-09 Silver Spring, MD\n  Summer Gradient 2017-07-19 Bethesda, MD\n  Summer Gradient 2017-08-04 Minneapolis, MN\n  Summer Gradient 2017-08-09 Hampton Beach, NH\n  Summer Gradient 2017-08-13 Southwest Harbor, ME\n  Summer Gradient 2017-09-04 Takoma Park, MD\n  Summer Gradient 2017-09-08 Takoma Park, MD\n  Summer Gradient 2017-09-18 Takoma Park, MD\n  Autumn Gradient 2017-09-24 Wheaton, MD\n  Autumn Gradient 2017-09-28 Takoma Park, MD\n  Autumn Gradient 2017-10-02 Takoma Park, MD\n  Autumn Gradient 2017-10-19 Foggy Bottom, DC\n  Autumn Gradient 2017-10-31 Takoma Park, MD\n  Winter Gradient 2017-12-31 Miami, FL\n\n2018\n\n\n  Winter Gradient 2018-01-07 Wheaton, MD\n  Winter Gradient 2018-01-13 National Mall, DC\n  Winter Gradient 2018-01-26 Takoma Park, MD\n  Winter Gradient 2018-02-03 Tenleytown, DC\n  Winter Gradient 2018-02-13 Bethesda, MD\n  Winter Gradient 2018-02-27 Takoma Park, MD\n  Spring Gradient 2018-03-24 National Mall, DC\n  Spring Gradient 2018-04-12 Forest Glen, MD\n  Spring Gradient 2018-04-18 Thomas Circle, DC\n  Spring Gradient 2018-05-08 Thomas Circle, DC\n  Summer Gradient 2018-06-24 Berkeley Springs, WV\n  Summer Gradient 2018-07-06 Montreal, QC\n  Summer Gradient 2018-07-09 Tenleytown, DC\n  Summer Gradient 2018-07-28 Adelphi, MD\n  Summer Gradient 2018-08-09 Stone Harbor, NJ\n  Summer Gradient 2018-08-22 Highland Park, NJ\n  Summer Gradient 2018-08-23 Brooklyn, NY\n  Summer Gradient 2018-08-26 Silver Spring, MD\n  Summer Gradient 2018-09-02 Takoma Park, MD\n  Autumn Gradient 2018-09-26 Thomas Circle, DC\n  Autumn Gradient 2018-09-30 Rockville, MD\n  Autumn Gradient 2018-10-19 Thomas Circle, DC\n  Autumn Gradient 2018-10-23 Bethesda, MD\n  Autumn Gradient 2018-10-30 Thomas Circle, DC\n  Autumn Gradient 2018-11-11 Rockville, MD\n  Winter Gradient 2018-12-18 Thomas Circle, DC\n  Winter Gradient 2018-12-30 Pittsburgh, PA\n\n2019\n\n\n  Winter Gradient 2019-01-28 Takoma Park, MD\n  Winter Gradient 2019-02-24 Takoma Park, MD\n  Winter Gradient 2019-03-16 Cleveland Park, DC\n  Spring Gradient 2019-03-22 Takoma Park, MD\n  Spring Gradient 2019-03-26 Thomas Circle, DC\n  Spring Gradient 2019-04-10 Takoma Park, MD\n  Spring Gradient 2019-04-23 Takoma Park, MD\n  Spring Gradient 2019-05-15 Ellicott City, MD\n  Spring Gradient 2019-05-25 Southampton, Bermuda\n  Spring Gradient 2019-05-31 Takoma Park, MD\n  Spring Gradient 2019-06-04 McPherson Square, DC\n  Spring Gradient 2019-06-13 Seville, ES\n  Spring Gradient 2019-06-17 Barcelona, ES\n  Spring Gradient 2019-06-19 Madrid, ES\n  Summer Gradient 2019-06-21 Silver Spring, MD\n  Summer Gradient 2019-06-26 Logan Circle, DC\n  Summer Gradient 2019-06-30 Laytonsville, MD\n  Summer Gradient 2019-07-13 Cary, NC\n  Summer Gradient 2019-07-29 Silver Spring, MD\n  Summer Gradient 2019-08-10 Silver Spring, MD\n  Summer Gradient 2019-08-30 Bethany Beach, DE\n  Summer Gradient 2019-09-03 Takoma Park, MD\n  Summer Gradient 2019-09-20 Metro Center, DC\n  Autumn Gradient 2019-09-24 Takoma Park, MD\n  Autumn Gradient 2019-09-27 Brunswick, ME\n  Autumn Gradient 2019-10-15 College Park, MD\n  Autumn Gradient 2019-11-03 Poolesville, MD\n\n2020\n\n\n  Winter Gradient 2020-01-12 Rockville, MD\n  Winter Gradient 2020-02-02 Takoma Park, MD\n  Winter Gradient 2020-02-09 Takoma Park, MD\n  Winter Gradient 2020-03-04 Thomas Circle, DC\n  Spring Gradient 2020-03-26 Hyattsville, MD\n  Spring Gradient 2020-04-11 Takoma Park, MD\n  Spring Gradient 2020-05-04 Takoma Park, MD\n  Spring Gradient 2020-05-15 Takoma Park, MD\n  Summer Gradient 2020-07-02 Bethesda, MD\n  Winter Gradient 2020-07-08 College Park, MD\n  Summer Gradient 2020-07-13 Takoma Park, MD\n  Summer Gradient 2020-07-18 Takoma Park, MD\n  Summer Gradient 2020-07-29 Takoma Park, MD\n  Summer Gradient 2020-08-04 South Dennis, MA\n  Summer Gradient 2020-08-08 Beaver Cove, ME\n  Summer Gradient 2020-08-18 Takoma Park, MD\n  Summer Gradient 2020-09-19 Takoma Park, MD\n  Autumn Gradient 2020-09-30 Takoma Park, MD\n  Autumn Gradient 2020-11-02 Silver Spring, MD\n  Autumn Gradient 2020-11-07 Cleveland Park, DC\n  Winter Gradient 2020-12-29 Takoma Park, MD\n\n2021\n\n\n  Winter Gradient 2021-01-09 Tenleytown, DC\n  Spring Gradient 2021-03-20 Shaw, DC\n  Spring Gradient 2021-05-01 Hyattsville, MD\n  Spring Gradient 2021-06-18 Asheville, NC
    \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-06-22 Nashville, TN\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-07-13 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-07-30 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-08-11 Brewster, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-09-02 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2021-09-19 Williamsburg, VA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2021-09-24 Foggy Bottom DC\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2021-10-21 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2021-11-19 Metro Center, DC\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2021-12-12 Kennedy Center, DC\n                  \n              \n              \n                close\n              \n            \n        \n      \n  \n\n2022\n\n\n      \n        \n            \n            \n              \n                  \n                      \n                      Winter Gradient 2022-01-14 Treme, LA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Winter Gradient 2022-02-28 Takoma\n                  \n              \n              \n                close\n              \n            \n        \n 
     \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2022-08-19 Takoma\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2022-11-10 Bethesda, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2022-11-26 Brookline, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2022-12-01 Silver Spring, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Winter Gradient 2022-12-27 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n  \n\n2023\n\n\n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2023-07-05 Dennis, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2023-08-17 Prague, CZ\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2023-10-26 Takoma Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2023-11-23 Boston, MA\n                  \n              \n              \n                close\n              \n            \n        \n      \n        \n            \n            \n              \n                  \n                      \n                      Autumn Gradient 2023-12-15 College Park, MD\n                  \n              \n              \n                close\n              \n            \n        \n      \n  \n\n2024\n\n\n      \n        \n            \n            \n              \n                  \n                      \n                      Summer Gradient 2024-08-05 Cape May, NJ\n                  \n              \n              \n                close"
        },
        {
          "id": "projects-times-haiku",
          "title": "Times Haiku",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/times-haiku",
          "content": "This project largely happened because I was depressed and bored.\n\nIt was always a bit rough to end a season of election coverage. After spending almost more than a year testing, tweaking and loading election results, it’s disorienting to suddenly have no more election stuff to do. In 2012, this shock was compounded with frustration from several significant technical bugs that had happened on election night itself. I needed an outlet. And that’s when I decided to go hunting for haiku.\n\nI had been inspired by the Haiku Leaks project in late 2010, that combed the Wikileaks Cablegate corpus to find haiku hidden within the messages. And I had already written a few silly bots that acted on NYT article text. So, why not combine the two?\n\nThe initial seed of this project was a hackish selection of a few Ruby scripts and a database to assemble the following building blocks into a haiku-finding machine:\n\n  First, I needed a database to store the haiku I found, but also to help me lookup the syllable counts for specific words. To seed that database, I wrote a script to pull words and count syllables in the CMUdict\n  I also had some handy code to pull out the text after retrieving a NYT article page.\n  The next step was writing some code to scan the text for haikus. It did this first by breaking up the text into a series of sentences and then it went through each sentence counting syllables to see if we got exactly 5-7-5.\n  Finally, I created some code to regularly check the homepage and when it found a URL it hadn’t seen before, it would fetch the article, scan it for haikus and store any haiku it had seen as well as a record that it had processed the page.\n\n\nPut that all togther and I had a little program I could run on my computer that would go searching for haiku and sharing them with me. When I showed this to some of my coworkers on the Interactive Newsroom Technologies team, they agreed it could be a fun project to create. To implement this, we tried a novel moderation approach where the Haiku bot would post haiku it found to a private moderation Tumblr blog. When moderators approved, they would the be published to a public tumblr. The design for the site was done by Heena Ko and the distinctive procedurally generated format for each haiku was built by Anjali Bhojani.\n\n\n    \n        \n            \n            An example of the visual Times Haiku presentation\n\n        \n    \n    \n        close\n    \n\n\n    \n\n\nExplainer that Appeared on the Times Haiku Site\nWhimsy is not a quality we usually associate with computer programs. We tend to think of software in terms of the function it fulfills. For example, a spreadsheet helps us do our work. A game of Tetris provides a means of procrastination. Social media reconnects us with our high school nemeses. But what about computer code that serves no inherent purpose in itself?\n\n\nThere is pleasure to\nbe had here, in flares of spice\nthat revive and warm.\n\n\nThis is a Tumblr blog of haikus found within The New York Times. Most of us first encountered haikus in a grade school, when we were taught that they are three-line poems with five syllables on the first line, seven on the second and five on the third. According to the Haiku Society of America, that is not an ironclad rule. 
\nPut that all together and I had a little program I could run on my computer that would go searching for haiku and share them with me. When I showed this to some of my coworkers on the Interactive Newsroom Technologies team, they agreed it could be a fun project to build for real. To implement it, we tried a novel moderation approach where the haiku bot would post the haiku it found to a private moderation Tumblr blog. When moderators approved them, they would then be published to a public Tumblr. The design for the site was done by Heena Ko and the distinctive procedurally generated format for each haiku was built by Anjali Bhojani.\n\nAn example of the visual Times Haiku presentation\n\nExplainer that Appeared on the Times Haiku Site\nWhimsy is not a quality we usually associate with computer programs. We tend to think of software in terms of the function it fulfills. For example, a spreadsheet helps us do our work. A game of Tetris provides a means of procrastination. Social media reconnects us with our high school nemeses. But what about computer code that serves no inherent purpose in itself?\n\nThere is pleasure to\nbe had here, in flares of spice\nthat revive and warm.\n\nThis is a Tumblr blog of haikus found within The New York Times. Most of us first encountered haikus in grade school, when we were taught that they are three-line poems with five syllables on the first line, seven on the second and five on the third. According to the Haiku Society of America, that is not an ironclad rule. A proper haiku should also contain a word that indicates the season, or “kigo,” as well as a juxtaposition of verbal imagery, known as “kireji.” That’s a lot harder to teach an algorithm, though, so we just count syllables like most amateur haiku aficionados do.\n\nAs dawn broke we warmed\nstrawberry Pop Tarts over\nthe dying embers.\n\nHow does our algorithm work? It periodically checks the New York Times home page for newly published articles. Then it scans each sentence looking for potential haikus by using an electronic dictionary containing syllable counts. We started with a basic rhyming lexicon, but over time we’ve added syllable counts for words like “Rihanna” or “terroir” to keep pace with the broad vocabulary of The Times.\n\nNot every haiku our computer finds is a good one. The algorithm discards some potential poems if they are awkwardly constructed and it does not scan articles covering sensitive topics. Furthermore, the machine has no aesthetic sense. It can’t distinguish between an elegant verse and a plodding one. But, when it does stumble across something beautiful or funny or just a gem of a haiku, human journalists select it and post it on this blog.\n\nStop the machine and\nscrape down the sides of the bowl\nwith a spatula.\n\nFinding the haikus is only the beginning. Because we want the poems to retain their visual integrity, even when people share them across social networks, we post them as images instead of text. On every image, you’ll notice a seemingly random background pattern of colored lines. The different orientations of those lines are computer-generated according to the meter of the first line of the poem.\n\nSo, what’s next? This experiment in automated poetry detection has only just begun. We’ll fine-tune the algorithm, expand the dictionary and see what treasures we find. We hope you’ll follow along.\n\nLaunch and Reception\nAfter spending a few weeks refining the process, we let the haiku generator run for a while so we could evaluate the bot and have a collection of dozens of haiku when the site opened. Since April is National Poetry Month, we picked April 1st, 2013 as the day to go live.\n\nIn hindsight, this was not a good idea.\n\nAlthough much of the coverage was appreciative, many also wondered if this was an elaborate April Fools’ joke. Oops. Others were just confused about why we had spent time building this. Every few months, I would also receive an irate email from an American haiku poet who wanted to inform me that the syllable count is not what defines a haiku (I know), and that to truly be called that, these would need to include both a nature theme and a thematic cut in the middle (I know). Honestly, they probably aren’t even good senryū (yes, I KNOW, it says all that in the intro!).\n\nAfter the initial hubbub died down, it just became a little part of our days: making sure the bot was running, moderating the best ones to publish on a timed cycle and seeing how people reacted on Tumblr and the Twitter account where they were reposted. I especially enjoyed seeing Times reporters sometimes retweeting haiku they didn’t realize they had written. Over the years, I would fix little bugs in things like sentence identification and add syllable counts for unknown words. I also added checks to avoid sensitive stories or topics.\n\nI left the New York Times in 2015, but the site itself continued operating until December 19, 2017 - over four-and-a-half years. Over that time it posted a lot of haiku (it looks like I need to pull them all down, but here is data from the first year of operation).\n\nLater Years\nI have continued to enjoy playing with the concept; often, when I am trying to learn a new programming language (like Elm or Clojure), I will try writing a haiku finder/validator in it. More recently, I revived the haiku bot with a new open-source version written in Python that posted haiku to the @nythaikus Twitter account from September 14, 2020 (you can guess what global event was making me depressed for this one) until November 18, 2022 (when Elon Musk started charging bots for API access).\n\nAh well, that was fun.\nToo bad Elon has to ruin\nall that he touches"
        },
        {
          "id": "projects-trump-data",
          "title": "DOGE Track",
          "collection": {
            "label": "projects",
            "name": "Projects"
          },
          "categories": "project",
          "tags": "",
          "url": "/projects/trump-data",
          "content": "This is another project that was born out of anger.\n\nAs I write this in late October 2025, we are now 9 months into the second Trump presidency. It’s been hard keeping track of all that has been damaged and destroyed within the federal government. Emboldened by Musk and the absence of oversight, the so-called “Department of Government Efficiency” (DOGE) went rampaging through agencies to subvert their security, cancel contracts, fire staff and siphon up confidential data into large data warehouses. Some of this was motivated by Silicon Valley’s empty libertarian platitudes about disruption and efficiency. Much of this was to be the point of the spear for Russell Vought’s plan to subvert the Constitution.\n\nAs someone who has spent the past decade of my live in Civic Tech, this has been extremely demoralize to watch. Not only are they destroying vital government services, they’re undermining the idea that technology can serve the public good. I feel compelled to bear witness to this moment, but ‘m not particularly good at writing commentary. I’m no longer adjacent enough to journalism that I can report on what is happening. However, I do enjoy working with data and seeing what patterns will emerge over time from data collection and analysis.\n\nAnd so, I created a new GitHub repo named trump_data on February 8th, 2025. And then, I started collecting data. The first datasets were relatively modest in scope:\n\n\n  I wrote a script for pulling data from the Just Security Litigation Tracker to see how cases changed over time\n  I created a dataset for tracking Trump’s trips to one of his various properties and what days he went golfing. This turned out to be easier to just hand-edit rather than write a scraper for it. More recently, John Emerson contributed an automated scraper to find the golf dates so I don’t need to update those manually\n  I added a table of CSV data recording the population at the concentration camp in Guantanamo Bay that Trump promised to create for immigration detention\n  I started collecting the weekly unemployment reports to see if it would start showing a surge in unemployment for federal employees\n\n\nMy biggest project within the repository has turned out to be an evolving effort to track the activities of DOGE’s “IT Modernization” efforts across the federal government. Practically every single example of DOGE’s smash-and-grab assaults on a given federal agency starts with the lie that they’re just there to help the agency with “IT modernization” before they quickly escalate their privileges, lock out staff, cancel contracts and fire much of the staff. 
I was tired of how much they operated in the shadows, so I started collecting data on their activities from news sources.\n\nEvolving the IT Modernization Dataset\n\nIt started simply enough as a single YAML file, with the following basic structure:\n\n- agency: Centers for Medicare and Medicaid Services\n  acronym: CMS\n  date_started: 2025-02-05\n  date_completed:\n  participants:\n  - Luke Farritor\n  vandalism:\n  systems:\n  - name: CMS Acquisition Lifecycle Management system\n    acronym: CALM\n    description: System for tracking CMS acquisitions, contracts, milestones and audits.\n  sources:\n  - https://www.msn.com/en-us/money/general/doge-targets-u-s-health-agencies-gains-access-to-payment-systems/ar-AA1yu5OD\n  - https://www.cms.gov/newsroom/press-releases/cms-statement-collaboration-doge\n  - https://www.wsj.com/politics/elon-musk-doge-medicare-medicaid-fraud-e697b162\n\nI just wanted to track who was at each agency and what was happening. I chose YAML because it is a data format designed for machine processing that remains somewhat readable for non-technical people who want to read the data directly. From there, I have kept evolving both the types of data I’m collecting and the systems for keeping track of it all through a series of iterations:\n\n\n  I added an events field where I started listing the dates and details of specific events, always with a linked source for attribution and reconstruction\n  I added a named field to the event structure to record when specific DOGE staff were associated with events\n  I added a roundups section for listing when news sites published roundups of who is where. I also started recording more info for systems.\n  To make it more accessible to non-programmers, I built a script to create a CSV version of events that could be loaded in Excel\n  While working on that, I discovered that I had made some formatting errors in the YAML file and was sometimes inconsistent with field names. So, I created a JSON Schema file to validate the YAML so my editor could tell me when I was introducing errors (a taste of what that schema looks like follows this list)\n  I then extended that schema to include enumerated field types for the DOGE names so I would never have to worry about keeping them consistent (e.g., Mike Russo in one place and Michael Russo in another). I used that for validating agency abbreviations and system acronyms as well.\n  I added a type field and defined several basic event types so I could differentiate between things like “DOGE staff were spotted at an agency,” “a specific DOGE staffer was granted access to several systems” or “a person was detailed from one agency to another”\n  I added support for imprecise dates, so I would be able to better represent the fuzziness of a news article reporting that something happened “late last week” vs. an exact date\n
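\nTo give a taste of the validation, here is a trimmed, illustrative excerpt of what the enum guards in such a schema can look like, sketched in YAML for readability (the actual JSON Schema file is longer; this reconstruction only mirrors field names and values that appear in the dataset):\n\n
$schema: https://json-schema.org/draft/2020-12/schema\n
type: array\n
items:\n
  type: object\n
  properties:\n
    acronym:\n
      enum: [CMS, DOI]\n
    events:\n
      type: array\n
      items:\n
        type: object\n
        required: [date, type, event, source]\n
        properties:\n
          type:\n
            enum: [disruption, onboarded, promotion, report, access_granted]\n
          named:\n
            type: array\n
            items:\n
              enum: [Bryton Shang, Katrine Trampe, Stephanie Holmes, Tyler Hassen]\n
\nWith something like that in place, a misspelled name or an unknown event type shows up as a validation error right in the editor.\n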
\nAt this point, the YAML looked more like this for a single agency:\n\n- name: Department of the Interior\n  acronym: DOI\n  roundups:\n  - source: https://www.nytimes.com/interactive/2025/02/27/us/politics/doge-staff-list.html\n    organization: The New York Times\n    named:\n    - Tyler Hassen\n  - source: https://projects.propublica.org/elon-musk-doge-tracker/\n    organization: ProPublica\n    named:\n    - Tyler Hassen\n  - source: https://www.wired.com/story/elon-musk-doge-silicon-valley-corporate-connections/\n    organization: Wired Magazine\n    date: 2025-03-28\n    named:\n    - Tyler Hassen\n  events:\n  - date: 2025-01-28\n    type: disruption\n    event: Two DOGE staffers attempted to force water pumps to be turned on in a large reservoir in California for a photo op\n    named:\n    - Tyler Hassen\n    - Bryton Shang\n    source: https://www.cnn.com/2025/03/07/climate/trump-doge-california-water/index.html\n  - date: 2025-02-24\n    type: onboarded\n    onboard_type: detailed\n    event: Stephanie Holmes is detailed to the Department of the Interior as a Special Advisor and acting Chief Human Capital Officer for the entire agency\n    detailed_from: DOGE\n    named:\n    - Stephanie Holmes\n    source: https://subscriber.politicopro.com/article/eenews/2025/03/05/heres-the-people-connected-to-doge-at-interior-00213330\n  - date: 2025-03-04\n    type: disruption\n    event: DOGE boasts in a tweet that 27% more water was released in February compared to January (unclear if this adjusts for different lengths of months)\n    source: https://xcancel.com/DOGE/status/1896948512975433787\n  - date: 2025-03-07\n    type: promotion\n    event: Tyler Hassen is promoted to Acting Assistant Secretary of Policy, Management and Budget\n    named:\n    - Tyler Hassen\n    source: https://www.eenews.net/articles/doge-official-appointed-head-of-policy-at-interior/\n  - date: 2025-03-28\n    type: report\n    event: Expressing concerns about DOGE requesting access to FPPS, the CIO and CISO of the Department of the Interior present a memo to the Interior Secretary about the risks for him to acknowledge and sign. He doesn't sign it\n    source: https://www.nytimes.com/2025/03/31/us/politics/doge-musk-federal-payroll.html\n  - date: 2025-03-28\n    type: disruption\n    event: Tyler Hassen places the CIO and CISO on administrative leave under investigation for raising alarm about DOGE access\n    named:\n    - Tyler Hassen\n    source: https://www.nytimes.com/2025/03/31/us/politics/doge-musk-federal-payroll.html\n  - date: 2025-03-29\n    event: Two DOGE staffers are granted admin access to the FPPS payroll system at the Department of the Interior\n    type: access_granted\n    access_type: admin\n    access_systems:\n    - FPPS\n    named:\n    - Stephanie Holmes\n    - Katrine Trampe\n    source: https://www.nytimes.com/2025/03/31/us/politics/doge-musk-federal-payroll.html\n  systems:\n  - name: Federal Personnel Payroll System\n    id: FPPS\n    description: A shared service which processes payrolls for the Justice, Treasury and Homeland Security departments, as well as the Air Force, Nuclear Regulatory Commission and the U.S. Customs and Border Protection, among other federal agencies.\n    risk: PII and payment info for federal staff at several large agencies, including the ability to interfere with pay\n    pia: https://www.doi.gov/sites/doi.gov/files/fpps-pia-revised-04222020_0.pdf\n  cases:\n  - name: Center for Biological Diversity v. U.S. Department of Interior (D.D.C.)\n    description: Plaintiff, a nonprofit organization focused on habitat preservation for endangered species, alleges that DOGE and the Department of the Interior have violated the Administrative Procedure Act by failing to follow Federal Advisory Committee Act (FACA) requirements\n    case_no: 1:25-cv-00612\n    date_filed: 2025-03-03\n    link: https://www.courtlistener.com/docket/69698261/center-for-biological-diversity-v-us-department-of-interior/\n\nBut it was starting to get more unwieldy to edit. And sometimes, when a single event affected multiple agencies, I would need to duplicate and move content around, which made it harder to keep everything consistent. So, the next big step was to define a workflow where I edit a few raw source files and a pre-commit hook regenerates everything downstream: the source files hold basic types, and derived files are generated by processing and joining them into more complicated data structures. This is the key change that powered the next iteration of the IT Modernization dataset.\n
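\nThe glue for that workflow is small. Here is a minimal sketch of what such a pre-commit step can look like in Python (the file names are placeholders of mine, and the real scripts surely differ): it validates the YAML against the JSON Schema, then regenerates the flattened CSV of events for the non-programmers.\n\n
import csv\n
import json\n
\n
import yaml                      # PyYAML\n
from jsonschema import validate  # jsonschema package\n
\n
# Placeholder file names standing in for the real repository layout.\n
with open('it_modernization.yaml') as f:\n
    agencies = yaml.safe_load(f)\n
with open('schema.json') as f:\n
    schema = json.load(f)\n
\n
# Fail the commit loudly on drift: a misspelled field name, a DOGE\n
# staffer outside the enumeration, an unknown event type, etc.\n
validate(instance=agencies, schema=schema)\n
\n
# Regenerate the flattened events CSV for loading into Excel.\n
with open('events.csv', 'w', newline='') as f:\n
    writer = csv.writer(f)\n
    writer.writerow(['agency', 'date', 'type', 'event', 'named', 'source'])\n
    for agency in agencies:\n
        for event in agency.get('events') or []:\n
            writer.writerow([\n
                agency['acronym'],\n
                event.get('date'),\n
                event.get('type'),\n
                event.get('event'),\n
                '; '.join(event.get('named') or []),\n
                event.get('source'),\n
            ])\n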
\nDOGE Track\n\nIt was time to make the “IT Modernization” dataset its own project, so I created a new repository named doge_track on May 7, 2025 and copied over the existing data from Trump Data. From there, the immediate next step was to build a website.\n\nThe main thing to understand about me is that I am really cheap and I didn’t want to spend a lot of money every month just to keep a website up and running. Static site generation seemed like the best option here because this content doesn’t change that frequently (at most, several times a day) and does not need to serve personalized information to different users. I love static sites because it’s a simple approach that scales tremendously for peanuts, especially since several providers offer free accounts for static site hosting.\n\nIn fact, the only money I have spent on DOGE Track was to buy a license for Font Awesome and the domain dogetrack.info. The DOGE Track website currently runs in a free tier on Render.\n\nTo make this happen, I picked the same Bridgetown library that I have used for this site as well. I had originally considered a similar static generator in Python, but what I like about Bridgetown (as well as its predecessor, Jekyll) is that it’s very easy to integrate data into page generation by adding YAML files to the _data/ directory. These can then be referenced in templates, which use either Liquid or embedded Ruby to let you iterate over the data. Eventually, I redesigned the site to call a database directly for its data, but this was a helpful way to build the site initially.\n\nFor the front end, I used Tailwind CSS and DaisyUI because they seemed like interesting tech I wanted to learn and a coworker recommended them. I have enjoyed the highly granular control over formatting I get with Tailwind, but if I were doing it today, I would probably take the time to learn proper web components.\n\nBut I didn’t have the time. New events were happening every day. It was important to keep up. And it has continued to be an obsessive project of mine ever since. Over the past year, I have added lots of data and made a lot of changes to how the data is stored and how it is presented, both on big screens and mobile phones.\n\nI don’t know when I will stop working on this project. Even now, I am considering new ways to present what is happening each month, filling in information about prior affiliations (like Tesla) or post-DOGE jobs and providing more context on what DOGE has been doing to further its projects across the government. I don’t know when DOGE will truly end, but I do know this work continues to keep me engaged. It has given people insight into what is happening. And it provides me with a way to channel my anger so I can bear witness in the hopes that one day these people will be held accountable."
        },
        {
          "id": "404",
          "title": "404",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/404",
          "content": "404\n\nPage Not Found :(\n\nThe requested page could not be found."
        },
        {
          "id": "500",
          "title": "500",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/500",
          "content": "500\n\nInternal Server Error :(\n\nThe requested page could not be delivered."
        },
        {
          "id": "about",
          "title": "About",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/about/",
          "content": "This is the basic Bridgetown site template. You can find out more info about customizing your Bridgetown site, as well as basic Bridgetown usage documentation at bridgetownrb.com\n\nYou can find the source code for Bridgetown at GitHub:\nbridgetownrb /\nbridgetown"
        },
        {
          "id": "",
          "title": "jacobharr.is",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/",
          "content": "jacobharr.is\n\n        \n            \n                \n                    Résumé\n\n\n                    \n    https://jacobharr.is/resume\n\n\n\n\n                    Email\n\n\n                    \n    mail@jacobharr.is\n\n\n\n\n                    Location\n\n\n                    \n    Takoma Park, MD\n\n\n\n\n                    Bluesky\n\n\n                    \n    @jacobharr.is\n\n\n\n\n                    Signal\n\n\n                    \n    jacobharris.28\n\n\n\n                \n            \n\n            \n                Notable Projects\n                \n                \n                \n    \n        2007-2014\n\n\n        \n    \n\n\n\n    @nytimes on Twitter\n\n\n\n\n\n\n    \n        2007-2015\n\n\n        \n    \n\n\n\n    Data Journalism\n\n\n\n\n\n\n    \n        2007-2025\n\n\n        \n    \n\n\n\n    Open Source Projects\n\n\n\n\n\n\n    \n        2008-2014\n\n\n        \n    \n\n\n\n    The New York Times Election Loader\n\n\n\n\n\n\n    \n        2010-2011\n\n\n        \n    \n\n\n\n    Wikileaks War Logs\n\n\n\n\n\n\n    \n        2012-2017\n\n\n        \n    \n\n\n\n    Times Haiku\n\n\n\n\n\n\n    \n        2015-2024\n\n\n        \n    \n\n\n\n    Sky Gradients\n\n\n\n\n\n\n    \n        2015-2025\n\n\n        \n    \n\n\n\n    Civic Technology\n\n\n\n\n\n\n    \n        2025\n\n\n        \n    \n\n\n\n    DOGE Track\n\n\n\n\n\n\n    \n\n\n            \n\n\n            \n                Personal\n\n                \n                \n                \n    \n        2015\n\n\n        \n    \n\n\n\n    Leaving the New York Times\n\n\n\n\n\n\n    \n        2016\n\n\n        \n    \n\n\n\n    How the Times Did Election Results 100 Years Ago\n\n\n\n\n\n\n    \n        2021\n\n\n        \n    \n\n\n\n    Bad Metrics For Agile Development\n\n\n\n\n\n\n    \n        2025\n\n\n        \n    \n\n\n\n    A Civic Tech Bookshelf\n\n\n    Visualizing the Human Toll of COVID-19\n\n\n    An AI-Fueled Bureaucratic Nightmare\n\n\n    The Assault on Oversight in the Executive Branch\n\n\n\n\n\n\n    \n        2026\n\n\n        \n    \n\n\n\n    A Privacy Act With Teeth?\n\n\n    Why I Don’t Vibe Code\n\n\n\n\n\n\n    \n\n\n            \n\n            \n                Published\n\n                \n                \n                \n    \n        2007\n\n\n        \n    \n\n\n\n    The Right Kind of Stupid\n\n\n\n\n\n\n    \n        2010\n\n\n        \n    \n\n\n\n    How Often Is the Times Tweeted?\n\n\n    Using Varnish So News Doesn’t Break Your Server\n\n\n\n\n\n\n    \n        2011\n\n\n        \n    \n\n\n\n    Word Clouds Considered Harmful\n\n\n\n\n\n\n    \n        2012\n\n\n        \n    \n\n\n\n    The New York Times’ Election Loader\n\n\n\n\n\n\n    \n        2013\n\n\n        \n    \n\n\n\n    How the Data Sausage Gets Made\n\n\n    And Remember, This Is for Posterity\n\n\n    The Times Regrets the Error\n\n\n\n\n\n\n    \n        2014\n\n\n        \n    \n\n\n\n    Bots With Thoughts\n\n\n    Distrust Your Data\n\n\n    Prediction for 2015 - A Wave of P.R. Data\n\n\n\n\n\n\n    \n        2015\n\n\n        \n    \n\n\n\n    Connecting with the Dots\n\n\n    Consider the Boolean\n\n\n    Thank You, Electionbot\n\n\n    Why Is It So Hard to Make Great Food Infographics?\n\n\n\n\n\n\n    \n        2016\n\n\n        \n    \n\n\n\n    Why I Like to Instagram the Sky\n\n\n    Solving a Century-Old Typographical Mystery"
        },
        {
          "id": "posts",
          "title": "Posts",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/posts/",
          "content": "Why I Don't Vibe Code\n    \n  \n    \n      A Privacy Act With Teeth?\n    \n  \n    \n      The Assault on Oversight in the Executive Branch\n    \n  \n    \n      An AI-Fueled Bureaucratic Nightmare\n    \n  \n    \n      Visualizing the Human Toll of COVID-19\n    \n  \n    \n      A Civic Tech Bookshelf\n    \n  \n    \n      Bad Metrics For Agile Development\n    \n  \n    \n      Solving a Century-Old Typographical Mystery\n    \n  \n    \n      Why I Like to Instagram the Sky\n    \n  \n    \n      How the _Times_ Did Election Results 100 Years Ago\n    \n  \n    \n      Why Is It So Hard to Make Great Food Infographics?\n    \n  \n    \n      Leaving the New York Times\n    \n  \n    \n      Thank You, Electionbot\n    \n  \n    \n      Consider the Boolean\n    \n  \n    \n      Connecting with the Dots\n    \n  \n    \n      Prediction for 2015 - A Wave of P.R. Data\n    \n  \n    \n      Distrust Your Data\n    \n  \n    \n      Bots With Thoughts\n    \n  \n    \n      The Times Regrets the Error\n    \n  \n    \n      And Remember, This Is for Posterity\n    \n  \n    \n      How the Data Sausage Gets Made\n    \n  \n    \n      The New York Times' Election Loader\n    \n  \n    \n      Word Clouds Considered Harmful\n    \n  \n    \n      Using Varnish So News Doesn't Break Your Server\n    \n  \n    \n      How Often Is the Times Tweeted?\n    \n  \n    \n      The Right Kind of Stupid\n    \n  \n\n\nIf you have a lot of posts, you may want to consider adding pagination!"
        },
        {
          "id": "resume",
          "title": "Resume for Jacob Harris",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/resume/",
          "content": "Jacob Harris\n    \n      mail@jacobharr.is\n      https://jacobharr.is/\n      (917) 535-2026\n    \n  \n\n  \n    \n  \n\n\n  \n  Objective\n\n  Experienced Engineering Lead/Manager with a strong background in backend engineering, specializing in creating APIs and web applications using Node.js, Python, Ruby on Rails and even Java. Proven track record in agile software development and building scalable solutions, including high-volume streaming API for Medicare claims and the software that presented election results for the New York Times. Eager to leverage both technical expertise and software delivery skills for an organization with a public-facing mission and a strong culture of collaboration.\n\n\n\n  \n    Coding:\n    \n    \n        Python\n    \n        Flask\n    \n        FastAPI\n    \n        Node.js\n    \n        Express\n    \n        JavaScript\n    \n        Ruby\n    \n        Ruby on Rails\n    \n        Jekyll\n    \n        Bridgetown\n    \n        Hugo\n    \n        Skeleton\n    \n        Bulma\n    \n        TailwindCSS\n    \n        C\n    \n        C++\n    \n        Go\n    \n        Bash\n    \n        Java\n    \n        Scala\n    \n        Spark\n    \n        Scheme\n    \n        Lisp\n    \n        Clojure\n    \n        SQL\n    \n        GRPC\n    \n        Protobuf\n    \n    \n\n    Tools:\n    \n        \n            AWS\n        \n            Terraform\n        \n            PagerDuty\n        \n            Docker\n        \n            CloudFoundry\n        \n            Scraping\n        \n            Airflow\n        \n            Splunk\n        \n            Kafka\n        \n            Kinesis\n        \n            MySQL\n        \n            Postgres\n        \n            Sqlite\n        \n            Temporal\n        \n            Lambda Step Functions\n        \n            Redis\n        \n            Elasticsearch\n        \n            Deque Axe\n        \n    \n\n    Skills:\n    \n    \n        Data journalism\n    \n        Web frameworks\n    \n        Accessibility\n    \n        WCAG\n    \n        REST\n    \n        OpenGraph\n    \n        GRPC\n    \n        Monitoring\n    \n        Telemetry\n    \n        ETL\n    \n        ELK\n    \n        Microservices\n    \n        Remote work\n    \n        Agile\n    \n        Kanban\n    \n        SAFe\n    \n        Engineering management\n    \n        Federal procurement\n    \n        Compliance\n    \n    \n\n\n  \n  Employment\n\n  \n    \n      2025–now\n\n\n      \n    \n\n  Engineering Lead,\n  Magnum Opus, LLCMO Studio\n\n  2025–now\n\n\nMagnum Opus is a small consulting company that applies human-centered design and agile delivery to improve government services.\n\n\n    \n      \n        Pennsylvania HR1 Verification\n        Member of a small discovery-focused effort advising the Pennsylvania Department of Health and Human Services on how the state can meet its work verification requirements mandated by HR.1 while reducing harm for beneficiaries and caseworkers. As part of this work, helped to identify several key interventions for the state to prioritize. Investigated potential data sources and built a prototype system to certify compliance and keep a permanent audit trail using open-source components in Python and PostgreSQL.\n\n        Pennsylvania HR1 Verification: Tech lead on discovery work on how Pennsylvania can comply with new Medicaid work requirement while preserving benefits and reducing churn and administrative burdens. 
Built a quick prototype of an external verification engine in FastAPI and PostgreSQL.\n\n      \n    \n  \n\n\n\n\n    \n      2023–2025\n\n\n      \n    \n\n  Supervisory IT Specialist,\n  Consumer Financial Protection BureauCFPB\n\n  2023–2025\n\n\nCFPB is a small federal agency founded in 2010 and dedicated to financial education, regulation and enforcement.\n\n\n    \n      \n        Application Development Lead\n        Member of D&amp;D’s leadership team. Oversaw technical standards for engineering practices and helped teams to anticipate and handle bureaucratic roadblocks, resulting in fewer surprises and improved team productivity. Supported teams building web applications in Django with Javascript or React frontends, as well as the Wagtail CMS and SQL backends.\n\n        Application Development Lead: Manager in the Design &amp; Development team. Oversaw technical standards and steered product work for multiple development teams working on public-facing and internal products. Worked with systems built in Django with Javascript/React frontends and SQL backends. Anticipated blockers and fended off bureaucratic obstacles that impeded progress. Supported accessibility and design standards.\n\n      \n    \n      \n        Supervisory IT Specialist\n        Supervised 12 software engineers distributed across 5 different engineering teams. This included teams working on the CFPB website, consumer-facing tools, internal products for enforcement practices and software for receiving and handling consumer complaints. Conducted regular 1-1s with all my direct report as well as performance reviews. Re-established dormant engineering practices like syncs and training.\n\n        Supervisory IT Specialist: Managed up to 12 engineers across 5 scrum teams. Held regular checkins with all my direct reports and conducted performance reviews twice yearly. Followed a high-contact approach that supported my engineers as people rather than “resources,” with an emphasis on career growth and personal development.\n\n      \n    \n      \n        Contracting Officer Representative (Level 1)\n        Certified as a COR Level 1 in the federal government. Managed the the procurement, payment and even the wind-down of several Software-as-a-Service (SaaS) products used by developers within D&amp;D. Ensured services obtained their necessary Authorizations To Use (ATUs) for Deque AXE, Sauce Labs, Netlify and Mapbox.\n\n        Contracting Officer Representative (Level 1): Managed procurements for multiple SaaS products used by developers at the CFPB.\n\n      \n    \n  \n\n\n\n\n    \n      2018–2023\n\n\n      \n    \n\n  Senior Engineering Lead,\n  Nava PBCNava\n\n  2018–2023\n\n\nNava is a public benefit corporation that contracts with both federal and state government agencies to build custom software.\n\n\n\n    \n      \n        Quality Payments Program (QPP) Submissions API\n        Joined the team as a senior engineer building an API for accepting millions of data submissions. Learned Express on Node.js on the job. Created a more flexible library for test data to expand test coverage and improve developer experience. Built the initial version of the scoring engine that would process doctor submissions and calculate scores with bulk SQL operations based on formulas defined by Medicare staff. Consulted on a revamp in Apache Spark and Parquet after prototyping in Luigi and Airflow. Moved into role of Senior Tech Lead and managed up to 9 engineers on project as both engineering lead and people manager. 
Worked closely with product and project managers to plan out our work using Scrum, coordinating with 14 other scrum teams on the project in the Scaled Agile Framework.\n\n        Quality Payments Program (QPP) Submissions API: Started as senior developer on a team building an API in Node for Medicare doctors. Transitioned into tech lead role managing up to 9 engineers on project. Built prototype for scoring engine later written in Spark.\n\n      \n    \n      \n        CMS Cloud IT Operations (CLOUD ITOPS)\n        Identified needs for and implemented custom tooling for infrastructure security and compliance requirements. Built tools in Go to address gaps in CloudFoundry’s capabilities. Implemented infrastructure as code (IaC) in Terraform. Served as the people manager for 6 engineers.\n\n        CMS Cloud IT Operations (CLOUD ITOPS): Engineering manager for devops team improving cloud operations for Medicare and Medicaid. Wrote multiple tools in Go to address gaps in functionality for Cloud Foundry.\n\n      \n    \n      \n        Medicare Replicated Data Access (RDA) API\n        Led the Replicated Data Access (RDA) API of 3 engineers, building a GRPC-based API in Java to store and stream millions of Medicare claims to other parts of the organization. Served as an effective people manager for 10 engineers across the company. Successfully expanded project scope to include an orchestration pipeline using AWS Lambda Step Functions to ingest claims data from COBOL format files. Managed the rollout and launched to production with 99.999% uptime and extremely low costs over lifetime of project. Consulted on several other contract modifications and built a pilot prototype allowing Medicare to receive patient survey information via the FHIR format.\n\n        Medicare Replicated Data Access (RDA) API: Led a small team of 3 engineers building an API in Java to stream millions of medical claims each day over GRPC. We also created a loading process in AWS Lambda Step Functions to load data reliably and cheaply.\n\n      \n    \n      \n        Business Development Work\n        Participated in multiple bid teams to craft proposals and implement technical challenges to win new contracts. Worked closely with business development to research the technical landscape and propose our unique approach for any bid proposal, including architecture diagrams, staffing and cost estimates, QASPs or other measures of success as well as our proposed timelines and milestones.\n\n        Business Development Work: Member of multiple bid teams that crafted winning proposals and executed on technical challenges.\n\n      \n    \n      \n        Engineering Management\n        Regularly managed 6-12 engineers, coaching their career development through weekly 1:1s and regular performance check-ins. Guided multiple junior engineers to senior roles and also into engineering management. Received accolades in 360-degree reviews for building supportive teams and proactively addressing delivery problems.\n\n        Engineering Management: Regularly managed anywhere from 6-12 engineers with weekly checkins and through multiple performance management seasons. 
Developed a tracker for engineers to plot out their advancement against career ladders.\n\n      \n    \n  \n\n\n\n\n    \n      2015–2018\n\n\n      \n    \n\n  Innovation Specialist, 18F,\n  General Services AdministrationGSA\n\n  2015–2018\n\n\n18F was a “government startup” within the General Services Administration the consulted on custom software for other federal agencies.\n\n\n\n    \n      \n        MyUSA\n        A precursor to login.gov, MyUSA provided single-sign-on (SSO) for multiple government sites and allowed users to control what information they shared. Helped to build out the backend server in Ruby on Rails. Also contributed to frontend in JQuery and vanilla Javascript.\n\n        MyUSA: Joined as a senior engineer building an authentication service for federal sites in Ruby on Rails.\n\n      \n    \n      \n        Micro-purchase\n        Launched an experiment that allowed government agencies to pay for custom software development with the easier micro-purchase approach for procurement. Built a robust web application in Ruby on Rails to run reverse auctions where developers could bid on small coding projects from government agencies.\n\n        Micro-purchase: Launched an experimental site in Rails that used micro-purchases to pay for small scale software development via reverse auctions.\n\n      \n    \n      \n        FBI Crime Data Explorer\n        Worked with other engineers and stakeholders to modernize how the FBI published crime data online. Learned Python to work with another senior developer on the project. Built a RESTful API with Django, PostgreSQL and SQLAlchemy used by the visual explorer website. Created new SQL queries to generate reports that weren’t possible with prior data formats. Created tools to generate open data downloads.\n\n        FBI Crime Data Explorer: Worked closely with 18F team and FBI stakeholders to develop a modern site for publishing crime data statistics. Learned Python as part of the work and built loading processes and crosstabs for Postgresql.\n\n      \n    \n      \n        Confidential Survey\n        Designed a prototype in Ruby on Rails for conducting surveys that aggregate demographic data while protecting user privacy, ensuring secure data collection practices for sensitive data.\n\n        Confidential Survey: Designed a prototype app in Ruby on Rails to conduct surveys with risk of leaking identities.\n\n      \n    \n  \n\n\n\n\n    \n      2006–2015\n\n\n      \n    \n\n  Senior Software Architect,\n  The New York TimesNY Times\n\n  2006–2015\n\n\nIn 2007, I was a co-founder of the Interactive Newsroom Technologies Team, a hybrid team of configs: journalists and engineers within the newsroom to create news-driven applications on deadline.\n\n\n    \n      \n        Elections (2008-2014)\n        Paired with another developer to build the initial version of a new and better election results loader in time for the 2008 general election. Enhanced it over the years to improve performance and functionality and handle primaries and general elections from 2008 to 2014, as well as for the NYC mayoral election in 2013. Built in Ruby on Rails with bulk SQL operations for performance. Operated as a modular REST microservice that shared data to website, generated graphics and result tables and even built graphics for the print newspaper. 
Architected to use static page generation and reverse caching to handle tremendous amounts of web traffic with minimal resources.\n\n        Elections (2008-2014): Paired with another developer to build an election loading backend in Rails. Enhanced its functionality and managed its operation over primaries and general elections in the years after. Built admin for news editors to track results and call races.\n\n      \n    \n      \n        Olympics Results (2010/2012)\n        Helped to architect and build a backend service and admin in Ruby on Rails to process a firehose of XML data provided by the International Olympic Committee into a SQL DB with a RESTful API. This service populated pages, visualizations and interactive widgets for both the web and print. Also helped to architect the successor system built for the 2012 London Olympics.\n\n        Olympics Results (2010/2012): Helped architect and develop a backend service in Ruby on Rails to parse results data from the 2010 and 2012 Olympics to render static pages and widgets for the newspaper’s coverage.\n\n      \n    \n      \n        Wikileaks War Logs\n        When Wikileaks provided the Times with leaked military dispatches from Iraq and Afghanistan, built an internal web admin in Ruby on Rails used by reporters to search and analyze the data for stories. Included an ETL process to extract and geocode locations within the text. Also contributed research and pitched a graphic to accompany a story about a surge of sectarian violence in Baghdad after the US invasion.\n\n        Wikileaks War Logs: Built an internal admin in Ruby of Rails for reporters to analyze the Wikileaks War Logs for Iraq and Afghanistan.\n\n      \n    \n      \n        PUFFY\n        In 2009, we created a site for readers to upload and share their photos from the inauguration of President Obama. Built on this prototype to create a tool in Ruby on Rails called PUFFY (for Photo Upload Form For You) that allowed editors to moderate reader-submitted photos in 30+ projects. For one of these named “A Moment in Time,” more than 10,000 readers submitted geotagged photos taken at the same time across the Earth.\n\n        PUFFY: Built a tool in Ruby on Rails for editors to moderate reader-submitted photos in 30+ different projects at the Times.\n\n      \n    \n      \n        Open Source\n        Spearheaded new open-source initiatives at the NYT and improved outreach for people to use public APIs from the New York Times. Helped to create and wrote early content for NYT Open, which still runs to this date.\n\n        Open Source: Co-created the Open blog for open-source outreach and represented NYT digital engineering at conferences.\n\n      \n    \n      \n        @nytimes Twitter Account\n        One afternoon in 2007, created the @nytimes twitter account and built a custom posting bot as a simple Ruby script. Expanded it into an admin interface in Ruby on Rails that fed stories from RSS feeds into 80+ Twitter accounts associated at the Times. Ran operations for social media for several years before handing off to the new social media team at the NYT.\n\n        @nytimes Twitter Account: Registered the @nytimes twitter account and created the software to automatically push new articles out as tweets to it and 80+ other specialized accounts. Iterated admin to support social media team.\n\n      \n    \n      \n        Times Haiku\n        After the 2012 election, built a bot in Ruby that scanned Times articles to find haiku embedded within them. 
This became a whimsical and official NYT Tumblr blog of curated haiku that ran from 2013 to 2017, posting thousands of found haiku.\n\n        Times Haiku: Built a whimsical bot to scan new articles for phrases that met the haiku syllable pattern. Developed a moderation interface and adapated a visual presentation to launch as a dedicated Tumblr blog. This was a fun one.\n\n      \n    \n  \n\n\n\n\n    \n      1998–2006\n\n\n      \n    \n\n  Software Developer,\n  Alacra, Inc.Alacra\n\n  1998–2006\n\n\nAlacra resells financial data from over 80 databases to financial and legal firms.\n\n\n\n    \n      \n        XML/XSLT\n        Developed a framwork in XML and XSLT to dynamically share data and build docs. Created XML DOM implementation.\n\n        XML/XSLT: Developed a framwork in XML and XSLT to dynamically share data and build docs.\n\n      \n    \n      \n        PortalB\n        Built a web crawler in C++ to create search engine for for business sites.\n\n        PortalB: Built a web crawler to create search engine for for business sites.\n\n      \n    \n  \n\n\n\n\n    \n  \n\n\n  \n  Education\n\n  \n    \n      \n        1993–1997\n      \n\n      \n        \n          B.S., Computer Science (Minor: Literature),\n          MIT\n          1997\n        \n\n        Concentration of studies: operating systems, software engineering, programming languages and compilers.\n\n      \n    \n      \n        1997–1998\n      \n\n      \n        \n          M.Eng., Computer Science,\n          MIT\n          1998\n        \n\n        Thesis: Lightweight Object-Oriented Shared Variables for Distributed Applications on the Internet"
        },
        {
          "id": "test",
          "title": "Testing",
          "collection": {
            "label": "pages",
            "name": "Posts"
          },
          "categories": "",
          "tags": "",
          "url": "/test/",
          "content": ""
        }
]
