Article: Key Takeaway Points and Lessons Learned from QCon New York 2018

Posted by Abel Avram on Aug 01, 2018
Conferences often have a theme that emerges during their course that none of us had predicted. One year at QCon London it was, bizarrely, cat pictures; every presentation you went to had a cat picture in it. Another year, as the microservices movement was just getting going, it seemed that we had mandated a Conway’s law slide in every presentation – we hadn’t of course, but there certainly were plenty of them.
This year, at the seventh annual QCon New York, our second year in Times Square, it felt like it was diversity and inclusion.

The event had a particularly positive atmosphere that made for something truly special, and we got a huge amount of positive feedback from attendees and speakers about it, both during the event and afterwards. This is something the QCon team has worked on for several years, and it felt wonderful to see that work starting to pay dividends.

From a content perspective, attendees at the event got to see keynotes from Guy Podjarny, co-founder of Snyk, talking about “Developers as a Malware Vehicle”; Joshua Bloch giving a “Brief, Opinionated History of the API”; and Tanya Reilly, Principal Engineer at Squarespace, giving a thoroughly interesting and unusual talk about the history of fire escapes in New York City and what we as software engineers can learn from them.
In total we had 143 speakers across the 117 sessions, workshops, AMAs, Open Spaces and mini-workshops. Topics included containers and orchestration, machine learning, ethics, modern user interfaces, microservices, blockchain, empowered teams, modern Java, DevEX, Serverless, chaos and resilience, Go, Rust, Elixir, and security.
Videos of most presentations were available to attendees within 24 hours of them being filmed, and we have already begun to publish them on the InfoQ site. You can view the publishing schedule on the QCon New York website.
InfoQ also reported from the event, and recorded podcasts with a number of speakers. This article, however, presents a summary of QCon New York as blogged and tweeted by attendees.
A Brief, Opinionated History of the API
by Joshua Bloch
Twitter feedback on this keynote included:
@lizthegrey: Subroutine libraries first appear in Goldstine and von Neumann’s 1948 paper on programming methodology *before general-purpose computers physically existed*. #QConNYC
@lizthegrey: Key idea: programs require common operations. Library subroutines reduce duplicated code and number of errors. #QConNYC
@lizthegrey: Maurice V. Wilkes: second ever Turing Award given to Wilkes for subroutine libraries. Why didn’t Goldstine and von Neumann get the award? It was vaporware at the time. #QConNYC
@lizthegrey: EDSAC was the real deal — world’s first stored-program computer; was immediately useful. 650 instructions per second. #QConNYC
@lizthegrey: 4 million times slower than a modern PC, and 4 million times less memory; 100 times the power and 1000 times the size. But it changed the world. #QConNYC
@charleshumble: EDSAC was 17 bit. Don’t ask why. I do know but you don’t want to. @joshbloch #QConNYC https://t.co/REcLrd8jPP
@jeanneboyarsky: In first program on first computer: “… realization came over me that a good part of the remainder of my life was going to be spent in finding the errors in my own programs” — Maurice Wilkes. Subroutine library is a partial fix. #qconnyc
@lizthegrey: If you don’t have to debug every little thing because you have trusted libraries, you might have an easier time… So Wilkes gave the job to Wheeler. #QConNYC
@lizthegrey: Wheeler devised “coordinating orders” to direct the compiler to insert subroutines into the code. [ed: kind of like macros, almost]. Required no manual intervention, unlike von Neumann’s idea #QConNYC
@lizthegrey: Everything fit on a single tape, and the bootloader stayed at 42 instructions (up from 30), constrained by the phone switches — “a tour de force of ingenuity.” #QConNYC
@lizthegrey: Arbitrary recursion was permitted from subroutines into other subroutines, and passing functions in as arguments to other functions. Self-modifying code. The linkage technique was called “The Wheeler Jump”. Amazing for its time. #QConNYC
@jeanneboyarsky: “The Wheeler Jump” — call function by jumping. Requires self-modifying code, which would be a security nightmare now. #QConNYC
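The Wheeler Jump the tweets describe can be sketched as a toy interpreter: with no call stack, the caller leaves its return address in the accumulator, and the subroutine’s first instruction rewrites the subroutine’s own exit jump to point back at the caller. The mini instruction set and memory layout below are invented for illustration; this is not real EDSAC code.

```python
# Toy Wheeler-Jump linkage: memory holds (opcode, argument) pairs, and the
# subroutine patches its own final JMP at run time (self-modifying code).

def run(memory, pc=0):
    acc = 0
    trace = []
    while True:
        op, arg = memory[pc]
        if op == "CALL":        # leave return address in acc, jump to subroutine
            acc = pc + 1
            pc = arg
        elif op == "PATCH":     # the Wheeler Jump: overwrite the exit slot
            memory[arg] = ("JMP", acc)  # with a jump back to the caller
            pc += 1
        elif op == "WORK":      # stand-in for real instructions
            trace.append(arg)
            pc += 1
        elif op == "JMP":
            pc = arg
        elif op == "HALT":
            return trace

# Layout: main program at 0..2, subroutine at 3..5 (slot 5 gets patched).
program = [
    ("CALL", 3),       # 0: call subroutine
    ("WORK", "back"),  # 1: execution resumes here after the Wheeler Jump
    ("HALT", None),    # 2
    ("PATCH", 5),      # 3: subroutine rewrites its own exit jump
    ("WORK", "sub"),   # 4: subroutine body
    ("JMP", None),     # 5: placeholder, overwritten at run time
]
print(run(program))   # → ['sub', 'back']
```

The essential trick is the PATCH step: the return linkage lives inside the subroutine itself, which is why the technique requires self-modifying code, as the tweets note.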
@lizthegrey: Large amount of mathematical coverage of the subroutines; written all in one year by one research team. #QConNYC
@danielbryantuk: “The EDSAC subroutine library was in fact a library of code tape” @joshbloch #qconnyc https://t.co/uLvgSXLUXg
@glamcoder: If you ever wondered why a software library is called “library”, that’s why #qconnyc #keynote #day2 https://t.co/eGZataIwCQ
@lizthegrey: Entire API was contained in _The Preparation of Programs for an Electronic Digital Computer_, the first text on computer programming (WWG abbreviation for the authors’ last names) #QConNYC
@lizthegrey: Key ideas presented in Wheeler’s 1952 paper: “The Use of Sub-routines in Programmes”. Described subroutines, libraries, performance/generality tradeoffs, first-class functions, etc. #QConNYC
@lizthegrey: Quote from the paper: “After coding/testing, there remains the task of writing a description so that people not acquainted with the interior coding can use it easily; *this task may be the most difficult*.” #QConNYC
@lizthegrey: 42 years later, Parnas still had to remind people that “Reuse is far easier to say than do, requiring both good design and documentation”. #QConNYC
@jeanneboyarsky: 1951 – “The Preparation of Programs for an Electronic Digital Computer”. Called WWG for last names of authors. World’s first text on computer programming and remained primary text until higher level languages arose. Introduced subroutines to the world. #QConNYC
@lizthegrey: Wheeler’s 1952 paper remains accurate: “simplicity of use, correctness of code, accuracy of description, and burying complexity out of sight.” #QConNYC
@lizthegrey: Why didn’t WWG discuss APIs separate from the library? Because the two were isomorphic. Only one machine architecture and one machine. No portability needed. #QConNYC
@charleshumble: “Wheeler’s 1952 paper was only 2 pages long. And by the way I found a typo in it.” @joshbloch #QConNYC
@lizthegrey: No legacy programs, because there were no earlier programs. No need for backward compatibility. They understood API design principles but didn’t see a difference between library implementation and API. #QConNYC
@lizthegrey: But the field progressed, and libraries had to be reimplemented on new hardware. Keeping the same API let you preserve code and knowledge. #QConNYC
@lizthegrey: New algorithm implementations made existing APIs faster. First use of API: 1968 paper “Data structures and techniques for remote computer graphics” #QConNYC
@lizthegrey: “System designed to be hardware independent… implementation may be recoded for different hardware, but maintain the same interface with each other and the application program.” #QConNYC
@lizthegrey: “Eventual replacement is almost a certainty given the rapid rate of developments in computer technology… flexible, hardware independent systems will ensure that systems don’t become prematurely obsolete.” #QConNYC
@lizthegrey: Authors separated implementation from the API, to allow implementations to be replaced without harm to clients. Libraries naturally give rise to APIs. Not so much invented as discovered. #QConNYC
@lizthegrey: It took us 20 years from the invention of libraries to the latent discovery of APIs. #QConNYC
@lizthegrey: Two-part test for whether something’s an API: does it provide operations defined by inputs and outputs, and can it be reimplemented without compromising its users? #QConNYC
@lizthegrey: C standard library from 1975. K&R’s commentary from 1978: “routines are meant to be portable, and programs that only use the standard library can be moved from one system to another without change.” #QConNYC
@jeanneboyarsky: If it provides operations defined by inputs/outputs and allows reimplementation without compromising user, it is an API #QConNYC
@lizthegrey: Core libraries become joined to the language. Unix v6 system calls from 1975 — OS kernels have APIs! #QConNYC
@lizthegrey: IBM PC BIOS (1981) — firmware provided API to underlying hardware. MS-DOS command-line interface — is that an API? Well, a script requires access to the commands… #QConNYC
@lizthegrey: Win32 API (1993) still used today. Java class libraries (version 2, 1998), with many implementations including Harmony, Android, etc. #QConNYC [ed: obligatory that these are Bloch’s opinions and not necessarily mine or Google’s]
@lizthegrey: And then we have web APIs. The first web API was Delicious. Lessons: APIs come in all shapes and sizes, and can live long after the platforms they were created for. They can create entire industries above and below. #QConNYC
@lizthegrey: “APIs are the glue that connects the digital universe.” –@joshbloch #QConNYC
@lizthegrey: But API reimplementation is under serious attack, says @joshbloch. In August 2010, Oracle sued Google in Federal Court for reimplementing Java APIs in Android. #QConNYC
@lizthegrey: May 2012: The jury ruled no patent infringement, judge ruled APIs not copyrightable. Oracle appealed, and in May 2014 US Court of Appeals overturned Alsup’s ruling. Google petitioned SCOTUS, and in June 2015 SCOTUS declined to re-hear it. #QConNYC
@lizthegrey: “Unfortunately, the case was remanded to the court in California to decide if it was fair use.” May 2016: jury ruled fair use. Appealed by Oracle, but despite academic amicus briefs, in May 2018 the appeals court reversed the jury verdict. #QConNYC
@lizthegrey: “Currently, unless something changes, it is today the law of the land that API reimplementation isn’t allowed without the permission of the API creator.” But there are efforts to have a new en banc hearing, supported by academics and industry professionals. #QConNYC
@lizthegrey: [ed: all of the above are, I must stress again, @joshbloch’s opinions and not mine or Google’s.] What happens if the ruling stands? Payment of licensing fees or field of use restrictions (or outright denial) are possible. #QConNYC
@lizthegrey: Author would get a near-perpetual monopoly on the API. If you think software patents cause problems (20 years), a life plus 70-year or 95-year monopoly on implementations of an API would strangle the industry. #QConNYC
@lizthegrey: None of GNU, non-IBM PCs, Samba, Wine, Android would be possible without this right to reimplement, which had been the case for most of the 70-year history of computers. #QConNYC
@lizthegrey: We’ll wind up spending less time coding, more time arguing with lawyers, and wind up reimplementing incompatible things, says @joshbloch. #QConNYC
@lizthegrey: APIs date back to the dawn of the computer age. Don’t develop APIs unless they’re free to reimplement. Don’t work for companies that assert copyright on APIs. And let executives at the companies you work for, and the courts and Congress know your opinions. #QConNYC
Developers as a Malware Vehicle
by Guy Podjarny
Jeanne Boyarsky attended this keynote:
Developers have more power than ever – can get more done and faster. Can also do more harm…
Must trust the people who write the software.
We ship code faster. Hard to tell if a developer introduces code maliciously or accidentally… As we get more power, we need to be more responsible.
Causes of insecure decisions:
Different motivations – focus on functionality; security is a constraint we need to be cognizant of.
Cognitive limitations – we move fast and break things.
Lack of expertise – don’t always understand security implications.
Developers are overconfident – harder to train people who think they already know it. “It doesn’t happen to me.” Security breaches happen to everyone.
Mitigations:
Make it easy to be secure.
Developer education.
Manage access like the tech giants do.
Challenge access requests: When is it needed? For how long? What happens if you don’t have access? What can go wrong with access? How would you find out about access being compromised?
Developers have access to user data. Be careful.
Google BeyondCorp: all access is routed through a corporate proxy; the proxy grants access per device, limiting what can be done from a Starbucks; access is monitored.
Microsoft Privileged Access Workstations (PAW): access to production can only be from a secure machine; no internet from the secure machine; your everyday machine is a VM on the secure machine.
Twitter feedback on this keynote included:
@lizthegrey: “Not being a malware distribution vehicle is generally useful.” — @guypod #QConNYC
@lizthegrey: Devs in China chose to mirror Xcode to local mirrors e.g. on Baidu filesharing. But some of the mirrors had malware 🙁 #QConNYC
@lizthegrey: XcodeGhost included a malicious CoreServices component that spies on users. Evaded Apple’s malware detection. 🙁 #QConNYC
@lizthegrey: Undetected for 4 months from May to September, infecting 300+ apps in China — WeChat, DiDi, Railway apps… #QConNYC
@lizthegrey: Some apps were compromised through third party libraries. Total of 1.4M active victims per day. Imagine a startup with 1.4M daily users within 4 months. #QConNYC
@lizthegrey: Victims weren’t just in China; many in US, Japan, Canada, Australia, and Ireland. #QConNYC
@lizthegrey: Even with a closed App Store environment, _months_ to get users to choose to update to non-infected apps. #QConNYC
@lizthegrey: Eventually, Apple wound up solving the underlying motivation: providing local mirrors so developers in China wouldn’t have to use untrusted sources. #QConNYC
@lizthegrey: Developers were a distribution vehicle for the malicious library. Without them, it would have gone nowhere. #QConNYC
@lizthegrey: Second example: 2009. Virus inside of Delphi called Induc. Compromised sysconst.dcu statically linked into every program compiled on machine. #QConNYC
@lizthegrey: Even worse than XcodeGhost — took 10 months to find, propagated millions of times, more viral and harder to centrally remove; replicated within compiler rather than executables. #QConNYC
@thenewstack: With Delphi’s #Induc and Apple’s #XCodeGhost, #malware can be widely spread through developers, by way of compilers and hidden libraries. @snyksec’s @guypod #QconNYC https://t.co/4y7IHplWq2
@lizthegrey: The moral: You can’t trust code that you didn’t create yourself. But nobody does this. #QConNYC
@lizthegrey: It’s more important to trust the people who write software, according to Thompson.
@lizthegrey: There’s no tie between the code on GitHub and the compiled binaries served on NPM. Malicious PyPi packages last year, RubyGems, NPM, etc. — and malicious docker images too. #QConNYC
@lizthegrey: These are just the ones that we know about. Attackers are smart and sophisticated, and evolving faster than defenders. — @guypod #QConNYC
@lizthegrey: Our users trust the code that we ship as developers. We need to pay attention. But we have another power: access to code, systems, and data. #QConNYC
@lizthegrey: Salesforce (@modMasha) ran an internal phishing test, and developers were the second most likely group to click the malicious link. #QConNYC
@thenewstack: With #DevOps, developers have access to production systems, and user data—which is not always a good idea, given that developers are just as likely to click on a phishing email as any other employee — @snyksec’s @guypod #QconNYC
@lizthegrey: Story 2: Uber Hack of 2016: 600k Uber drivers had PII leaked, and some personal info of 57M Uber users. Uber paid a $100k ransom disguised as a bug bounty. #QConNYC
@lizthegrey: Uber didn’t report the breach for an entire year until November 2017. How did it happen? S3 tokens pushed to private github repo w/o 2fa; attackers gained access to repo. #QConNYC
@lizthegrey: Tokens used to steal info from S3. Where did this go wrong? Uber said they told people to use 2fa, and to stop using GitHub. But repeat of 2014 incident where public gist contained sensitive URL. #QConNYC
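A leak like the one described above is the kind of thing a simple secret scan over commits can catch. Below is a minimal sketch using only the Python standard library; the patterns are illustrative and cover only AWS-style keys, far less than a real scanner (which would also use entropy checks and many more rules).

```python
import re

# Illustrative patterns: an AWS access key ID, and a quoted 40-character
# string near the word "aws" (the shape of an AWS secret key).
AWS_ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
AWS_SECRET_KEY = re.compile(r"(?i)aws(.{0,20})?['\"][0-9a-zA-Z/+]{40}['\"]")

def find_secrets(text):
    """Return (line_number, matched_text) pairs for suspected credentials."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in (AWS_ACCESS_KEY, AWS_SECRET_KEY):
            match = pattern.search(line)
            if match:
                hits.append((lineno, match.group(0)))
    return hits

# Demo with an obviously fake key:
sample = 'bucket = "logs"\naws_key = "AKIAABCDEFGHIJKLMNOP"\n'
print(find_secrets(sample))   # → [(2, 'AKIAABCDEFGHIJKLMNOP')]
```

A real pipeline would run a check like this over every staged diff (e.g. as a pre-commit hook) and block the push on any hit.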
@lizthegrey: Spiderman quote: With great power comes great responsibility. Why do developers keep falling for these? #QConNYC
@lizthegrey: People make insecure decisions because they’re motivated by something other than security (e.g. baby photos). Cognitive limitations of number of distinct passwords remembered. Lack of expertise. #QConNYC
@lizthegrey: Developers are, in addition to failing on those three previous things, overconfident and arrogant. Training developers is harder than training regular employees… ‘This couldn’t happen here.’ –@modMasha #QConNYC
@lizthegrey: We are trustworthy but not infallible. — @guypod We need to protect ourselves when we inevitably make mistakes. #QConNYC
@lizthegrey: Three lessons: (1) learn from past incidents [eg automate security controls, make security the default, educate developers], (2) manage access like a tech giant [e.g. beyondcorp/Cloud IAP, u2f, access controls] #QConNYC
@lizthegrey: PAWs from Microsoft — access to production requires a dedicated secure isolated machine; your personal work is a VM inside that secure host machine. #QConNYC
@lizthegrey: Netflix’s BLESS: no long-lived access; central SSH certificate authority and bastion servers to mediate access. #QConNYC
@lizthegrey: and the most important question: what’s the worst case if it were compromised? How would we detect it? #QConNYC
@lizthegrey: At the end of the day, users trust us. Care about user safety, even if it’s hard and slows us down. Don’t be a malware distribution vehicle. [fin] #QConNYC
@thenewstack: Tech companies mitigate against security breaches by limiting of employee permissions — see the Google #BeyondCorp central proxy, Microsoft’s Privileged Access Workstations and Netflix’s ssh-based Bless— @snyksec’s @guypod #QconNYC @qconnewyork https://t.co/mu8jpOvbNR
by Tanya Reilly
Twitter feedback on this keynote included:
@danielbryantuk: “Fireproof buildings are more effective than fire escapes. Much like building resilient software is more effective than tacking on an ops process and debugging in prod” @whereistanya #qconnyc https://t.co/9qzvPKWJwh
@charleshumble: “An optimistic disaster plan is a useless disaster plan.” @whereistanya #qconnyc
@charleshumble: “I read this patent 3 times and I’m pretty sure this person invented a rope for a fire escape. It’s the most Silicon Valley invention of 1846” @whereistanya #qconnyc
@danielbryantuk: Fascinating insight into the failure of fire mitigation in New York city, via @whereistanya at #QConNYC “Human error is never the root cause of failing to escape from a building fire” https://t.co/HLQJ8RHP8g
@charleshumble: “fire escapes collapse during times of intense use – such as during actual fires.” #qconnyc @whereistanya
@John03000413: Reliability is everyone’s job #qconnyc https://t.co/d9YuveWSU1
@lizthegrey: “You don’t want the NYFD rushing into your kitchen every time you burn toast.” –@whereistanya #QConNYC
@danielbryantuk: “If you missed my subtle metaphor for issues with handling fires, it’s the same with software failure” @whereistanya #QConNYC https://t.co/RTqHDgJ86m
@charleshumble: “The New York fire department recommends you don’t cook when drunk or sleepy. I’d like to respectfully suggest that the same applies to a root prompt.” #QConNYC @whereistanya
@charleshumble: “Fatigue is a form of encumbrance. Push back on this” @whereistanya #qconnyc
@mpredli: “Software without built-in reliability is a tenement.” @whereistanya at @qconnewyork #QConNYC
by Joe O’Neill & Haozhe Gao
Twitter feedback on this session included:
@thenewstack: To monitor its thousands of services, Facebook captures about a billion traces a day (about ~100TB collected), a dynamic sampling of the total number of interactions per day — @Facebook’s Haozhe Gao and Joe O’Neill #QConNYC https://t.co/iHXCirnp3L
by Matt Klein
Twitter feedback on this session included:
@micheletitolo: “Lyft’s architecture wasn’t scaling because the developers didn’t trust the infrastructure” – @mattklein123 #qconnyc
@micheletitolo: “People have partial or no implementations of distributed systems best practices” – @mattklein123 #qconnyc
@micheletitolo: “If I want consistent [microservices best practices] I need to implement them in each language” #qconnyc
@micheletitolo: “If developers don’t trust it, they won’t use it. They would go add their features to the monolith” @mattklein123
@danielbryantuk: Breaking down @EnvoyProxy with @mattklein123 at #qconnyc “We use Envoy as a middle proxy and an edge proxy too — this way there is only one tool to learn” https://t.co/1jINCs0BOS
@nWaHmAeT: If your devs are spending 60% of their time on infrastructure instead of business logic, you’re doing it wrong. @mattklein123 on removing impediments to a microservices architecture
@danielbryantuk: “Observability is vital for modern microservices-based networking” @mattklein123 #QConNYC https://t.co/B85vN8n23x
@danielbryantuk: “There is no traffic at Lyft that does not go through @EnvoyProxy — we have 100% coverage, including observability” @mattklein123 #qconnyc https://t.co/e8sBr115tq
@philip_pfo: “Consistency reduces cognitive load and improves operability of a service.” @mattklein123 #QConNYC
@danielbryantuk: “Distributed tracing is not so useful for fire fighting, but it is fantastic for debugging and performance issues. You need to have 100% coverage of communication though, and also make it easy for devs to access data via tooling” @mattklein123 #qconnyc https://t.co/fCyyBJs1In
@danielbryantuk: “@EnvoyProxy is a universal data plane. There is a thin client for ID propagation and best practices etc, but the sidecar proxy is where the magic happens” @mattklein123 #QConNYC https://t.co/tmnASAhOYl
@danielbryantuk: “In the long term I’m not sure people will even be aware that @EnvoyProxy is running. It will most likely be embedded into container and serverless platforms” @mattklein123 #QConNYC https://t.co/BwNVHQhSky
@thenewstack: The next big thing will be connecting synchronous REST-based systems, such as the ones #Envoy supports, and event-based asynchronous “real-time” systems such as #Kafka. Most companies already use both. Envoy will look into supporting Kafka soon — @mattklein123 #qconnyc
by Susheel Aroskar
Twitter feedback on this session included:
@danielbryantuk: Great start to @susheelaroskar’s #QConNYC talk — the architecture of Zuul Push at Netflix, and why this is important (paraphrasing) “this reduced traffic by 12%” in comparison with pull-based comms https://t.co/H8ZYfYPIdY
@kcasella: @susheelaroskar auto-scaling a push notification system requires looking at open connections, not RPS or CPU #qconnyc @WeAreNetflix https://t.co/GzwSECk1CD
@danielbryantuk: An overview of managing a push cluster like Netflix’s Zuul Push, @susheelaroskar at #QConNYC https://t.co/Y3EALG5npp
@whereistanya: That was an engaging and informative talk by Susheel Aroskar about how Netflix does push notifications. Key takeaways: recycle connections often; autoscale on open connection count; use a websocket-aware or TCP LB.
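The takeaway about autoscaling on open connections can be sketched as a sizing function: because push servers hold long-lived websocket connections, fleet size follows connection count rather than RPS or CPU. The threshold values below are invented for illustration, not Netflix’s actual numbers.

```python
# Size a push fleet from the open-connection count, the metric the talk
# recommends, instead of RPS or CPU utilization.

def desired_instances(open_connections, per_instance_target=50_000,
                      min_instances=2):
    """Return a fleet size keeping each node under its connection target."""
    needed = -(-open_connections // per_instance_target)  # ceiling division
    return max(needed, min_instances)

print(desired_instances(1_200_000))  # → 24
```

Scaling down is the trickier half in practice, which is where the other takeaway helps: recycling connections often keeps long-lived sockets from pinning old instances alive.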
by Haley Tucker
Twitter feedback on this session included:
@danielbryantuk: The @netflix team learned from an experiment of shutting off the non-critical service shard that although everything worked, the critical services saw 25% more traffic… via @hwilson1204 at #qconnyc https://t.co/x4L32VJGcJ
@danielbryantuk: An overview of the principles of chaos from @hwilson1204 at #qconnyc (with a hat tip to @caseyrosenthal et al for https://t.co/LTW9u8p4hU ) https://t.co/ph2JSQ5kdv
@danielbryantuk: “Limit the impact, and pick your times, when running chaos experiments” @hwilson1204 #qconnyc https://t.co/RtThU0YIgE
@danielbryantuk: Interesting to hear from @hwilson1204 about the value automated canary analysis can provide at Netflix (and some of the tooling is now open sourced in @spinnakerio) #qconnyc https://t.co/wlVEzMBtXO
by Tammy Butow
Twitter feedback on this session included:
@lizthegrey: @tammybutow How you apply chaos engineering depends upon the scale of your infrastructure. #QConNYC
@lizthegrey: It’s like riding a bicycle; you can’t just hop on and ride at full speed. “The hello world of chaos engineering is a CPU attack.” –@tammybutow #QConNYC
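The “hello world” CPU attack mentioned here can be sketched with nothing but the standard library: burn every core for a few seconds and verify that your monitoring and alerting actually notice. This is an illustrative sketch, not Gremlin’s implementation; run it only on a machine you are allowed to stress.

```python
import multiprocessing
import time

def burn(seconds):
    """Busy-loop for `seconds`, consuming one CPU core."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass

def cpu_attack(seconds=5, workers=None):
    """Saturate `workers` cores (default: all of them) for `seconds`."""
    workers = workers or multiprocessing.cpu_count()
    procs = [multiprocessing.Process(target=burn, args=(seconds,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    # Tiny demo: burn a single core for half a second.
    burn(0.5)
```

A gameday run would call `cpu_attack(seconds=60)` on one host first (small blast radius), then watch whether dashboards and alerts react before widening the scope.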
@lizthegrey: And once you can ride a bicycle, then you can drive a car, and perhaps drive an F1 car as you get more sophisticated at operating wheeled vehicles. It is a journey that could take multiple years. #QConNYC
@lizthegrey: … and Gremlin’s CEO asserts that chaos engineering is about testing that those graceful degradations work. #QConNYC
@charleshumble: “Google is an expert in outages that you don’t notice – graceful degradations – and testing this is where chaos engineering comes in.” @tammybutow #qconnyc
@lizthegrey: We have to be able to gracefully omit parts of our site that we aren’t able to serve instead of leaving holes in our UI. It’s a cross-functional effort involving product managers and UX, not just infrastructure engineering. #QConNYC
@lizthegrey: What are the implications of our services not working correctly? Sometimes it’s small, but if you work in finance you could cost someone a mortgage and cost them their dream home! (and get you fined by regulators). #QConNYC
@lizthegrey: “If you never get paged, you won’t know what to do when a real failure happens or be able to train engineers.” So @tammybutow used chaos engineering at Dropbox to inject faults for people to train on. #QConNYC
@lizthegrey: Always be careful about affecting real customers while doing chaos engineering. #QConNYC
@lizthegrey: Gremlin provides chaos engineering as a service, allowing simulations of packet loss, host shutdown, etc. with a local agent #QConNYC
@lizthegrey: Laying foundations: defining resiliency. Resilient systems are highly available and durable. They can maintain acceptable service and weather the storm even with failures. #QConNYC
@lizthegrey: We need to know what results we want to achieve. Do thoughtful planned experiments to reveal weaknesses in our system. More like vaccines — controlled chaos. #QConNYC
@charleshumble: “Failure Fridays are dedicated time for teams to collaboratively focus on using chaos engineering practices to reveal weaknesses in your services” @tammybutow #qconnyc
@lizthegrey: Why do we need chaos for distributed systems? Unusual failures are common and hard to debug; systems and orgs scale and chaos engineering helps us learn. #QConNYC
@lizthegrey: We can inject chaos at any layer — API (e.g. rate limiting, throttling, handling error codes…), app, UI, cache (e.g. empty cache -> hammered database), database, OS, host, network, power, etc. #QConNYC
@lizthegrey: So why run these experiments? Are we confident that our metrics and alerting are as good as they should be? “Alert and dashboard auditing aren’t that common but should be practiced more.” [ed: yes.] #QConNYC
@lizthegrey: Do we know that customers are getting good experiences? Can we see customer pain? How is our collaboration with support teams? #QConNYC
@lizthegrey: Are we losing money due to downtime, broken features, and churn? #QConNYC
@lizthegrey: How do we run experiments? Need to form a hypothesis, consider the blast radius, run the experiment, measure results, then find/fix issues and repeat at larger scale. #QConNYC
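The experiment loop in that tweet (form a hypothesis, consider the blast radius, run, measure, then repeat at larger scale) can be sketched as a small driver. `inject_fault` and `error_rate` are hypothetical stand-ins for real attack tooling and your monitoring system; the error budget is an invented example value.

```python
# Escalating chaos-experiment loop: widen the blast radius only while the
# observed error rate stays within budget; stop (fix, then retry) otherwise.

def run_experiment(inject_fault, error_rate, blast_radii, budget=0.01):
    """Run a fault at increasing blast radii, recording (radius, errors)."""
    results = []
    for radius in blast_radii:          # e.g. 1 host, then 10% of hosts, ...
        inject_fault(radius)
        observed = error_rate()
        results.append((radius, observed))
        if observed > budget:           # hypothesis falsified: stop here
            break
    return results

# Demo with a fake fault injector whose errors spike past radius 5.
state = {"radius": 0}
results = run_experiment(lambda r: state.update(radius=r),
                         lambda: 0.05 if state["radius"] >= 5 else 0.001,
                         blast_radii=[1, 2, 5, 10])
print(results)  # → [(1, 0.001), (2, 0.001), (5, 0.05)]
```

The stop-on-budget check encodes the “measure results, then find/fix issues” step: the loop never escalates past a radius where the hypothesis already failed.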
@lizthegrey: Don’t forget to have baseline metrics before you start experimenting. Don’t run before you can walk, it’s okay to start slow. Three key prerequisites: (1) monitoring & observability (e.g. 4 different systems 🙁 🙁 ) #QConNYC
@lizthegrey: (2) Oncall and incident management. If you don’t have any type of alerting and are manually watching dashboards, that’s bad. You need a triage and incident management protocol to avoid treating all outages with the same severity. #QConNYC
@lizthegrey: (3) Know the cost of downtime per hour. [ed: or have clear Service Level Objectives so the acceptable budget is defined by/for you!] #QConNYC
@lizthegrey: Tools that @tammybutow recommends: @datadoghq, @getsentry, and old fashioned Wireshark. #QConNYC
@lizthegrey: The most critical thing is having an IMOC rotation, says @tammybutow [ed: although a good end goal is empowering *every* engineer to become an incident commander]. #QConNYC
@lizthegrey: How do we choose what experiments to run? Identify your top 5 critical systems and pick one! Draw the system diagram out. Choose something to attack and determine the scope. #QConNYC
@lizthegrey: Things to measure in advance: availability/errors, KPIs like latency or throughput, system metrics, and customer complaints. We need to verify we can capture failures. Does our monitoring actually work? #QConNYC
@lizthegrey: https://t.co/lM3QZBNWV8 is a toolkit for running your own gameday. example: a chart for how many hosts we can affect and how much latency we’re going to add to each. #QConNYC
@lizthegrey: Make sure you have a switch for turning off all chaos experiments in case of emergency. #QConNYC
@lizthegrey: Think about what attacks you can run — both on individual nodes, as well as on the edges between the nodes, says @tammybutow. #QConNYC
@lizthegrey: Verify that your k8s clusters are as self-healing as you think they are — will they spin back up correctly if restarted? #QConNYC
@lizthegrey: Resource chaos is also important. Increase consumption of CPU, disk, I/O, and memory to ensure monitoring can catch problems. Make sure that you find limitations before you have to turn away customers. #QConNYC
@lizthegrey: https://t.co/C9QtRdufQY is a known-known experiment that tests situations we can anticipate and is a bicycle for learning. #QConNYC
@lizthegrey: Disk chaos — issues like logs backing up. We can fill up the log partition on a replica or primary and make sure the system can recover. #QConNYC
@lizthegrey: “Use your experience of past outages to prevent future engineers from being burned in the same way.” –@tammybutow #QConNYC
@lizthegrey: Memory chaos: what if we run out of memory? What if it’s across all the fleet? Process chaos: kill or crashloop a process, forkbomb… #QConNYC
@lizthegrey: Shutdown chaos: turn off servers, or turn them off after a set lifetime. #QConNYC
@lizthegrey: k8s pods are a natural target for shutdowns and restarts. or simulate a container that’s a noisy neighbor that kills the containers on its own host. #QConNYC
@lizthegrey: The average lifetime of a container in prod is 2.5 days, and they die in many different ways. #QConNYC
@lizthegrey: Time chaos and clock skew: simulate time drift and different times. (and @tammybutow points out this could have been used for y2k tests) Network chaos: blackhole services, take down DNS. #QConNYC
@lizthegrey: Reproducing outages on demand lets us be confident we can handle them in the future. #QConNYC
@lizthegrey: What were the motivations for chaos engineering? For one, Dropbox and Uber’s worst outages ever (both involving databases). Resources: the gremlin community and https://t.co/czw9Oef1L9 . [fin] #QConNYC
by Mohit Gupta
Twitter feedback on this session included:
@aspyker: #qconnyc “We used Mesos as Twitter used it. And @clever was right across the street. Worked but the Twitter team was 3x our company size.” (foreshadowing of how offloading orchestration is important to small to medium companies)
@aspyker: Key message: design a great control plane (api) that lets you change infrastructure via thin wrappers keeping what your engineers work with stable.

#qconnyc @mohitgupta https://t.co/XDU1fsclON
@danielbryantuk: “Even though we tiered our services (and SLOs) we soon realized that manual processes for deployment and issue remediation don’t scale” @n1kooo #qconnyc https://t.co/1bvI9BVbtM
@danielbryantuk: “We wanted to build a ‘paved road’ for developers to follow when deploying and operating apps. We also wanted it to be self service, as this is efficient and scalable (and engineers don’t want to talk to people too ;-))” @n1kooo #qconnyc https://t.co/nckRCgxGvc
@aspyker: “We wanted kubernetes as the foundation, but didn’t want to show kubernetes to developers” @n1kooo (foreshadowing of a different contract/api?) #qconnyc
@aspyker: Engineers on the Shopify platform use a web form for what they need; the platform submits back a PR to their repo with templates for their “entire” set of dependencies. #qconnyc
@danielbryantuk: Interesting to hear about the use of “cloudbuddies” at @ShopifyEng — effectively extending @kubernetesio with custom controllers that provide operator-style functionality for making a developer’s life easier @n1kooo at #qconnyc https://t.co/EqrKD0pFfS
@danielbryantuk: Very nice developer experience at @ShopifyEng when bootstrapping a new service. Everything is UI-driven and hooks into platform infra. A few clicks and you have all templates generated and the shell app deployed via @n1kooo at #qconnyc https://t.co/wAzFbIvhyX
@danielbryantuk: “Documentation was vitally important for our paved road platform rollout. We focused on how to ‘drive a car’ rather than ‘how to build a car’ — developers typically just want to deploy apps” @n1kooo #QConNYC https://t.co/4e9yZWDcHd
@danielbryantuk: “If you want to build your own PaaS then focus on hitting 80% of use cases, hide complexity, and educate” @n1kooo #qconnyc https://t.co/KKMDtgb3cn
by Emily Nakashima & Rachel Myers
Twitter feedback on this session included:
@micheletitolo: Lots of companies intentionally create distributed systems, whether they are good or bad. Splitting out by Nouns can cause lots of problems since they weren’t the right boundaries. #QConNYC https://t.co/5l4YU2tQAG
@micheletitolo: SaaS products! Whenever you start using a lot of Saas Products, you are creating a distributed system. At a certain point, you need something custom. #QConNYC https://t.co/a4XVxIy0rQ
@micheletitolo: Buying can help you create a reliable system. Don’t be afraid to buy, but make sure to do due diligence. #QConNYC https://t.co/LMMOq08bt6
@micheletitolo: IaaS, PaaS, BaaS, FaaS: more specialized to the left. Fewer use cases, and you’ll outgrow their usefulness faster. #QConNYC https://t.co/a0mUIEynav
@micheletitolo: Figure out how to use specialized tools. They increase cognitive load. So does putting a simple service in a complex system #QConNYC
@micheletitolo: Browsers are a distributed system! You can’t SSH into it, and little opportunity to instrument. Front-end complexity is ever increasing and therefore bugs are getting more complex. #QConNYC https://t.co/TRiNkoM7hc
@micheletitolo: Final conclusions: 1. We are all distributed systems operators and 2. You need to be able to trust your tools #QConNYC https://t.co/uJ09IudtgA
Twitter feedback on this session included:
@micheletitolo: Takeaways from A Refactoring Story by @kytrinyx, probably the best refactoring talk I’ve seen #QConNYC
@micheletitolo: The first thing to ask: should I refactor at all? Is there a reason to change the code? The change defines which axis needs to change. We don’t need infinite flexibility. #QconNYC
@micheletitolo: Rearrange the code to get flexibility we need, and only then add the new feature #QconNYC
@micheletitolo: When extracting, naming is important. If you get the names wrong, the code is harder to change, because names stick around. Use domain concepts, which are less likely to change #QconNYC https://t.co/CL8cq3EO0T
@micheletitolo: Next, dissect your code in place. This gives you information before you commit to a new design #QconNYC
@micheletitolo: Isolate the algorithmic code into methods. Trade off: something that looks incredibly simple, now much more complex. But now you see the bones/structure of the algorithm. #QconNYC
@micheletitolo: First do a parallel implementation, so you can compare new + old. If your tests fail, there’s something else you missed, usually a conditional. Exceptions are the keys to a new insight and unblock abstraction #QconNYC
@micheletitolo: There’s the “primitive obsession” where we want to use basic types, like hashes and strings, but objects help us encapsulate better. #QConNYC
@micheletitolo: Once you’ve solved one problem, move on to the next one favoring fixing duplication. Chip it away to reveal underlying complexity. Rinse. Repeat. #QconNYC
@micheletitolo: Use “The flocking rules” to guide you: find the things that are the most alike, select the smallest difference between them, make the smallest change that removes that difference. #QconNYC
@micheletitolo: Last step is tacking assumptions, which are usually hardcoded. These need to be made flexible #QconNYC
@micheletitolo: Take small steps. Each of your small steps needs to be safe. The higher the uncertainty, the smaller the steps. #QconNYC https://t.co/yCdWOmVx3a
@micheletitolo: Lastly, with legacy codebases this might mean going backwards. That’s okay! By going backwards you surface complexity. Don’t keep all the details in your head, and make understanding things easier #QconNYC https://t.co/MNzcw4HvmV
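The parallel-implementation step from the tweets above can be sketched as running old and new code side by side and asserting they agree; a test failure then points at the case (usually a conditional) you missed. The shipping-cost functions below are hypothetical stand-ins, not from the talk:

```python
def shipping_cost_old(weight_kg):
    # Legacy implementation we are refactoring away from (hypothetical example).
    if weight_kg <= 1:
        return 5
    return 5 + (weight_kg - 1) * 2

def shipping_cost_new(weight_kg):
    # Candidate refactored implementation under test.
    return 5 + max(weight_kg - 1, 0) * 2

def checked(weight_kg):
    """Run both implementations in parallel; a mismatch is a missed case
    and the key to the next insight, per the talk's advice."""
    old, new = shipping_cost_old(weight_kg), shipping_cost_new(weight_kg)
    assert old == new, f"divergence at {weight_kg}: {old} != {new}"
    return new

print(checked(0.5), checked(3))  # 5 9
```

Once the parallel check has run clean against real traffic or the full test suite, the old implementation can be deleted.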
Twitter feedback on this session included:
@micheletitolo: “People don’t listen to what leaders say, they look at what leaders do” – @semanticwill #qconnyc
@micheletitolo: “Most teams aren’t ready for a transformation, mostly because people are overburdened, doing unplanned work etc. Fix the system before introducing new things.” #qconnyc
@micheletitolo: “Overburdening disempowers people and prevents them from doing good work” #qconnyc
Jeanne Boyarsky attended this session:
Goals: send the right message to the right person at the right time using the right channel (ex: email, text, etc.)…
Build trust without stifling innovation. Accountability – what to do with data, who is responsible, continuing to focus on data perception, audit/clean data, make it easy to see what data you have and how to opt out/delete it. Privacy by design – innovate without doing harm, don’t want to get hacked, be user centric, move data to the individual so there is no storing, what is actually PII vs what feels like PII. Anonymize both…
What they did: dropped log storage to 30 days. They have 30 days to comply with requests to delete data, so that is handled by design. For log files, hash email recipients. Kept anonymized PII data, support inquiries, etc. Some customers feel 30 days is too long, so they are looking at going beyond the law.
Twitter feedback on this session included:
@lizthegrey: Senders need to know what recipients have done with the messages they sent. Four key topics: consumer trust, privacy regulations, recent key issues/lessons, and doing it right. #QConNYC
@lizthegrey: Goal of marketing industry: sending the right message to the right person at the right time (and via right channel). #QConNYC
@lizthegrey: Only recently has it become possible to gather enough data to accomplish this. But this requires data handling. Three projects: GDPR compliance, and two feature enhancements. #QConNYC
@lizthegrey: 2/3 of consumers don’t trust brands with PII. And employees don’t trust their company to be GDPR compliant (63% not confident). *after* GDPR 90% don’t believe consent is accurately described yet 31% don’t think they’re personally responsible. #QConNYC
@lizthegrey: ^^ that’s 90% of employees of companies that don’t think their employer’s GDPR disclosures are accurate. 90%. #QConNYC
@lizthegrey: Do we deserve that trust? Well… ~500M identities known to have been stolen to date this year (e.g. email addresses, hashed passwords). #QConNYC
@lizthegrey: Example: Ticketfly had to shut down after losing 27M peoples’ data. Panera: 37M identities stolen, including partial credit card numbers #QConNYC
@lizthegrey: 92M identities stolen off MyHeritage. Fortunately no DNA data stolen, just passwords and emails. 150M (and counting) identities stolen off myfitnesspal. #QConNYC
@lizthegrey: The minimum threshold isn’t just not selling your data, it’s safeguarding it against a breach. If someone phishes your employees’ accounts or gets an S3 token, you don’t want to be a Panera. #QConNYC
@lizthegrey: Landscape of regulations: CASL, CAN-SPAM, EU-US Privacy Shield, & GDPR. And Germany and France are crafting separate regulation. #QConNYC
@lizthegrey: It’s not about the explicit laws, but instead the idea that our customers’ trust matters and that data has an impact upon our brand. Customers will leave us if we break their trust. #QConNYC
@lizthegrey: Demand is on the rise for data scientists/engineers (50% higher than supply). 650% growth in roles over 6 years. #QConNYC
@lizthegrey: Key issues in trust to cover: accountability, privacy by design, and continued innovation. #QConNYC
@lizthegrey: (1) accountability — get, stay, show clean. Audit data inventory, have processes for new data, and provide transparency/opt-out. #QConNYC
@lizthegrey: “We were annoyed at the mess of data marketing was retaining… until we looked at our own logs, in which we kept all production data with no fixed retention.” — @AmieDurr #QConNYC
@lizthegrey: Dropping storage to 30 days makes it the default behavior in compliance with the law, rather than requiring manually cleaning data after an opt-out. #QConNYC
@lizthegrey: Hashing data to make it pseudonymous. Removed unnecessary tracking tags/cookies. Educated customers not to do stupid things like put PII in subject lines/content. #QConNYC
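Hashing email recipients so logs stay correlatable without storing raw addresses, as described above, might look like the following sketch. It uses a keyed HMAC rather than a bare hash (a plain hash of an email address is trivially reversed by a dictionary attack); the key value and function name here are assumptions, not SparkPost's actual scheme:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical; keep in a secret manager and rotate

def pseudonymize_email(address: str) -> str:
    """Keyed hash (HMAC-SHA256) of a normalized address: stable enough to
    correlate log lines for one key's lifetime, but the raw address never
    reaches the log file."""
    normalized = address.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

# Two spellings of the same address produce the same 64-char token.
token = pseudonymize_email("Jane.Doe@example.com")
log_line = f"delivered to={token}"
```

Deleting or rotating the key then pseudonymizes all historical log lines at once, which pairs naturally with the 30-day retention default discussed in the talk.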
@lizthegrey: Separate your abuse logging/heuristics from your other data, and communicate clearly about it in your privacy policy. #QConNYC
@lizthegrey: You don’t have to throw everything out, but you need to know what you have and what you’re using it for to make appropriate decisions. #QConNYC
@lizthegrey: Data protection is a shared responsibility that needs to be continuously done. #QConNYC
@lizthegrey: 7 principles of GDPR; be user centric, and continuously stay clean. #QConNYC
@lizthegrey: Case study: engagement message events from message opens. Used to store the encoded link containing the crypto unhashed & other customer data. Instead migrated that data to no longer live in the links. #QConNYC
@lizthegrey: Best practice is to have the messages forward links for a year, but needed to change behavior: if <30 days, do the join, if >30 days, just pass the link along. #QConNYC
@lizthegrey: Can now no longer see who the messages were to retrospectively unless the user engages. #QConNYC
@lizthegrey: Distinctions between strict PII and possible PII (e.g. try to be non-creepy about geo_ip data even if city level). Don’t just follow the law, look after your consumers and what feels right to them. #QConNYC
@lizthegrey: Stop storing non-anonymized PII, and encrypt/encode data you do need to keep. Aggregate. Explicitly include data management in design docs. #QConNYC
@lizthegrey: “If you do not [include privacy in your design docs], you will forget about it. And make the DPO your best friend.” — @AmieDurr #QConNYC
@lizthegrey: If you delete by default there’s little you have to do around GDPR’s deletion policy. #QConNYC
@lizthegrey: Smart send case study: don’t send mails to disengaged users. So we started backfilling 6 months of data for our beta feature… except to discover it wasn’t hashed. Switched to an appropriate source. But the team missed privacy by design. #QConNYC
@lizthegrey: Someone proposed a feature to compare subject lines within the same industry. Need to communicate early/often with the DPO. Do we need to update our privacy policy? But don’t be afraid to innovate. #QConNYC
@lizthegrey: What’s next? Regulation and innovation aren’t in opposition to each other. We do both — learn from your peers, ask questions, innovate, and build a system of trust. #QConNYC
@lizthegrey: If you get, stay, show clean, consumers will trust you. [fin] #QConNYC
by Kathy Pham
Twitter feedback on this session included:
@lizthegrey: Why was the US government spending billions of dollars on software that didn’t work? In part, because it didn’t understand the needs of the community. #QConNYC
@lizthegrey: Call to action: honor all expertise across academia and industry to build better software. Better outcomes from interdisciplinary collaboration (e.g. at academic institutions) #QConNYC
@lizthegrey: The hierarchy of engineering/tech roles over other disciplines being less valued is unhealthy and produces software that doesn’t serve people. #QConNYC
@lizthegrey: Amazon reconstructed redlining with same-day delivery by not being critical about the data they were using. #QConNYC
@lizthegrey: The incentives aren’t aligned for serving users’ needs and we wind up with tools that don’t work. #QConNYC
@lizthegrey: In recent news: Google, Microsoft, and Amazon in the headlines for employee protests/mobilization against problematic government contracts #QConNYC
@lizthegrey: How do we train people? If we look at the CS curriculum now, people have many choices of focuses, but no dedicated focus on ethics [ed: may have mis-transcribed]. There are at least 197 courses across 188 universities. #QConNYC
@lizthegrey: They exist, but they may not work — otherwise we wouldn’t be in the state we’re in. What can we do to fix this? Embed ethics into the data science curriculum instead of making it a separate class. #QConNYC
@lizthegrey: Make people question what they’re building *as* they’re building it. It’ll take time to tell how effective it’ll be. #QConNYC
@lizthegrey: We need to empower and connect everyone. There’s a lot of power in individual contributors, especially among engineering individual contributors. We are perceived by management as some of the most valuable employees and can utilize that power. #QConNYC
@lizthegrey: Surprisingly, people with social science and computer science backgrounds have a harder time finding jobs than people from pure CS backgrounds, because they don’t look like the “standard template”. #QConNYC
@lizthegrey: Use your voice. “With great power comes great responsibility” — Spiderman. #QConNYC
by Liz Fong-Jones
Twitter feedback on this session included:
@lizthegrey: “Data is a vital tool that helps us describe what is happening in our cities, predict what can happen for people in situations, and evaluate what’s happening around us.” –@QuietStormnat #QConNYC
@lizthegrey: Data determines things for us as individuals — how fast we get home, how much we pay for healthcare, and so forth. #QConNYC
@lizthegrey: How do we develop a better trust with our users and build tools responsibly? “If we don’t do this, the data goes away, and if the data goes away, the innovation goes away.” #QConNYC
@lizthegrey: The power of the individual person matters. We want a vision and understanding of how the world should be. We’re not living up to the promise of what America should be, but with technology, we can in the future. #QConNYC
@lizthegrey: “Not everyone should follow the same code of ethics. Ethics are shaped by how we grew up. We should be respected for our differences.” #QConNYC
@lizthegrey: You have to understand how you define responsible behavior to call other people on it. It’s hard to do data science and do innovative things for social good. #QConNYC
@lizthegrey: Breaches are a large reason for social outrage — “people aren’t screaming because companies aren’t complying with legal regulations, they’re screaming when companies are _violating their trust_. And that’s having an impact.” @QuietStormnat #QConNYC
@lizthegrey: We need to balance individual rights with company profits. A growing tension, and outrage when that tension isn’t respected and addressed. #QConNYC
@lizthegrey: GDPR says that people have not just the right to know what’s being done with their data, but have a say in how their data is used. But it’s expensive and hard to be more nuanced about consent. #QConNYC
@lizthegrey: Ethics is driven by culture, measured by compliance, and determined by society. Building trust requires empowering people to speak about data use, have processes to standardize practices, and leverage technology to hold us accountable. –@QuietStormnat #QConNYC
@lizthegrey: “Data is not technology — data is _people_.” –@QuietStormnat #QConNYC
@lizthegrey: First post-USDS project: Community-driven Principles for Ethical Data Sharing (CPEDS). What does responsible use for data look like? How should we share data? #QConNYC
@lizthegrey: Goal was to define a foundation for the principles of other peoples’ work. e.g. a Hippocratic Oath for Data Scientists. It’s okay for it to fork — the same has happened in medicine. #QConNYC
@lizthegrey: How are we analyzing data and developing software/algorithms? Is it fair? Fostering diversity? Considering unintended consequences? “Ethics is not about being perfect, it’s about being intentional and responsible.” –@QuietStormnat #QConNYC
@lizthegrey: Audience participation exercise: describing how users interact with data, key data lifecycle points of contact, and key questions to check ourselves. #QConNYC
@lizthegrey: e.g. data as “opportunity” or “risk”. requires “trust”. #QConNYC
@lizthegrey: “When we communicate about the APIs we produce, we need to not only say what it _can_ do, we need to say what it _cannot_ do.” –@QuietStormnat #QConNYC
Privacy Ethics – A Big Data Problem
Twitter feedback on this session included:
@lizthegrey: General Data Protection Regulation passed in Europe. Why? “People are more aware of what kind of information companies are storing about them, and are worried about how that data might be used.” — @rags_den #QConNYC
@lizthegrey: Examples: leaks of information from credit bureaus and threat of identity theft. “The cost of storage and compute have decreased so much they’re practically free. Companies are recording more and more data for the benefit of their bottom line revenue.” #QConNYC
@lizthegrey: Surprising discoveries upon acquiring a company: credit card numbers stored in the clear in images that could be OCRed, accessible to any engineer. #QConNYC
@lizthegrey: Large IoT data lakes. We need to understand how to segregate and anonymize our data sets. [ex: Strava heatmaps exposing military bases] #QConNYC
@lizthegrey: Your client list may or may not be sensitive information. Think law firms, doctors… #QConNYC
@lizthegrey: Application logging as another privacy pain point: verbose logging can let developers quickly resolve problems, but the challenge is that our constraints change. Think recording webform answers. What if a sensitive field is added? #QConNYC
@lizthegrey: Usernames can be correlated to other sites and reveal a huge amount of PII about the user. The mapping from transaction to username is a problem. #QConNYC
@lizthegrey: Aggregation services like Splunk and Sumologic make it easier to access logs, bypassing controls on the production environment. Can you audit access? #QConNYC
@lizthegrey: Additional challenge: biased algorithms. e.g. Target outing minors who were pregnant by making inferences from their purchases. #QConNYC
@lizthegrey: Recap: privacy vs. security. Privacy relates to your fundamental rights and how your data is used; very contextual. Security relates to how we protect data and implement controls. #QConNYC
@lizthegrey: Solutions: cultural change is needed. Employees need to understand how to handle data and address customer concerns about how data will be used. #QConNYC
@lizthegrey: For the managers in the room: do you treat privacy/ethics/data-handling as part of your performance review process? #QConNYC
@lizthegrey: Is it cheaper to spend 4% of your global revenue on GDPR fines, or invest more in privacy (current industry average: 0.0004% of revenue spent on privacy engineering) #QConNYC
@lizthegrey: Second solution approach: security. Encrypt data in transit, rest, and your backups. [ed: and encrypting backups lets you delete data faster by deleting the key!] #QConNYC
@lizthegrey: Make sure you have robust, unified authentication mechanisms. Have anonymization/masking and pseudonymization processes. #QConNYC
@lizthegrey: Use secure credential storage such as HashiCorp Vault. Retain detailed audit logs on who is accessing things. #QConNYC
@lizthegrey: Solution 3: design. classify all data you store in terms of sensitivity. Salesforce is a common dumping ground for data #QConNYC
@lizthegrey: Ensure that you are implementing multifactor/multiparty authorization to access data. #QConNYC
@lizthegrey: Obscure data by design; don’t show everything at once such that people can get access to data they don’t immediately need. #QConNYC
@lizthegrey: Don’t make aggressive assumptions about the consent that you’re getting from users. Permission to use for one use is different from permission to use for marketing etc. #QConNYC
@lizthegrey: Provide visibility & transparency to users. Solution 4: process e.g. privacy impact assessments. #QConNYC
@lizthegrey: Ensure that minimal use of data is made; ask questions about what legitimate business purposes we have for asking for information. Give users self-service access to manage data. #QConNYC
@lizthegrey: Honor user consent when processing data. “Challenge your product manager and ask for the consent database.” –@rags_den #QConNYC
@lizthegrey: Store user data only as long as is necessary. What you can delete does depend upon business function/use eg fraud [ed: but you could mask fields, even if not delete…]. #QConNYC
@lizthegrey: Solution 5: Automate. do discovery of data at rest and in motion – label/tag data sources. (this is what Integris does) #QConNYC
@lizthegrey: You can false positive on noise [ed: e.g. that things can look like a phone number that aren’t] so have confidence scores. #QConNYC
@lizthegrey: Make sure you have the ability to access all of your data across all your environments and across all of your data formats. #QConNYC
@lizthegrey: Be aware both of the data in fields as well as the metadata (e.g. is “Jordan” a country, name, or a shoe brand?) #QConNYC
by Erica Windisch
Twitter feedback on this session included:
@lizthegrey: @ewindisch @IOpipes .@ewindisch has seen the challenges of running infrastructure at scale, and wants to help people go beyond infrastructure — making sure that our applications are working for our users. #QConNYC
@lizthegrey: “Are you sure?”, she says. It’s a trick question, because we need to define working. “What is a working application? Up is not online.” #QConNYC
@lizthegrey: There are lots of tools that will do tests like pingdom. [ed: omg preach]. “Can you send an HTTP request and does it return a response is the bare minimum and doesn’t actually tell you if your app is returning API responses.” #QConNYC
@lizthegrey: “Up means that your application needs to be useful for your users. It goes beyond uptime.” #QConNYC
@lizthegrey: If we trust your cloud provider, your application should always be up [ed: for some SLA based value of always]. Or if you trust your k8s install to be flawless, that your containers are always running. #QConNYC
@lizthegrey: But again, that’s infrastructure-level uptime, not application availability. With serverless, you can assume that uptime of your infrastructure is not your problem — there’s nothing you can do about it other than waiting for your provider to fix it. [ed: or having CREs] #QConNYC
@lizthegrey: “Your code always does what you write it to do, but does it do what you *want* it to do?” –@ewindisch #QConNYC
@lizthegrey: Does it help to know assembly and about file descriptors? Yes. “But you shouldn’t *have* to. Making it easier lifts up our developers and teams so that they don’t have to worry about things that they shouldn’t have to worry about.” –@ewindisch #QConNYC
@lizthegrey: Most of the tooling around serverless is completely different from the tooling around containers. We can’t use the same tools we use for k8s in the serverless world. Aggregating data into minutes/seconds and smashing it together doesn’t work for serverless. #QConNYC
@lizthegrey: We need to be able to store more data, get more insights, and get data on singular requests, rather than being limited by number of custom metrics and number of processes. #QConNYC
@danielbryantuk: “Traditional monitoring focused on deployment and uptime. Now we should be asking what value do you provide to your business?” @ewindisch #qconnyc https://t.co/g91Dh2D1ZW
@lizthegrey: The traditional story centers around deployment and uptime. But what value do we provide to our business, and what do we cost? Can we describe what we save the business? It’s hard to show what you mitigated/prevented. #QConNYC
@lizthegrey: Are we *pleasing* our users? This is the metric for if our application is working. We’re supposed to test-learn-repeat. Make sure that the changes you’re making are resonating with users. #QConNYC
@lizthegrey: One way we can get this data is empowering your data scientists. Even if you’re a one-person company, congratulations, YOU are the data scientist. #QConNYC
@lizthegrey: Involve your data scientists the same way that you’d involve operations in software engineering for devops. Make sure data scientists are understanding how users use the application. #QConNYC
@lizthegrey: Key Performance Metrics/Indicators are critical. What is actionable? How many people are using a feature and fall off? #QConNYC
@lizthegrey: The overhead of debugging is irrelevant and will save you in the long term. If you are horizontally scalable with serverless, it’s really cheap to run your debugging all the time. You won’t run out of CPU #QConNYC
@lizthegrey: Data isn’t useful in a vacuum; we can’t just throw it into kafka. Be able to correlate the data. If someone compromised your application, what telemetry do you have? You *can’t* look at the containers, they’re gone already. #QConNYC
@lizthegrey: Can you figure out what’s making your database slow by looking at what queries are correlated with the slowness? #QConNYC
@lizthegrey: How many users of our Alexa skills thanked us? How many cursed us? User happiness matters. #QConNYC
@lizthegrey: Application metrics > infrastructure metrics. They both matter, but application metrics are a superset. Knowing if your application is working can’t be determined just from the infrastructure metrics. #QConNYC
Twitter feedback on this session included:
@danielbryantuk: “Write software that the average developer can support — because sooner or later…” @JoeEmison #QConNYC https://t.co/xVDeLUZ3KY
@danielbryantuk: Great takeaways on @JoeEmison’s #qconnyc serverless patterns and antipatterns https://t.co/c3btCwAuQn
by Bernd Rücker
Twitter feedback on this session included:
@lizthegrey: @berndruecker 3 hypotheses to discuss today: (1) event-driven architectures decrease coupling, (2) orchestration can be avoided for event systems, and (3) workflow engines are painful and aren’t needed with microservices. #QConNYC
@lizthegrey: To simplify, we need to pay for the item, fetch it, and ship it. How do we implement it? We might have bounded contexts of checkout, payment, inventory, and shipment services, implemented as microservices. #QConNYC
@lizthegrey: But we might have a separate set of application processes, infrastructure underlying it, and separate development teams for each microservice. #QConNYC
@danielbryantuk: “Autonomy is a vital theme for microservices” @berndruecker #qconnyc https://t.co/F2Jrvl8jZ7
@lizthegrey: This lets us decouple the request/response dependency. So we have a hammer, everything looks like a nail… can we do the whole chain with events? Checkout broadcasts that an order was placed. #QConNYC
@lizthegrey: Martin Fowler: “it’s easy to make decoupled systems with event notification, without realizing that you’re losing sight of the larger-scale flow.” #QConNYC
@danielbryantuk: “You definitely don’t want to turn microservices development into a three-legged race. Teams should be decoupled” @berndruecker #qconnyc https://t.co/ZWckQqs73I
@lizthegrey: We need to have orchestration, it’s not evil. Implement Martin Fowler’s idea of smart endpoints and dumb pipes. Make the event bus as dumb as possible. #QConNYC
@danielbryantuk: With a hat tip to @samnewman, at #qconnyc @berndruecker states that within a microservice-based system a god service is only created by bad API design, but… “Clients of dumb endpoints easily become a god service” https://t.co/cSbjqiR0OM
@lizthegrey: Smart endpoints potentially must keep transactions running for a long time, rather than immediately handing back success/failure. #QConNYC
@lizthegrey: So we need some kind of state. We can either persist individual things, or we can use state machines. People typically think workflow engines are painful, because they’re using the wrong tools. “Death by the property panel.” #QConNYC
@lizthegrey: People have built more modern tools (Cadence by Uber, Conductor by Netflix…) #QConNYC
@lizthegrey: There are lightweight open source options (Zeebe, jBPM, Activiti), and some horizontally scale. #QConNYC
@danielbryantuk: “The cloud vendors and silicon valley companies are recognizing the power of orchestration via a workflow engine within a microservices architecture. There are now open source offerings too” @berndruecker #qconnyc https://t.co/MEdUj2teHT
@lizthegrey: So onto our distributed systems, we need to think about failures — what happens if we never hear back from the request to try the credit card? We need to clean up state. #QConNYC
@lizthegrey: We don’t necessarily have ACID transactions, so we need to instead define ‘undo’/’compensation’ actions for each action that could potentially fail. #QConNYC
@danielbryantuk: “Workflows live inside service boundaries — there is no central orchestrator. The beauty of using workflow engines is that you get observability into the business processes” @berndruecker #qconnyc https://t.co/z7WBYBDKxr
@lizthegrey: As a summary, the answer to all three questions is “sometimes”. [ed: hah!] [fin] #QConNYC
@danielbryantuk: Great tour de force of using events and (modern) workflow engines by @berndruecker at #qconnyc https://t.co/JaYKKHCPYr
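The ‘undo’/‘compensation’ actions described above are the core of the saga pattern a workflow engine automates. A minimal sketch, with hypothetical charge/reserve/ship steps standing in for the talk's payment/inventory/shipment example:

```python
def run_saga(steps):
    """Execute (action, compensation) pairs in order; if a step fails, run
    the compensations for every completed step in reverse order, then
    re-raise so the caller can see the business transaction failed."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()  # the 'undo' action defined per step
        raise

# Toy run: shipping fails, so the reservation and the charge are undone.
log = []
def fail_ship():
    raise RuntimeError("carrier timeout")

try:
    run_saga([
        (lambda: log.append("charge"),  lambda: log.append("refund")),
        (lambda: log.append("reserve"), lambda: log.append("release")),
        (fail_ship,                     lambda: log.append("noop")),
    ])
except RuntimeError:
    pass
print(log)  # ['charge', 'reserve', 'release', 'refund']
```

A real workflow engine adds what this sketch lacks: persisted state so the saga survives process crashes, timers for the "never hear back" case, and observability into the business process.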
by Liz Fong-Jones & Adam Mckaig
Twitter feedback on this session included:
@thenewstack: When you find a new type of performance issue, the temptation is to add a new set of metrics to a dashboard. Most of the time this is not a good idea. Overly busy dashboards can quickly lead to cognitive overload — Google’s @lizthegrey on #microservices debugging #qconnyc https://t.co/V5KJjBhdTg
by Jonas Bonér
Twitter feedback on this session included:
@lizthegrey: @jboner “Before you drink the Kool-Aid, take a step back and think about whether you really need to do microservices.” Reasons *to* might include scaling your organization. Reasons to not *necessarily* do it are to scale up your system. #QConNYC
@lizthegrey: Many people building microservices wind up building microliths with synchronous, blocking RPCs between microservices directly replacing API calls within the original monolith. And using synchronous datastores. #qconnyc
@lizthegrey: You may have solved the organizational scaling problem, but you haven’t gained the additional benefits of microservices. We can do better than that by thinking in events with domain-driven design. #QConNYC
@randyshoup: “The right reason to do Microservices is to scale the organization” @jboner at #QConNYC
@lizthegrey: Modeling events forces you to think about the behavior rather than the structure of the system — Greg Young. #QConNYC
@danielbryantuk: “When designing microservice systems don’t focus on the things; focus on what happens” @jboner #QConNYC https://t.co/gZWGar5iUx
@lizthegrey: Information travels at the speed of light and we’ll always have non-zero latency. There is no now, and information is always from the past. #QConNYC
@lizthegrey: Distributed systems are non-deterministic. We live in a scary world where messages get lost, and where systems fail in new ways. [ed: yes! finally :)] #QConNYC
@lizthegrey: We need to model uncertainty and account for it in our business logic. #QConNYC
@lizthegrey: Autonomous components can only promise their own behavior; making everything local improves stability. [ed: this is outsourcing *all* your risk onto your event bus. Make note that GCP’s PubSub is 99.95% available, for instance] #QConNYC
@lizthegrey: Think of modeling things as State, Commands, and Events. You’ll never fully converge. #QConNYC
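The State/Commands/Events triad mentioned in the talk can be sketched as follows. This is a minimal illustration, not code from the session; the class names and the `decide`/`evolve` split are my own assumptions about how the model is commonly written down:

```python
from dataclasses import dataclass, replace

# Commands express intent; Events record facts that have already happened.
@dataclass(frozen=True)
class Deposit:          # command
    amount: int

@dataclass(frozen=True)
class Deposited:        # event
    amount: int

@dataclass(frozen=True)
class Account:          # state
    balance: int = 0

def decide(state: Account, cmd: Deposit) -> list:
    """Validate a command against current state and emit events (no mutation)."""
    if cmd.amount <= 0:
        return []
    return [Deposited(cmd.amount)]

def evolve(state: Account, event: Deposited) -> Account:
    """Fold an event into state; the only place state ever changes."""
    return replace(state, balance=state.balance + event.amount)

state = Account()
for ev in decide(state, Deposit(100)):
    state = evolve(state, ev)
# state.balance is now 100
```

Keeping `decide` and `evolve` pure is what makes the later replay and recovery patterns possible.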
@lizthegrey: There is no now, and resilience is by design. [ed: go on, tell me more...] We need to manage failure rather than avoid it. #QConNYC
@danielbryantuk: “A system of microservices is a never ending stream towards convergence” @jboner #QConNYC https://t.co/PM0o83ymnX
@lizthegrey: Good failures are contained to avoid cascading failures, reified into events, signaled async, observed by many, and managed outside the failed context. #QConNYC
@lizthegrey: Events need to be persisted. How do we transition from a CRUD system to events? Atomically double-write both to a table and to the event bus [ed: no details provided on how to do this…]. #QConNYC
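One common way to get that atomic double-write is the “transactional outbox” pattern: write the business row and the event in the same local transaction, then have a separate relay publish from the outbox table. The talk gave no details, so this is only a sketch of that pattern; the table names and relay loop are illustrative assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    # One local transaction: state change and event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')",
                     (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay_once(publish) -> None:
    # A separate relay reads unpublished events, pushes them to the bus,
    # then marks them published (at-least-once delivery).
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

sent = []
place_order(42)
relay_once(sent.append)
# sent now holds the OrderPlaced event for order 42
```

The relay can crash between publishing and marking, so consumers must tolerate duplicates; that is the usual trade-off of this pattern.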
@lizthegrey: We don’t get full consistency, but we get eventual consistency by subscribing to the event bus. #QConNYC
@lizthegrey: The truth is the log, and the database is the cache of the subset of the log. “It’s cheap to store data, why not store the entire log?” [ed: the Privacy/Ethics track members would… disagree with this premise, as well as the latency of going through the entire log] #QConNYC
@danielbryantuk: “Update-in-place strikes systems designers as a cardinal sin…” @jboner #QConNYC https://t.co/ISPDGfsbIm
@lizthegrey: Event sourcing can act as a cure for destructive updates: log all state-changing events to maintain strong consistency and durability. #QConNYC
@lizthegrey: To recover from failures, rehydrate events from the event log and re-run the internal state, and don’t run the side effects. [ed: what happens if the side effects didn’t run at least once?] #QConNYC
@lizthegrey: We get one source of truth with all history, and smaller durable in-memory state. Avoids in-memory object to stored relational mismatch. #QConNYC
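The rehydration idea above can be sketched as: replay persisted events through the pure state transition, gating side effects behind a flag so they only fire for live events. A minimal illustration with names of my own choosing, not code from the talk:

```python
outbound = []

def send_email(to: str) -> None:
    # Stand-in for a real side effect.
    outbound.append(to)

def apply(state: dict, event: dict, *, replaying: bool) -> dict:
    """Pure state update plus side effects that are suppressed during replay."""
    if event["type"] == "EmailRequested":
        if not replaying:
            send_email(event["to"])   # side effect: skipped when replaying
        return {**state, "emails_sent": state.get("emails_sent", 0) + 1}
    return state

def rehydrate(event_log: list) -> dict:
    """Rebuild in-memory state from the log without re-running side effects."""
    state: dict = {}
    for event in event_log:
        state = apply(state, event, replaying=True)
    return state

log = [{"type": "EmailRequested", "to": "a@example.com"},
       {"type": "EmailRequested", "to": "b@example.com"}]
state = rehydrate(log)
# state records two emails sent, but no emails are re-sent during replay
```

This also makes the editor’s question concrete: if the original side effect never ran, replay alone will not run it either, so the system needs a separate reconciliation path.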
@lizthegrey: Another pattern to deploy is CQRS ( https://t.co/n28mfZwoRK ) for separating reads and writes. #QConNYC
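CQRS as mentioned here splits the write side (commands appended to an event log) from the read side (projections built by subscribing to those events). A minimal sketch, with the projection shape being my own illustration rather than anything shown in the session:

```python
from collections import defaultdict

events: list = []   # write side: append-only event log

def handle_deposit(account: str, amount: int) -> None:
    """Command handler: validates and appends an event; never touches read models."""
    if amount > 0:
        events.append((account, amount))

def project_balances(log: list) -> dict:
    """Read side: a projection rebuilt (or incrementally updated) from the log."""
    balances = defaultdict(int)
    for account, amount in log:
        balances[account] += amount
    return dict(balances)

handle_deposit("alice", 50)
handle_deposit("alice", 25)
handle_deposit("bob", 10)
balances = project_balances(events)
# balances maps alice to 75 and bob to 10
```

Because the read model is derived from the log, it can lag behind writes, which is exactly the eventual consistency mentioned in the earlier tweet.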
@wesreisz: Genius… @jboner comparing the event log to an accounting ledger. Would you ever destroy data by overwriting it in a financial ledger? Why do we keep doing it with CRUD? #QConNYC
@lizthegrey: Time travel lets us do historical debugging, auditing, failure recovery, and replication all for free. #QConNYC
@lizthegrey: Key takeaways: use event-first design to modernize and reduce risk. Event logging avoids CRUD/ORM and lets us retain history, balancing strong/eventual consistency. You can start with https://t.co/3RpnUSLJyM or read https://t.co/Q8BdgPjWVw [fin] #QConNYC
by Michael Bryzek
Twitter feedback on this session included:
@danielbryantuk: “Sometimes the quality of software architecture is only revealed after several years of an application being worked on” @mbryzek #QConNYC https://t.co/BMiUneKnqJ
@danielbryantuk: Common microservices misconceptions, courtesy of @mbryzek at #QConNYC https://t.co/LI2J1xEHcR
@philip_pfo: “Automation tooling enables everyone to benefit from specialist expertise without needing to be a specialist” @mbryzek #QconNYC
@danielbryantuk: “At Flow we have a single developer CLI tool that contains all of our automation and workflow utils” @mbryzek #QConNYC https://t.co/T6BrRMZwMD
@danielbryantuk: “Continuous delivery is a prerequisite to managing microservices architecture. This should be 100% automated, and 100% reliable” @mbryzek #QConNYC https://t.co/roHANR6ZH6
@danielbryantuk: Critical decisions with a microservice architecture, via @mbryzek at #QConNYC
