A Tale of a Broken Drone

Thursday

05:17

Lara snoozed her alarm one last time and finally woke up. Today she had to start her day very early: she planned to attend her first jiu-jitsu class that evening and wanted to finish her shift earlier than usual. Lara worked as a drone operator for RoboPlanter, a local startup developing technology to use drones as tree-planting machines. Her job was to drive the vehicle carrying drones and seeds (called the “mothership”) as close as possible to the target area, and to launch and recharge the drones when necessary. Since the drones flew autonomously, Lara’s primary responsibility was to supervise the process.

She was one of the 14 drone operators on shift that day. Most of them started work at 8:00 am and usually drove 50-70 km to their target locations.

Lara went through her morning routine, filled her thermos with coffee, and packed two chocolate bars into a small backpack. Still sleepy, she got into the company’s 4x4, started the engine, and set off for today’s “target area”: a hill in a remote forest that had been clear-cut a few years earlier.

07:06

Mike left his apartment and headed to the bus stop, listening to a recent podcast through his headphones. Podcasts and audiobooks made his daily commute less boring: every day he rode the bus for about an hour, because the RoboPlanter office where he worked as a DevOps engineer was far from his place. It was Thursday morning, no new product releases were planned, and it looked like it was going to be an easy day.

07:54

The “target area” was at the end of a rough, muddy dead-end road. Lara parked the car and started her usual routine. The area turned out to be a canyon with steep banks on both sides.

She unloaded and assembled the drones, lined them up on the ground, got back into the car, and turned on her laptop. After the regular checks, she hit the “launch mapping” button. Messages flickered across the screen: first “calculating mapping route”, then “loading mapping route”, and finally “ready to fly”.

“Off you go!” said Lara, pressing a button on the keyboard. One of the drones lazily lifted off and flew out toward a distant corner of the canyon.

Satisfied with the progress, Lara took her thermos, poured coffee into her mug, and enjoyed her drink with a king-size chocolate bar.

08:15

Mike was surprised by a plan pinned to a whiteboard in the hall. “What do you mean by ‘Disaster Recovery Drill’?” he asked RoboPlanter’s CTO.

“The Drill. We are about to execute it at 08:30. Didn’t you get an email?”

“Probably, I missed it. What was it about?”

Mike had seen an email with that title, but he had been too lazy to open it.

“We shut down all instances in a single availability zone. It will help us test whether we are resilient enough to withstand such a loss in real life.”

“Ah. Ok. It should not be too bad. The Load Balancer should regulate the load and move it to another availability zone automatically.”

“Theoretically, yes. But we haven’t practiced it before.”

“Well, I’ll be doing my job as usual. Drones must fly, right?”

“Yep. Just try recording anything unusual that happens. The whole point of the drill is to learn from mistakes.”

08:24

Lara watched closely as the drone circled over the canyon. It was now about halfway up one of the slopes, and something was wrong: it was not gaining enough altitude. Usually, a drone flew at 30 meters above ground level. This one was flying much lower and had kept the same elevation since launch.

“That’s bad. That’s very bad,” said Lara out loud.

She was standing about 50 meters away from the “mothership” when she realized the drone was about to crash. She ran as fast as she could to her car and pushed the “Return To Home” button on the drone’s remote control, which aborts the flight program and forces the drone back to its launch point. It didn’t help: a moment later she heard the sound of plastic rotors hitting the ground.

“It’s going to be a long day,” Lara sighed.

08:47

Lara stood on a slope next to the pieces of the drone. Company policy required her to report the problem to headquarters, collect as many fragments as possible, and launch the second (spare) drone. She decided to make an extra phone call to the DevOps guys: she was absolutely sure the crash wasn’t accidental.

 “Hello?”

“Hello Mike, it’s Lara. I’ve got a small problem here.”

“Hi, Lara. You are an early bird today. How can I help you?”

“The drone has just crashed into the ground.”

“Damn. Are you alright?”

“Yes, I’m fine. Listen, I am pretty sure it was not a mechanical malfunction.”

“Oh, it’s good that you’re OK. It’s not your fault it crashed. The hardware is still in “beta” mode. Try to gather as many pieces as possible; the hardware specialists will take a look at it tomorrow. Don’t worry too much about it, OK?”

“No, you didn’t get it. The drone couldn’t reach the right altitude. It crashed because the flight route wasn’t adjusted to my terrain.”

“That’s not possible. We have elevation correction and obstacle avoidance. It had to be something mechanical.”

“Mike, I’m telling you. I was watching the drone for the last few minutes before it crashed. It couldn’t gain enough height. I’m sure that if I launch the second drone, it will fall too. You guys messed something up in the routing.”

“Hm... OK, I’ll check this out. What’s the drone ID? Does it end with “f5c2”?”

“Give me a sec here... No. It says “a4z9”.”

“Ok, found it. I’ll check what route this bird was flying. Thank you for the call.”

“Yeah, but will you call me back? I don’t want to launch the second one unless I’m sure routes are correct.”

“Sure, will do, no worries.”

“Ok, waiting. Bye!”

“Yes, take care.”

09:12

“That’s bad, that’s very bad,” said Mike out loud.

He stared at the zig-zag mapping route taken by the crashed drone, looking for the elevation corrections. There were none: the entire path had no elevation changes, and the drone had been directed to fly at 30 meters throughout the whole route. Mike checked the terrain elevation data at the location where Lara was working that morning.

The elevation difference between the top and the bottom of the canyon was about 40 meters. Lara was right: the route wasn’t adjusted for the elevation.
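For readers who want to picture the check Mike was doing, here is a minimal sketch of a clearance check in Python. The waypoint fields, the helper get_terrain_elevation(), and the 5-meter safety margin are all assumptions made for illustration; they are not RoboPlanter’s actual code.

```python
# Hypothetical sketch: flag route segments whose commanded altitude leaves
# too little clearance above the terrain. Waypoint fields and the terrain
# lookup are assumptions for illustration only.


def get_terrain_elevation(lat: float, lon: float) -> float:
    """Ground elevation in meters above sea level at a point.

    Placeholder: a real implementation would query a DEM or terrain service.
    """
    raise NotImplementedError


def find_low_clearance(waypoints, min_clearance_m=5.0):
    """Return waypoints where the route flies dangerously close to the ground."""
    problems = []
    for wp in waypoints:
        ground = get_terrain_elevation(wp["lat"], wp["lon"])
        clearance = wp["altitude_msl"] - ground
        if clearance < min_clearance_m:
            problems.append((wp, clearance))
    return problems
```

A route flown at a constant absolute altitude fails such a check as soon as the terrain rises by more than the initial clearance, which is exactly what happened in Lara’s 40-meter-deep canyon.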

There were around 15-20 minutes left until the other drone operators arrived at their target areas and started launching their drones. If their mapping routes were not adjusted for elevation, there would be more crashed drones that morning.

He pulled the mapping routes generated the day before. All of them had elevation corrections.

Mike concluded that the problem had been introduced by yesterday’s upgrade of the Routing Service. Lara was the first to get a route from the updated service.

Mike decided to roll back the Routing Service to the previous version. It was running on EC2 in an Auto Scaling group, so he had to create a new launch configuration, terminate the existing instances, and wait until new instances were created.
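In boto3 terms, such a rollback might look roughly like the sketch below. The group name, AMI ID, and instance type are invented for illustration; the story does not describe RoboPlanter’s actual configuration.

```python
# Hypothetical sketch of rolling back an Auto Scaling group to a previous AMI.
import boto3

asg = boto3.client("autoscaling")

# 1. Register a launch configuration that points at the previous, known-good image.
asg.create_launch_configuration(
    LaunchConfigurationName="routing-service-rollback",  # invented name
    ImageId="ami-0123456789abcdef0",                     # previous AMI (assumed)
    InstanceType="t3.medium",
)

# 2. Tell the Auto Scaling group to use it for all newly launched instances.
asg.update_auto_scaling_group(
    AutoScalingGroupName="routing-service-asg",          # invented name
    LaunchConfigurationName="routing-service-rollback",
)

# 3. Terminate the instances running the broken version; the group replaces
#    them from the new launch configuration, keeping the desired capacity.
group = asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=["routing-service-asg"]
)["AutoScalingGroups"][0]

for instance in group["Instances"]:
    asg.terminate_instance_in_auto_scaling_group(
        InstanceId=instance["InstanceId"],
        ShouldDecrementDesiredCapacity=False,
    )
```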

09:27

"Hello?"

"Lara, it’s Mike. You were right. It wasn’t mechanical."

"I told you. Ok, what should I do next?"

"I’m rolling back Routing Service. It will be back online in a few minutes."

"So I’m waiting 15 minutes, and launching another drone?"

"Yes, it should be fine in 15 minutes. Thanks for noticing, we could have lost more drones today because of this bug."

"No problem Mike. Hope it will work this time."

"Sure, take care."

"Bye!"


09:45

Lara repeated the routine checks for the second drone and hit the “launch mapping” button for the second time that morning. The message on the screen read “calculating mapping route”. After a few minutes, the message was still the same. Lara opened the thermos again and took a long sip of hot coffee. “It’s going to be a long day,” she whispered, staring at the screen.

10:01

"Hello?"

"Mike, it’s Lara again."

"I know, I am working on it."

"The process freezes on “calculating mapping route”."

"Lara, I know, I already had three calls from our operators." "Everyone has the same problem. We lost one availability zone on AWS. There is only one instance left in the group. You all launched the mapping requests at the same time. It’s choking under the load. I can’t talk right now, sorry."

"Wait. What should I do now?"

"I’ll send an email to everyone when it’s resolved."

10:02

Mike looked at the SQS queue size for the routing requests. It now held 92 items, and the number kept growing. The operators’ client software requested routes, got no responses, and fired off new requests one after another. “We are DDoS-ing ourselves!”

But even with only one instance handling the route calculations, it should have worked. Something else was definitely wrong here. Why weren’t the clients getting their routes back?

Mike remembered that route replies were delivered to the clients via client-specific SQS queues. He pulled the list of SQS queues to check their sizes, but was surprised to find only one: the request queue, still active and still growing. The reply queues weren’t there at all.
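The check itself is only a couple of SQS API calls. Here is a minimal sketch of it, assuming boto3 and made-up queue names ("routing-requests" for the shared request queue and a "route-reply-" prefix for the per-drone reply queues); the real naming scheme isn’t given in the story.

```python
# Hypothetical sketch: inspect the request queue depth and list the reply queues.
import boto3

sqs = boto3.client("sqs")

# How deep is the shared routing request queue?
request_url = sqs.get_queue_url(QueueName="routing-requests")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=request_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
print("pending routing requests:", attrs["Attributes"]["ApproximateNumberOfMessages"])

# Which per-client reply queues exist? In Mike's case this list came back empty.
reply_queues = sqs.list_queues(QueueNamePrefix="route-reply-").get("QueueUrls", [])
print("reply queues:", reply_queues or "none found")
```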

Mike pulled the logs from the Routing Service. They contained multiple error messages such as:

“Error: Routing reply publishing failed. Can’t find reply queue for publisher…”

He filtered the error messages for more clues. The earliest one, from 09:31:16, said:

Client queues creation failed. Can’t reach drone directory. CURLcode:CURLE_COULDNT_RESOLVE_HOST

Mike’s thoughts started running like wild horses.

“Can’t reach drone directory”.

“Can’t reach drone directory.”

The Drone Directory was a separate service with a simple API that provided other services with up-to-date information about the drone fleet.

It was important but rarely used. The Disaster Recovery Drill had shut down all the instances in one availability zone; most probably, the Drone Directory was not properly balanced across the two availability zones and was unavailable right now.


When the Routing Service shut down, it deleted the existing SQS reply queues. On startup, it was supposed to recreate them from scratch: pull the list of active drones from the Drone Directory and create a reply queue for each drone, using the drone ID as the queue identifier. Without the drone IDs, the reply queues were never created, and the routes were never delivered.
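A minimal sketch of that startup step, assuming a simple HTTP API for the Drone Directory and the same invented "route-reply-" queue naming as above, could look like this; only the overall flow comes from the story.

```python
# Hypothetical sketch of the Routing Service startup step that failed:
# fetch the active drones from the Drone Directory, then create one SQS
# reply queue per drone ID.
import boto3
import requests

DRONE_DIRECTORY_URL = "http://drone-directory.internal/api/drones"  # invented URL

sqs = boto3.client("sqs")


def create_reply_queues():
    # If the Drone Directory host cannot be resolved (as during the drill),
    # this raises and no reply queues get created -- the failure Mike observed.
    response = requests.get(DRONE_DIRECTORY_URL, timeout=5)
    response.raise_for_status()
    for drone in response.json():
        drone_id = drone["id"]
        sqs.create_queue(QueueName=f"route-reply-{drone_id}")
```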

Now the puzzle was solved.

10:12

“Hello?”

“Mike, it’s me again.”

“Lara, I’m almost done.”

“Great, but I have something interesting for you. I know why obstacle avoidance failed in my crashed drone.”

“Why?”

“The obstacle avoidance jumper switch on the board was set to “debug”.”

“Damn. It explains a lot.”

“Yes, sir. I’ll kick those hardware experts’ asses once I get back. This drone was in for repair a week ago, and it has been flying without obstacle avoidance ever since.”

“Hmm... Formally they are not to blame.”

“What? Then who is? I lost a drone and 2 hours of my working shift because of that!”

“Listen, I’m doing my best to get things done. I’ll drop an email to all drone operators to check jumper switches on their birds so that we won’t lose more gear in the future.”

“Whatever, it doesn’t matter. It won’t help my bird.”

“Anyway, thank you for noticing that. I have to get back to work. Take care.”

10:17


The CTO walked in and stopped by Mike’s desk. Mike gave him a long look.

“How bad is it?” the CTO asked in a friendly tone.

Surprised by the CTO’s relaxed mood, Mike replied:

“Well, we lost one drone. A missing elevation correction caused the crash. I rolled back the Routing Service to the previous version, but the rest of the drones still can’t be launched: the Routing Service isn’t delivering routes. Nobody in the field is flying right now.”

The CTO seemed genuinely happy about the outcome. “Do you know why the Routing Service is failing?”

“I have an error log and an unproven hypothesis. I believe the key problem is that the Drone Directory became unavailable because of the Disaster Recovery Drill.”

The CTO was getting even happier. What’s wrong with this guy?

“Nice! I guess that’s enough learning for today. I’d say the Disaster Recovery Drill was completed successfully. I’m going to bring the second availability zone back online.”

Mike sighed in relief. The CTO continued:

“Mike, thank you for your efforts this morning. You were awesome. Please make sure our birds can fly again as soon as possible; I’ll send the dev team lead to assist you. When you’re done, please prepare a short presentation about what happened today and share your experience with the dev team. It’s good for them to know.”

10:27

The message on the screen finally changed from “calculating mapping route” to “loading mapping route” and, at last, “ready to fly”.

Lara hesitated for a few minutes, but then she launched the drone and carefully watched its altitude. It worked well this time; the drone kept the right distance from the ground. There were no more surprises during the rest of the day, and she even made it to her evening plans. She promised herself to fill in the incident report first thing next morning. Friday was a no-fly day, and all drone operators worked at the company’s office, getting ready for the next flying week.

Friday

10:30

The large meeting room was full. It was the company’s weekly meeting, where management updated the staff on recent events. Everyone was relaxed and easygoing. Rumors about the crashed drone and the bugs in the routing algorithms had spread like wildfire, and the company leadership understood how important it was to explain to the whole team what was going on.

First, Mike was asked to report on the previous day’s events. He briefly described the problems and how he managed to resolve them. When Mike mentioned that the deliberate shutdown of an availability zone had delayed the recovery, people in the room exchanged surprised looks. Who on earth would deliberately disrupt a working system? After Mike finished his talk, the CTO took the floor.

“Mike, thanks for your briefing and for your professional behaviour yesterday. Indeed, we executed a Disaster Recovery Drill yesterday morning to test our resilience. Our service depends on multiple third-party providers, on their availability and their reliability. There is no way we can ensure that all our dependencies work as expected, or predict exactly which services will become unavailable and when. It’s also impossible to anticipate all the potential variations of the inputs; in our case, those are the target area conditions, terrain, seeding requirements, and so on.

Although we do rigorous testing, it’s quite challenging to prepare for every possible combination of input parameters, third-party service availability, and user data.

The only way we can carry on is to flip our perception of failure. Embracing and learning from failure is our cornerstone principle. Since we cannot predict what may happen to a real-world system, we should anticipate and embrace errors, learn from them, and not allow the same mistakes to occur twice. Yesterday we deliberately turned off part of our infrastructure to determine what impact it would have. We ran multiple tests beforehand in local environments, but we could not anticipate the complications of a real situation.

As you already know from Mike’s explanation, we lost one drone yesterday. Like most aviation crashes, it was caused by several non-critical malfunctions that together produced a fatal outcome.

First, our Routing Service had a bug that excluded elevation corrections from the flight route. It showed up only for canyon-like areas with large elevation differences. This flaw was compounded by a defect in the drone’s obstacle avoidance setup.

Second, the crashed drone had its obstacle avoidance jumper switch set to “debug”.

Both of those flaws have already been mitigated. We now have a testing suite for canyon-like areas, and we have updated the hardware maintenance routine: from now on, a drone will not leave the maintenance zone with its jumper switches in the wrong position.

The cost of the drone is fully compensated by the insights we gained. Moreover, the Disaster Recovery Drill revealed another potentially dangerous problem: the Drone Directory wasn’t load-balanced across the two availability zones, and as a result it was not accessible to the other services. We have already taken steps to correct that.

None of this learning would have been possible without open and honest communication inside the team. I would like to thank both Lara and Mike for the way they communicated yesterday; they were proactive in resolving the situation. Let’s keep it that way in the future.

We will keep running these drills; we want to make them part of our company culture. If we want to remain successful, we have to be well prepared for any Black Swan. Now let’s get back to work! We have a mission to reforest half of the continent!”



