INTRODUCTION – Crashing Programs
In this module, you will tackle the age-old question: “Why has my program crashed?” You will learn how to analyze system and application crashes, which tools can shed light on unknown causes, and which log files hold clues about what went wrong. You will then look at why code crashes and what you can do to prevent those failures, including what unhandled errors and exceptions are and how to respond to them.
In addition, the module covers incident handling at a larger scale, using the example of an e-commerce site that returns an error for one out of every five requests. It closes with postmortems: what they are and how they help you find and prevent the causes of similar incidents in the future.
Learning Objectives:
- Distinguish between application and system crashes.
- Build debugging and log-reading skills to establish the reasons for crashes.
- Understand the kinds of code crashes and their causes, such as invalid memory access.
- Deal with unhandled errors and exceptions, and use printf-style debugging to investigate them.
- Know the importance of communication and documentation during outages and errors.
- Understand what a postmortem is and what it should cover.
PRACTICE QUIZ: WHY PROGRAMS CRASH
1. When using Event Viewer on a Windows system, what is the best way to quickly access specific types of logs?
- Export logs
- Create a custom view (CORRECT)
- Click on System Reports
- Run the head command
Nailed it! Creating a custom view filters logs by specific criteria, so you see only the entries that match the conditions you define.
2. An employee runs an application on a shared office computer, and it crashes. This does not happen to other users on the same computer. After reviewing the application logs, you find that the employee didn’t have access to the application. What log error helped you reach this conclusion?
- “No such file or directory”
- “Connection refused”
- “Permission denied” (CORRECT)
- “Application terminated”
Keep it up! A “Permission denied” error indicates that the user does not have permission to access or run the application.
3. What tool can we use to check the health of our RAM?
- Event Viewer
- S.M.A.R.T. tools
- memtest86 (CORRECT)
- Process Monitor
Awesome! memtest86 is a memory testing tool for x86-architecture systems. It writes patterns to memory locations and reads them back to confirm they match, which exposes errors in the system’s RAM.
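The write-then-read-back idea behind memory testers can be sketched in a few lines of Python. This is only a toy illustration on an ordinary bytearray, not how memtest86 itself is implemented; the function name pattern_test and the chosen bit patterns are assumptions made for this example.

```python
def pattern_test(buffer, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each pattern to every byte, read it back, and collect mismatches."""
    errors = []
    for pattern in patterns:
        # Write pass: fill the whole buffer with the current pattern.
        for i in range(len(buffer)):
            buffer[i] = pattern
        # Read pass: every byte should still hold the pattern.
        for i in range(len(buffer)):
            if buffer[i] != pattern:
                errors.append((i, pattern, buffer[i]))
    return errors


errors = pattern_test(bytearray(1024))
print("mismatches found:", len(errors))
```

A real memory tester runs outside the operating system so that it can exercise physical RAM directly, which a script like this cannot do.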
4. You’ve just finished helping a user work around an issue in an application. What important but easy-to-forget step should we remember to do next?
- Fix the code
- Report the bug to the developers (CORRECT)
- Reinstall the program
- Change the user’s password
Right on! When you find a reproducible error in a program, report it to the developers in detail: how to reproduce the error, how the program should behave, what it actually did, and any pertinent error messages or logs. This helps developers identify and resolve the issue faster.
5. A user is experiencing strange behavior from their computer. It is running slow and lagging, and having momentary freeze-ups that it does not usually have. The problem seems to be system-wide and not restricted to a particular application. What is the first thing to ask the user whether they have tried?
- Adding more RAM
- Reinstalling Windows
- Identifying the bottleneck with a resource monitor (CORRECT)
- Upgrading their HDD to an SSD
Woohoo! The first step in troubleshooting any problem is to figure out its root cause. Resource monitors such as Activity Monitor (macOS), top (Linux and macOS), and Resource Monitor (Windows) show whether the bottleneck is, for example, high CPU utilization or heavy memory usage. Knowing where the bottleneck lies tells us where to focus further analysis and resolution.
6. A user reported an application crashes on their computer. You log in and try to run the program and it crashes again. Which of the following steps would you perform next to reduce the scope of the problem?
- Check the health of the RAM
- Switch the hard drive into another computer
- Check the health of the hard drive
- Review application logs (CORRECT)
Awesome! Reviewing the logs for anything that could explain the crash is typically the next troubleshooting step. Logs often contain error messages, stack traces, or records of unusual activity that point to the cause and suggest where to look next.
7. Where should you look for application logs on a Windows system?
- The /var/log directory
- The .xsession-errors file
- The Console app
- The Event Viewer app (CORRECT)
Great job! The Event Viewer app contains logs on a Windows system.
8. An application fails in random intervals after it was installed on a different operating system version. What can you do to work around the issue?
- Use a wrapper
- Use a container (CORRECT)
- Use a watchdog
- Use an XML format
Nice work! A container packages an application together with the dependencies it needs, so it runs in the same environment regardless of the operating system version underneath. Because the container is isolated, the application also does not interfere with other software running on the same machine.
9. Where is a common location to view configuration files for a web application running on a Linux server?
- /etc/<app folder> (CORRECT)
- /var/log/<app folder>
- /srv/<app folder>
- /<app folder>
Right on! The /etc directory will contain the application folder that stores configuration files.
PRACTICE QUIZ: CODE THAT CRASHES
1. Which of the following will let code run until a certain line of code is executed?
- Breakpoints (CORRECT)
- Watchpoints
- Backtrace
- Pointers
Way to go! Breakpoints let code run until a certain line of code is executed.
2. Which of the following is NOT likely to cause a segmentation fault?
- Wild pointers
- Reading past the end of an array
- Stack overflow
- RAM replacement (CORRECT)
Nice work! Replacing RAM is unlikely to cause a segmentation fault. Segmentation faults typically come from code that accesses invalid memory, for example through wild pointers, reads past the end of an array, or a stack overflow.
3. A common error worth keeping in mind happens often when iterating through arrays or other collections, and is often fixed by changing the less than or equal sign in our for loop to be a strictly less than sign. What is this common error known as?
- Segmentation fault
- backtrace
- The No such file or directory error
- Off-by-one error (CORRECT)
Nice work! An off-by-one error occurs when a loop iterates one time more or one time fewer than it should. The loop’s boundary conditions are set incorrectly, which leads to unintended behavior such as accessing a nonexistent array element or skipping an element that should have been processed.
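The off-by-one error this question describes, and the “less than or equal” to “strictly less than” fix, can be sketched in Python as follows. The function names sum_buggy and sum_fixed are made up for this illustration. In Python the out-of-range access raises an IndexError, whereas in a language like C it could read invalid memory instead.

```python
def sum_buggy(values):
    """Sums the list, but the loop bound is off by one."""
    total = 0
    i = 0
    while i <= len(values):  # bug: <= runs one iteration too many
        total += values[i]
        i += 1
    return total


def sum_fixed(values):
    """Same loop with a strictly-less-than bound."""
    total = 0
    i = 0
    while i < len(values):  # fix: stop before the index reaches len(values)
        total += values[i]
        i += 1
    return total


try:
    sum_buggy([1, 2, 3])
except IndexError as err:
    print("buggy version failed:", err)

print("fixed version:", sum_fixed([1, 2, 3]))
```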
4. A very common method of debugging is to add print statements to our code that display information, such as contents of variables, custom error statements, or return values of functions. What is this type of debugging called?
- Backtracking
- Log review
- Printf debugging (CORRECT)
- Assertion debugging
Excellent! Printf debugging takes its name from the printf() function in C, which is used to print debugging information; the name has stuck. It works by adding print statements to the code to track variables, check control flow, or observe program behavior at various points. It is one of the most widely used debugging techniques across programming languages.
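As a minimal sketch of printf debugging in Python, the function below prints its inputs and an intermediate value while it runs. The function average and the DEBUG message format are assumptions chosen for this example; the messages go to standard error so they stay separate from the program’s normal output.

```python
import sys


def average(values):
    # Temporary debug output: show what the function received.
    print(f"DEBUG average(): values={values!r}", file=sys.stderr)
    total = sum(values)
    # Temporary debug output: show an intermediate result.
    print(f"DEBUG average(): total={total}, count={len(values)}", file=sys.stderr)
    return total / len(values)


print(average([4, 8, 15]))
```

Once the problem is understood, the print statements are usually removed again or replaced with proper logging.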
5. When a process crashes, the operating system may generate a file containing information about the state of the process in memory to help the developer debug the program later. What are these files called?
- Log files
- Core files (CORRECT)
- Metadata file
- Cache file
Right on! Core files (or core dumps) capture a snapshot of the memory and state of a process at the moment it crashes. Developers can use them to analyze what led to the crash, including the state of variables, stack traces, and other runtime details.
6. You are a software developer who has been asked to write a program for a banking company. Your manager suggests you use assert statements in your code. What is the purpose of using assert statements in code?
- To determine the code’s runtime
- To translate the code to a different language
- To catch issues and debug your code during development (CORRECT)
- To rewrite code that produces errors
That’s right! Assert statements help developers catch bugs early while coding: an assert checks whether a given condition is true, and if it is false, the program terminates and reports an error. Because it catches logical inconsistencies where they occur, it helps pinpoint the source of an error.
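A small sketch of assert statements in the spirit of the banking scenario follows. The function withdraw and its rules are hypothetical, invented only to show how an assert stops the program as soon as a condition that should always hold turns out to be false.

```python
def withdraw(balance, amount):
    """Return the new balance after a withdrawal."""
    # These conditions should always hold if the calling code is correct.
    assert amount > 0, "amount must be positive"
    assert amount <= balance, "amount exceeds balance"
    return balance - amount


print(withdraw(100, 30))

try:
    withdraw(100, 200)
except AssertionError as err:
    print("assertion failed:", err)
```

Python skips assert statements when it is run with the -O option, which is one reason asserts suit development-time checks rather than validation of user input.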
7. How does the print statement help programmers debug code?
- It produces the output of the code. (CORRECT)
- It fixes the errors of the code.
- It prints the details of the code.
- It recommends the correct code.
Correct. The print statement sends messages or prints out the values to the output screen. If the code has errors, the command will produce the error statement as output.
8. Visual Studio Code or VS Code, popular among programmers, utilizes breakpoints. What is a breakpoint?
- An open-source product from Microsoft
- A debugging technique (CORRECT)
- An integrated developer environment (IDE)
- A location where the code error occurs
Correct. A breakpoint is a debugging technique: a marker set on a line of code that pauses execution when that line is reached, so you can inspect the state of the program at that point.
9. Imagine that you’re working on a new feature for a web application. As you’re writing the code, you realize that certain sections might produce runtime errors. Which Python mechanism allows you to handle runtime errors without crashing the program?
- Print debugging
- Assert statements
- Try and except blocks (CORRECT)
- If-else conditions
That’s right! Try and except blocks in Python are designed for catching runtime errors so that we can deal with them. Code that might raise an error is placed inside a try block. When an error occurs, execution of the try block stops and the matching except block runs, which lets the application handle the failure. This leads to more robust, fault-tolerant programs.
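The flow described in this answer can be sketched with a short example. The function parse_quantity and its fallback value of 1 are assumptions made for illustration; the point is that the ValueError raised by int() is caught instead of crashing the program.

```python
def parse_quantity(text):
    """Convert user input to an integer quantity, falling back to 1."""
    try:
        # int() raises ValueError if text is not a valid integer.
        return int(text)
    except ValueError:
        # The try block stops at the failing line and execution continues here.
        print(f"invalid quantity {text!r}, using 1 instead")
        return 1


print(parse_quantity("3"))
print(parse_quantity("three"))
```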
10. You’re a web developer for an e-commerce website, and you’re noticing an increase in unexpected behaviors and errors as the site’s user base grows. Why might you choose to implement the Python logging module in your e-commerce website over traditional print() statements?
- The logging module can only display messages on the console, similar to the print() function.
- The logging module allows categorization of log messages based on their severity, such as DEBUG, INFO, WARNING, ERROR, and CRITICAL. (CORRECT)
- The logging module can only capture error messages.
- The logging module requires a third-party library to be installed.
That’s right! Severity levels are one of the key features of the logging module: log messages can be categorized as DEBUG, INFO, WARNING, ERROR, or CRITICAL. These levels let us filter messages and prioritize issues, and they control the amount and relevance of what gets logged, which makes it easier to diagnose and fix problems efficiently.
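A minimal sketch of severity levels with Python’s standard logging module is shown below. The logger name “shop”, the messages, and the order number are hypothetical. With the level set to WARNING, the DEBUG and INFO messages are filtered out while the WARNING and ERROR messages are emitted.

```python
import logging

logging.basicConfig(format="%(levelname)s:%(name)s:%(message)s")

logger = logging.getLogger("shop")  # hypothetical logger name
logger.setLevel(logging.WARNING)    # only WARNING and above pass the filter

logger.debug("cart contents: %s", ["sku-1", "sku-2"])  # filtered out
logger.info("checkout started")                        # filtered out
logger.warning("payment retry %d of %d", 2, 3)         # emitted
logger.error("payment failed for order %s", "A-1001")  # emitted
```

Lowering the level to logging.DEBUG during an investigation brings the detailed messages back without any change to the calls themselves.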
11. A team of software developers is excited to use an AI tool to help with writing and debugging some pieces of their code. Which of the following is true about using AI tools with code?
- AI tools can provide answers to your questions in seconds. (CORRECT)
- The answers provided by AI tools are always correct.
- The AI tools have been used by developers for decades.
- The AI tools have been through many development iterations and are considered as perfect tools.
That’s right! AI tools can provide you with feedback to your question in a matter of seconds.
12. Which of the following can assist in finding out if invalid operations are occurring in a program running on a Windows system?
- Valgrind
- Dr. Memory (CORRECT)
- PDB files
- Segfaults
You got it! Dr. Memory is a tool for debugging memory problems, such as invalid memory accesses and other memory errors, in programs running on Windows and Linux. It monitors the memory operations a program performs and flags memory-management errors that can lead to crashes or strange behavior.
13. After getting acquainted with the program’s code, where might you start to fix a problem?
- Run through tests
- Read the comments
- Locate the affected function (CORRECT)
- Create new tests
Nicely done! Start working on the function that produced the error, and the function(s) that called it.
14. When debugging code, what command can you use to figure out how your program reached the failed state?
- gdb
- backtrace (CORRECT)
- ulimit
- list
Nice job! The backtrace command shows the chain of function calls that led to the point of failure, so you can see how the program reached the failed state.
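The question refers to the backtrace command in a debugger such as gdb. As a rough Python analogue, the standard traceback module can list the chain of function calls that led to an exception. The functions load_config and parse_line and the sample input are invented for this sketch.

```python
import traceback


def load_config():
    return parse_line("timeout=abc")


def parse_line(line):
    key, value = line.split("=")
    return key, int(value)  # fails: "abc" is not a number


try:
    load_config()
except ValueError as err:
    # Each frame records one call on the path to the failure.
    frames = traceback.extract_tb(err.__traceback__)
    call_chain = [frame.name for frame in frames]
    print(" -> ".join(call_chain))
```

Reading the chain from left to right shows which caller led to which callee, ending at the function where the error was raised.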
15. When debugging in Python, what command can you use to run the program until it crashes with an error?
- pdb3
- next
- continue (CORRECT)
- KeyError
Awesome! Running the continue command after starting the pdb3 debugger will execute the program until it finishes or crashes.
PRACTICE QUIZ: HANDLING BIGGER INCIDENTS
1. Which of the following would be effective in resolving a large issue if it happens again in the future?
- Incident controller
- Postmortem (CORRECT)
- Rollbacks
- Load balancers
Keep it up! A postmortem is a detailed document that records the root cause of a problem and the steps taken to resolve it, which is particularly useful for large, complex issues.
2. During peak hours, users have reported issues connecting to a website. The website is hosted on two load-balanced servers in the cloud that are connected to an external SQL database. Logs on both servers show an increase in CPU and RAM usage. What may be the most effective way to resolve this issue with a complex set of servers?
- Use threading in the program
- Cache data in memory
- Automate deployment of additional servers (CORRECT)
- Optimize the database
You got it! Automatically deploying additional servers scales capacity to handle the increased demand during peak hours.
3. It has become increasingly common to use cloud services and virtualization. Which kind of fix, in particular, does virtual cloud deployment speed up and simplify?
- Deployment of new servers (CORRECT)
- Application code fixes
- Log reviewing
- Postmortems
Right on! Virtualization makes deployment of VM servers in the cloud a fast and relatively simple process.
4. What should we include in our postmortem? (Check all that apply)
- Root cause of the issue (CORRECT)
- How we diagnosed the problem (CORRECT)
- How we fixed the problem (CORRECT)
- Who caused the problem
Identifying the root cause helps us prevent the issue from happening again.
Describing how we diagnosed the problem makes it easier to detect in the future.
It is important to tell readers how we fixed the problem this time.
5. In general, what is the goal of a postmortem? (Check all that apply)
- To identify who is at fault
- To allow prevention in the future (CORRECT)
- To allow speedy remediation of similar issues in the future (CORRECT)
- To analyze all system bugs
Getting to the root cause of the issue helps us prevent it from happening again.
Documenting the solution well helps other people, and ourselves, remediate similar issues faster later.
6. A website is producing service errors when loading certain pages. Looking at the logs, one of three web servers isn’t responding correctly to requests. What can you do to restore services, while troubleshooting further?
- Deploy a new web server
- Roll back application changes
- Remove the server from the pool (CORRECT)
- Create standby servers
Great job! Removing the server from the pool will provide full service to users from the remaining web servers.
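The idea of taking a misbehaving server out of rotation can be sketched as a simple filter over the pool. Everything here is hypothetical: the server names, the health results, and the function remove_unhealthy are invented for illustration, and a real load balancer would be configured through its own tooling rather than a script like this.

```python
def remove_unhealthy(pool, is_healthy):
    """Return only the servers that pass the health check."""
    return [server for server in pool if is_healthy(server)]


# Hypothetical server names and health results, for illustration only.
health = {"web1": True, "web2": False, "web3": True}
pool = ["web1", "web2", "web3"]

active = remove_unhealthy(pool, lambda server: health[server])
print(active)
```

Requests then go only to the servers left in the active list while the removed server is investigated.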
7. Which of the following persons is responsible for communicating with customers that are affected by an access issue with a website?
- Communications lead (CORRECT)
- Manager
- Incident controller
- Software engineer
Nice work! The communications lead provides timely updates on the incident and answers questions from users.
8. When writing an effective postmortem of an incident, what should you NOT include?
- What caused the issue
- Who caused the issue (CORRECT)
- What the impact was
- The short-term remediation
Nailed it! A postmortem should focus on the circumstances and processes that failed, not on blaming anyone. It should look at what went wrong and how it can be prevented going forward.
FIXING ERRORS IN PYTHON SCRIPTS
1. How can you use pip3 to address the ImportError issue in your Python script, particularly when a module like matplotlib is missing?
- Reinstall the Python script.
- Update the Python interpreter.
- Install the missing module using pip3. (CORRECT)
- Execute the script with Python 2.
Correct
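Before running a script, one way to see which modules are missing is to ask Python whether it can locate them. The sketch below uses importlib.util.find_spec from the standard library; the helper name missing_modules is an assumption made for this example. Any module it reports can then be installed from the command line, for instance with pip3 install matplotlib.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# json ships with Python; matplotlib is only found if it has been installed.
print(missing_modules(["json", "matplotlib"]))
```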
2. What type of error occurred when attempting to run the Python script located in the /usr/bin directory, as indicated by the provided output?
- ImportError (CORRECT)
- SyntaxError
- IndexError
- ValueError
Correct
3. How is the matplotlib Python library beneficial to programmers?
- It offers a wide range of data visualization tools and features. (CORRECT)
- It simplifies the process of running Python scripts concurrently.
- It enables the creation of web applications with Python.
- It provides a Python code editor for writing and debugging scripts.
Correct
4. How did you resolve the MissingColumnError?
- You added the missing column name to the data.csv file. (CORRECT)
- You used the “ls” command to check for errors.
- You rewrote the Python script from scratch.
- You used the “chmod 777” command.
Correct
5. You are working to debug a recurring problem in a Python program. Which of the following approaches do you think would be the most effective way to solve it?
- Increase the system’s memory.
- Restart the system.
- Upgrade the system’s software.
- Identify the sequence of events leading to the problem. (CORRECT)
Correct
6. In anticipation of encountering future errors or unexpected behavior in your Python scripts, which proactive debugging techniques and best practices would you incorporate into your development process? Select all that apply.
- Establishing a systematic approach to isolate and fix errors as they arise.
- Regularly reviewing error messages and stack traces from previous runs. (CORRECT)
- Reinstalling Python libraries to prevent potential errors.
- Using version control systems to track code changes and facilitate error identification. (CORRECT)
- Implementing comprehensive unit testing to catch errors early. (CORRECT)
Correct
7. What indication did you get in the lab when you successfully completed debugging the infrastructure script?
- The infrastructure script prompts you to press a key to continue.
- The infrastructure script runs without displaying any errors. (CORRECT)
- The infrastructure script writes a message to its log file.
- The infrastructure script displays a message stating the program ran successfully.
Correct
8. What is the third step in the process of debugging, following the identification of a bug’s cause?
- Writing new code to fix the bug
- Reporting the bug to a supervisor
- Reproducing the bug (CORRECT)
- Deleting the entire codebase
Correct
9. After successfully fixing the code and resolving the errors in the provided content, what is the recommended next step to ensure the continued functionality and reliability of the script?
- Delete the script to start fresh with a clean slate.
- Test the script thoroughly to confirm that the errors are resolved. (CORRECT)
- Immediately share the fixed code with colleagues.
- Reinstall the Python interpreter for optimal performance.
Correct
10. What is the cause of the MissingColumnError in the lab?
- The infrastructure script references a column that doesn’t exist.
- The column name for the column with the company information is missing in the CSV file. (CORRECT)
- The company information is missing in the CSV file.
- The infrastructure script does not have the necessary permissions for accessing the data.csv file.
Correct
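A check for missing column names in a CSV header might look like the sketch below. This is not the lab’s infrastructure script, whose source is not available; the helper missing_columns, the sample data, and the column names are all assumptions chosen to illustrate the idea.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the file has no rows at all
    return [name for name in required if name not in header]


# Hypothetical file contents: the header lacks a "company" column.
sample = "name,amount\nAcme,10\n"
missing = missing_columns(sample, ["name", "company", "amount"])
print("missing columns:", missing)
```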
11. What is pip3?
- A numeral mathematics extension of Matplotlib
- A command to search for missing information
- A Python package installer (CORRECT)
- A plotting library for the Python programming language
Correct
12. In the given scenario where a Python script located in the /usr/bin directory produces an ImportError due to a missing module (i.e., matplotlib), which of the following actions should you take to address the issue?
- Modify the script’s code to bypass the missing module.
- Reinstall the Python interpreter.
- Install the missing module using pip3. (CORRECT)
- Delete the script and recreate it from scratch.
Correct
13. What is the purpose of the following command: pip3 install matplotlib in the context of resolving the issues with the Python script and matplotlib?
- It successfully installs the matplotlib library, enabling visualization of data. (CORRECT)
- It removes the matplotlib library from the system.
- It updates the Python interpreter to the latest version.
- It installs a Python code editor for script development.
Correct
14. Why is effective debugging an essential skill for Python developers, and how does it contribute to the overall success of a project?
- Debugging allows developers to showcase their coding skills.
- Debugging helps identify and rectify errors, leading to improved script functionality and reliability. (CORRECT)
- Debugging helps developers gain a deeper understanding of Python syntax.
- Effective debugging ensures that scripts run without any issues.
Correct
15. In the lab, which step(s) must you take to fix an ImportError?
- Put the missing package in the correct folder.
- Change the permissions for the package.
- Install pip3. (CORRECT)
- Install the missing package. (CORRECT)
Correct
16. In the lab, what caused the NoFileError message? Select all that apply.
- Renaming the data.bak file to data.csv. (CORRECT)
- Changing the permissions on the data.csv file.
- Moving the data.csv file to the working folder.
- Checking the working folder for the data.csv file. (CORRECT)
Correct
17. How did the chmod 777 command contribute to resolving the issues with the Python script?
- The command changed file permissions to make the data.csv file writable. (CORRECT)
- The command uninstalled the Matplotlib library.
- The command modified the script’s code to fix the issue.
- The command was used to install a missing Python library.
Correct
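What chmod 777 does to a file’s permission bits can be observed from Python with os.chmod and os.stat. This sketch assumes a POSIX system such as Linux and uses a throwaway temporary file rather than the lab’s data.csv.

```python
import os
import stat
import tempfile

# Create a temporary file to experiment on.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

os.chmod(path, 0o777)  # same effect as: chmod 777 <path>
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # read, write, and execute for owner, group, and others

os.remove(path)
```

Mode 777 grants every user full access to the file, so a narrower mode such as 644 is often enough when the goal is only to make the file readable or writable.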
18. In the lab, you ran the infrastructure script and received a NoFileError message about the file named data.csv. What caused this error?
- The infrastructure program encountered a permission problem when opening a file.
- The infrastructure program must be in the same folder as the file it called.
- The infrastructure program has a typo in the name of the file it calls.
- The infrastructure program called a file that can’t be found. (CORRECT)
Correct
19. Which sequence of actions effectively addressed the issue, starting from identifying the file extension problem to ultimately resolving the MissingColumnError in the Python script?
- Adding the missing column name > Renaming data.bak to data.csv > Checking the data.csv file
- Granting permissions to data.csv > Checking the data.csv file > Adding the missing column name
- Renaming data.bak to data.csv > Adding the missing column name > Granting permissions to data.csv (CORRECT)
- Checking the data.csv file > Renaming data.bak to data.csv > Granting permissions to “data.csv”
Correct
20. What is the function of pip3 in Python?
- Creates graphical user interfaces
- Runs Python scripts
- Acts as a plots library for Python
- Downloads and configures new Python modules (CORRECT)
Correct
21. If you did not write the Python program and don’t have access to the source code, what should you examine to determine where the program is running and any errors that are occurring?
- The results of the grantaccess command
- The results of the pip3 command
- You should examine the program’s surrounding environment. (CORRECT)
- Figure out where the program is executing and identify any errors.
Correct
22. The lab presents a scenario where you’re tasked with troubleshooting a Python script named infrastructure that is generating errors. You didn’t create the script and don’t have access to its source code. What steps in the lab enable you to troubleshoot this program? Select all that apply.
- Run the infrastructure script to determine whether it generates any errors. (CORRECT)
- Obtain the source code and debug it.
- Research any displayed error messages to identify their cause. (CORRECT)
- Take steps to resolve any errors displayed. (CORRECT)
Correct
23. In the context of code debugging and error resolution, what is the significance of conducting comprehensive testing after fixing the code, and how does it contribute to the overall quality of the script and project success?
- Testing allows developers to showcase their coding skills.
- Testing ensures that the code adheres to the latest programming standards.
- Testing primarily focuses on optimizing the code for speed.
- Comprehensive testing verifies that the code functions as expected after fixes, enhancing script reliability and project success. (CORRECT)
Correct
24. What is one of the primary purposes of matplotlib?
- It focuses on numerical mathematics and extends the Python language.
- It is primarily used for visualizing 2D plots of arrays and data. (CORRECT)
- It provides an object-oriented API for creating graphical user interfaces.
- It serves as a Python code editor for writing and debugging scripts.
Correct
25. Having completed the lab and worked through the process, which of the following would you want to check when debugging or troubleshooting later iterations of the same software? Select all that apply.
- Current bug reports (CORRECT)
- Redesigns of the user interface
- Future software upgrades
- More users (CORRECT)
Correct
26. Why is it essential for developers to isolate specific issues or errors in software when troubleshooting? Select all that apply.
- Isolating issues simplifies the code and eliminates unnecessary complexity. (CORRECT)
- Isolating issues reduces the need for comprehensive testing.
- Effective isolation enables targeted problem-solving and prevents broader system disruptions. (CORRECT)
- Isolating issues allows developers to work on multiple problems simultaneously.
Correct
CONCLUSION – Crashing Programs
This module covered crashes in both operating systems and applications, and walked through troubleshooting them with the tools at your disposal. You can now use those tools and the relevant log files to track down the most likely cause of a crash. You also looked at why code crashes and at systematic approaches to eliminating those crashes. Working with unhandled errors and exceptions has strengthened your debugging skills and your ability to detect and fix errors.
The module also extended your knowledge to managing incidents at a larger scale, using the example of an e-commerce site with a high error rate. You saw how communication, documentation, and postmortems fit into incident management, and how the lessons they capture help you avoid similar failures in the future, contributing to more reliable software and a culture of continuous improvement.