Saturday, 09 November 2013

How to fix bugs, step by step

Step 1: Enter the bug in your case tracking system

 At the end of all these steps is a phase where you are tearing your hair out and still haven't gone home. By then you will realize one of two things:
  1. you've forgotten some crucial detail about the bug, such as what it was, or
  2. you could assign this to someone who knows more than you.
 A case tracking system will prevent you from losing track of both your current task and any that have been put on the backburner. And if you're part of a team it'll also make it easy to delegate tasks to others and keep all discussion related to a bug in one place.

 You should record these three things in each bug report:
  1. What the user was doing
  2. What they were expecting
  3. What happened instead
 These three things tell you how to recreate the bug. If you can't recreate the bug on demand, your chances of fixing it are close to nil.
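 As a rough sketch, the three fields could be captured in a small record like the one below. The class name, field names and the sample report are all made up for illustration; any real tracking system will have its own schema.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    """Hypothetical record holding the three facts every report needs."""
    user_action: str  # 1. what the user was doing
    expected: str     # 2. what they were expecting
    actual: str       # 3. what happened instead

    def is_complete(self) -> bool:
        # A report with an empty field is hard to reproduce from.
        fields = (self.user_action, self.expected, self.actual)
        return all(text.strip() for text in fields)


# Invented example report.
report = BugReport(
    user_action="Clicked 'Submit Order' with two items in the cart",
    expected="Order confirmation page within a few seconds",
    actual="Page hung until the request timed out",
)
```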

Step 2: Google the error message

 If there is an error message then you're in luck. It might be descriptive enough to tell you exactly what went wrong, or else give you a search query to find the solution on the web somewhere. No luck yet? Then continue to the next step.

Step 3: Identify the immediate line of code where the bug occurs

 If it's a crashing bug then try running the program in the IDE with the debugger active and see what line of code it stops on. This isn't necessarily the line that contains the bug (see the next step), but it will tell you more about the nature of it.

 If you can't attach a debugger to the running process, the next technique is to use "tracer bullets": print() statements sprinkled around the code that tell you how far the program's execution has got. Print to the console (e.g. Console.WriteLine("Reached stage 1") or printf("Reached stage 1")) or log to a file. Start coarse (one print per method or major operation), then refine until you've found the single operation on which the crash or malfunction occurs.
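 Here is a minimal sketch of tracer bullets in Python, using the standard logging module in place of bare print() calls. The functions and the order data are invented for the example; the point is that the last "Reached stage" line in the output brackets the failing operation.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(funcName)s: %(message)s")
log = logging.getLogger("tracer")


def load_order(order_id):
    log.debug("Reached stage 1: loading order %s", order_id)
    # Invented data: the zero quantity is the hidden problem.
    return {"id": order_id, "quantities": [3, 0, 5]}


def price_per_unit(order, total_price=30):
    log.debug("Reached stage 2: computing unit prices")
    prices = []
    for qty in order["quantities"]:
        # Finer-grained tracer added after stage 2 was the last one printed.
        log.debug("Reached stage 2a: qty=%s", qty)
        prices.append(total_price / qty)
    log.debug("Reached stage 3: done")
    return prices


def main():
    try:
        price_per_unit(load_order(42))
    except ZeroDivisionError:
        log.debug("Crashed; the last tracer printed was stage 2a with qty=0")


if __name__ == "__main__":
    main()
```

 Stage 3 never prints, and the last stage 2a line shows the value that was being processed when things went wrong.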

Step 4: Identify the line of code where the bug actually occurs

 Once you know the immediate line, you can step backwards to find where the actual bug occurs. Only sometimes will you discover that they're both one and the same line of code. Just as often, you'll discover that the crashing line is innocent and that it has been passed bad data from earlier in the stack.

 If you were following program execution in a debugger then look at the Stack Trace to find out what the history of the operation was. If it's deep within a function called by another function called by another function, then the stack trace will list each function going all the way back to the origin of program execution (your main()). If the malfunction happened somewhere within the vendor's framework or a third-party library, then for the moment assume the bug is somewhere in your program--for it is far more likely. Look down the stack for the most recent line of code that you wrote, and go there.
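 If you'd rather do this in code than in a debugger, the sketch below walks a Python traceback from the innermost call outwards and returns the first frame that is not in the standard library or an installed package. Treating "not in a library directory" as "code that you wrote" is an assumption of this sketch, and parse_config is a made-up function standing in for your own code.

```python
import json
import os
import sysconfig
import traceback

# Directories holding the standard library and installed third-party packages.
_LIBRARY_DIRS = tuple(
    os.path.realpath(path)
    for key, path in sysconfig.get_paths().items()
    if key in ("stdlib", "platstdlib", "purelib", "platlib")
)


def most_recent_own_frame(exc):
    """Return the innermost traceback frame that is not library code."""
    frames = traceback.extract_tb(exc.__traceback__)
    for frame in reversed(frames):  # innermost (most recent) call first
        if not os.path.realpath(frame.filename).startswith(_LIBRARY_DIRS):
            return frame
    return None


def parse_config(text):
    # Hypothetical function of ours; the exception is raised inside json.
    return json.loads(text)


def find_frame_for_bad_config():
    try:
        parse_config("{not valid json")
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        return most_recent_own_frame(exc)
    return None
```

 The exception is raised deep inside the json module, but the frame worth looking at first is parse_config, where the bad text was handed over.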

Step 5: Identify the species of bug

 A bug can manifest in many bright and colorful forms, but most are actually all members of a short list of species. Compare your problem to the usual suspects below.
  1. Off-By-One
    You began a for-loop at 1 instead of 0, or vice-versa. Or you thought .Count or .Length was the same as the index of the last element. Check the language documentation to see if arrays are 0-based or 1-based. This bug sometimes manifests as an "Index out of range" exception, too
  2. Race condition
    Your process or thread is expecting a result moments before it's actually ready. Look for the use of "Sleep" statements that pause a program or thread while it waits for something else to get done. Or perhaps it doesn't sleep because on your overpowered and underutilized development machine every query was satisfied in the milliseconds before your next statement executed. In the real world things get delayed and your code needs a way to wait properly for things it depends on to get done. Look into using mutexes, semaphores, or even a completely different way of handling threads and processes
  3. Configuration or constants are wrong
    Look at configuration files and any constants you have defined. I once spent a 16-hour day in hell trying to figure out why a web site's shopping cart froze at the "Submit Order" stage. It was traced back to a bad value in an /etc/hosts file that prevented the application from resolving the IP address of the mail server, and the app was churning through to a timeout on the code that was trying to email a receipt to the customer
  4. Unexpected null
    Betcha you got "Object reference not set to an instance of an object" a few times, right? Make sure you're checking for null references, especially if you're chaining property references together to reach a deeply nested method. Also check for "DBNull" in frameworks that treat a database NULL as a special type
  5. Bad input
    Are you validating input? Did you just try to perform arithmetic when the user gave you a character value?
  6. Assignments instead of comparisons
    Especially in C-family languages, make sure you didn't do = when you meant to do ==
  7. Wrong precision
    Using integers instead of decimals, using floats for money values, not having a big-enough integer (are you trying to store values bigger than 2,147,483,647 in a 32-bit integer?). Can also be subtle bugs that occur because your decimal values are getting rounded and a deviation is growing over time (talk to Edward Lorenz about that one)
  8. Buffer overflow & Index Out-of-range
    Historically one of the most common causes of security holes. Are you allocating memory and then trying to insert data larger than the space you've allocated? Likewise, are you trying to address an element that's past the end of an array?
  9. Programmer can't do math
    You're using a formula that's incorrect. Also check to make sure you didn't use div instead of mod, that you know how to convert a fraction to a decimal, etc.
  10. Concatenating numbers and strings
    You are expecting to concatenate two strings, but one of the values is a number and the interpreter tries to do arithmetic. Try explicitly casting every value to a string
  11. 33 chars in a varchar(32)
    On SQL INSERT operations, check the data you're inserting against the types of each column. Some databases throw exceptions (like they're supposed to), and some just truncate and pretend nothing is wrong (like MySQL when strict SQL mode is off). A bug that I fixed recently was the result of switching from INSERT statements prepared by concatenating strings to parameterized commands: the programmer forgot to remove the quoting on a string value, and the two extra characters put it over the column size limit. It took ages to spot that bug because we had become blind to those two little quote marks
  12. Invalid state
    Examples: you tried to perform a query on a closed connection, or you tried to insert a row before its foreign-key dependencies had been inserted
  13. Coincidences in the development environment didn't carry over to production
    For example: in the contrived data of the development database there was a 1:1 correlation between address ID and order ID and you coded to that assumption, but now the program is in production there are a zillion orders shipping to the same address ID, giving you 1:many matches
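 A few of the species above are easy to show in miniature. The snippets below are a sketch in Python with invented values, and Python's behaviour differs from some of the languages mentioned above (for instance, it raises a TypeError rather than doing arithmetic when you add a string and a number).

```python
from decimal import Decimal

# 1. Off-by-one: the last valid index is len(items) - 1, not len(items).
items = ["a", "b", "c"]
last_item = items[len(items) - 1]  # items[len(items)] would raise IndexError

# 4. Unexpected null: guard each link of a chained lookup.
order = {"customer": None}
customer = order.get("customer")
city = customer.get("city") if customer is not None else None

# 7. Wrong precision: binary floats cannot represent 0.1 exactly,
#    so use a decimal type for money.
float_total = 0.1 + 0.2                          # not exactly 0.3
money_total = Decimal("0.10") + Decimal("0.20")  # exactly Decimal("0.30")

# 10. Concatenating numbers and strings: convert explicitly.
quantity = 3
label = "Items: " + str(quantity)
```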
 If your bug doesn't resemble any of the above, or you aren't able to isolate it to a line of code, you'll have more work to do. Continue to the next step.

Step 6: Use the process of elimination

 If you can't isolate the bug to any particular line of code, either begin to disable blocks of code (comment them out) until the crash stops happening, or use a unit-testing framework to isolate methods and feed them the same parameters they'd see when you recreate the bug.
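 As a sketch of the unit-testing route, the test case below calls one suspect function directly with the input from the bug report, using Python's built-in unittest module. The function average_order_value and its inputs are hypothetical.

```python
import unittest


def average_order_value(totals):
    """Hypothetical method under suspicion."""
    return sum(totals) / len(totals)


class ReproduceBug(unittest.TestCase):
    def test_customer_with_no_orders(self):
        # Same parameters the method saw when the bug was recreated.
        with self.assertRaises(ZeroDivisionError):
            average_order_value([])

    def test_normal_input_still_works(self):
        self.assertEqual(average_order_value([10, 20]), 15)


def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReproduceBug)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

 Once the first test pins the crash down, you can change it to assert the behaviour you actually want and keep it around as a regression test.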

 If the bug is manifesting in a system of components, then begin disabling those components one by one, paring the system down to minimal functionality until it begins working again. Now start bringing the components back online, one by one, until the bug manifests itself again. You might now be able to go back to Step 3. Otherwise, it's on to the hard stuff.
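 The bring-them-back-one-by-one loop can be sketched as follows. It assumes the system works with every component disabled, and system_works is a hypothetical callback that you would replace with whatever check reproduces the bug; the component names are invented.

```python
def find_culprit(components, system_works):
    """Re-enable components one at a time; return the first that breaks things."""
    enabled = []
    for component in components:
        enabled.append(component)
        if not system_works(enabled):
            return component
    return None  # everything is back online and the bug did not reappear


# Invented example: the system breaks whenever the mailer is enabled.
components = ["cache", "mailer", "search"]
culprit = find_culprit(components, lambda enabled: "mailer" not in enabled)
```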

Step 7: Log everything and analyze the logs

 Go through each module or component and add more logging statements. Begin slowly, one module at a time, and analyze the logs until the malfunction occurs again. If the logs don't tell you where or what, then proceed to add more logging statements to more modules. 
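 With Python's standard logging module, turning up the detail for one module at a time might look like the sketch below. The module names are made up, and it assumes each module creates its logger with logging.getLogger under a dotted name.

```python
import logging


def enable_debug_logging(module_name, log_path=None):
    """Turn on DEBUG output for one module's logger, leaving the others alone."""
    logger = logging.getLogger(module_name)
    logger.setLevel(logging.DEBUG)
    # Log to a file if a path is given, otherwise to the console.
    handler = logging.FileHandler(log_path) if log_path else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Start with the module you suspect most (hypothetical name).
checkout_log = enable_debug_logging("shop.checkout")
checkout_log.debug("cart has %d items", 2)
```

 If the logs from that module don't explain the malfunction, call enable_debug_logging for the next module and recreate the bug again.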

 Your goal is to get back to Step 3 with a better idea of where the malfunction is occurring. This is also the point where you should consider third-party tools to help you log better.

Step 8: Eliminate the hardware or platform as a cause

 Replace RAM, replace hard drives, replace entire servers and workstations. Install the service pack, or uninstall the service pack. If the bug goes away then it was either the hardware, operating system or runtime. You might even try this step earlier in the process--per your judgement--as hardware failures frequently masquerade as software dysfunction.

 If your program does network I/O then check switches, replace cables, and try the software on a different network. 

 For shits and giggles, try plugging the hardware into a different power outlet, particularly one on a different breaker or UPS. Sound crazy? Maybe when you're desperate.

 Do you get the same bug no matter where you run it? Then it's in the software and the odds are that it's still in your code.

Step 9: Look at the correlations

  1. Does the bug always happen at the same time of day? Check scheduled tasks/cron-jobs that happen at that time
  2. Does it always coincide with something else, no matter how absurd a connection might seem between the two? Pay attention to everything, and I mean everything: does the bug occur when an air-conditioner flips on, for example? Then it might be a power surge doing something funny in the hardware
  3. Do the users or machines it affects all have something in common, even if it's a parameter that you otherwise wouldn't think affects the software, like where they're located? (This is how the legendary "500-mile email" bug was discovered)
  4. Does the bug occur when another process on the machine eats up a lot of memory or cycles? (I once found a problem with SQL Server and an annoying "no trusted connection" exception this way)
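 The time-of-day question is easy to check if you have timestamps for each occurrence. The sketch below counts failures per hour; the timestamps are invented, and it assumes they are in ISO 8601 format.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Count how many failures fell in each hour of the day."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Invented failure times pulled from a hypothetical log.
sample = [
    "2013-11-04T02:00:13",
    "2013-11-05T02:01:47",
    "2013-11-06T14:30:00",
    "2013-11-07T02:00:05",
]
counts = failures_by_hour(sample)
peak_hour, peak_count = counts.most_common(1)[0]
```

 A cluster around one hour is a hint to go and look at what else is scheduled at that time.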

Step 10: Bring in outside help

 Your final step will be to reach out to people who know more than you. By now you should have a vague idea of where the bug is occurring--like in your DBMS, or your hardware, or maybe even the compiler. Try posing a question on a relevant support forum before contacting the vendors of these components and paying for a service call.

 Operating systems, compilers, frameworks and libraries all have bugs and your software could be innocent, but your chances of getting the vendor to pay attention to you are slim if you can't provide steps to reproduce the problem. A friendly vendor will try to work with you, but bigger or understaffed vendors will ignore your case if you don't make it easy for them. Unfortunately that will mean a lot of work to submit a quality report.
