How to Design for and Improve Hardware Reliability

07/06/2021 Seektronics


A system is generally composed of multiple subsystems, which are in turn composed of smaller subsystems, down to individual resistors, capacitors, inductors, transistors, integrated circuits, mechanical parts, and other small components. A failure in any one of these can cause the whole system to fail. Hardware reliability design should therefore start from reliable components and then address both the reliability of each individual control unit and the reliability of the control system as a whole.

 

1. Factors affecting hardware reliability

 

(1) Component failure. Component failures fall into three kinds: defects in the component itself, such as cracked silicon or leaking packages; failures of components and assemblies that are accelerated by processing or by changes in environmental conditions; and process problems, such as poor soldering or inadequate screening.

 

(2) Improper design. In computer control systems, many component failures are caused not by the component itself but by unreasonable system design or improper use of the component.

 

Using components and integrated circuits correctly during design is an important factor in improving hardware reliability. The following aspects deserve attention:

 

(1) Electrical performance: the electrical performance of a component refers to the voltage, current, capacitance, power, and similar ratings that it can withstand. These ratings must be respected in use, and a component should never be operated beyond its limits.

 

(2) Environmental conditions: the working environment of a computer control system is sometimes quite harsh. Many systems pass laboratory tests but fail after being installed on site and run for a long time. The causes are many, including the effects of temperature, interference, power supply quality, and on-site air conditions on the hardware. The influence of environmental conditions on hardware parameters should therefore be considered during design, and components and equipment should undergo aging (burn-in) tests.

 

(3) Assembly process: the assembly process directly affects the reliability of the hardware system. Faults caused by process problems are difficult to locate and eliminate; a single cold or open solder joint can make the whole system malfunction intermittently during operation. In addition, printed circuit board design should consider component layout and the routing direction and ordering of leads.

 

 

2. General methods to improve hardware reliability

 

In the overall design of a computer control system, improving the reliability of the system hardware is key to the whole design. The hardware design usually needs to adopt the following reliability measures:

 

(1) Circuit design. According to statistics, about 45% of the factors affecting the reliability of a computer control system come from the system design. To ensure reliability, the circuitry should be designed for the worst case.

 

The characteristics of electronic components are never constant values; they always vary within a certain range around their rated (typical) parameters. The supply voltage also fluctuates within a range. The worst-case design approach considers the tolerances of all components and, for each specified characteristic of the circuit, takes the most adverse combination of values. If the circuit works properly with that set of parameters, it will also work reliably with any other component values within the tolerance range.
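
As an illustration of worst-case design, the sketch below evaluates a simple resistive voltage divider at every tolerance corner. The circuit, the component values, and the function names are assumptions chosen for the example and are not taken from any particular design. Checking only the corners is sufficient here because the divider output changes monotonically with each parameter; a circuit without that property would need a finer search.

```python
from itertools import product


def divider_output(v_in, r_top, r_bottom):
    """Output voltage of an unloaded resistive divider."""
    return v_in * r_bottom / (r_top + r_bottom)


def worst_case_range(nominals, tolerances, func):
    """Evaluate func at every tolerance corner and return (min, max)."""
    corners = [
        (nominal * (1 - tol), nominal * (1 + tol))
        for nominal, tol in zip(nominals, tolerances)
    ]
    results = [func(*combo) for combo in product(*corners)]
    return min(results), max(results)


# Hypothetical values: 5 V supply +/-5 %, two 10 kOhm resistors +/-1 %.
low, high = worst_case_range(
    nominals=[5.0, 10e3, 10e3],
    tolerances=[0.05, 0.01, 0.01],
    func=divider_output,
)
print(f"divider output stays between {low:.3f} V and {high:.3f} V")
```

If the circuit that follows the divider accepts the whole range from `low` to `high`, the design passes this worst-case check.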

 

When designing the circuits of the application system, measures should be chosen according to how each component tends to fail and where it is used. Parts that tend to fail as short circuits should be duplicated in series, and parts that tend to fail as open circuits should be duplicated in parallel.
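
The series and parallel duplication rule can be checked with a little probability arithmetic. The sketch below assumes that the two duplicated parts fail independently and that each part either works, fails short, or fails open; the failure probabilities are invented for illustration.

```python
def series_pair(q_short, q_open):
    """Failure probabilities (short, open) of two identical parts in series."""
    fail_short = q_short ** 2              # both parts must short
    fail_open = 1 - (1 - q_open) ** 2      # either part opening breaks the path
    return fail_short, fail_open


def parallel_pair(q_short, q_open):
    """Failure probabilities (short, open) of two identical parts in parallel."""
    fail_short = 1 - (1 - q_short) ** 2    # either part shorting shorts the pair
    fail_open = q_open ** 2                # both parts must open
    return fail_short, fail_open


# Hypothetical part that mostly fails short: 1 % short, 0.1 % open.
q_short, q_open = 0.01, 0.001
single_total = q_short + q_open
series_total = sum(series_pair(q_short, q_open))
parallel_total = sum(parallel_pair(q_short, q_open))
print(f"single {single_total:.6f}, series {series_total:.6f}, parallel {parallel_total:.6f}")
```

For this short-prone part the series pair fails far less often than a single part, while the parallel pair fails more often, which is why the duplication style has to match the dominant failure mode.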

 

(2) Component selection. Once the component parameters have been determined, the component type must be chosen, which depends mainly on the tolerance range the circuit allows. Because of manufacturing limitations, some component parameters, such as capacitance, may have a wide tolerance range. In addition, the rated operating conditions of a component or device cover several aspects (e.g. current, voltage, frequency, mechanical parameters, and ambient temperature). The design should leave margin on these parameters and keep the component's actual operating temperature as close as possible to its design operating temperature.

 

(3) Structural design. Structural reliability design is the final stage of hardware reliability design. Attention should first be paid to how components and parts are mounted, and then to the conditions of the control system's working environment (e.g. ventilation, dehumidification, and dust protection).

 

(4) Noise suppression. Noise in analog circuits directly affects system accuracy, and noise in digital circuits can cause malfunctions, so noise suppression and shielding measures must be part of the engineering design. For analog circuits, low-pass filters can be added on the power supply side to suppress interference introduced by the supply. For digital systems, filters and a proper grounding system are usually used. The placement of components and the routing of signal lines should also be considered when laying out the overall structure. Electromagnetic and electric-field interference can be isolated with electromagnetic and electrostatic shielding, and grounding and decoupling capacitors can further reduce the impact of noise.

 

(5) Redundant design. Hardware redundancy can be applied at the component, subsystem, or system level, and it inevitably adds hardware and cost, so its pros and cons should be weighed carefully. In computer control systems, control unit redundancy and control system redundancy are the main ways of improving hardware reliability.

 

 

3. Unit reliability design

 

The control and interface unit is a functional module that can perform certain measurement and control functions independently. Its reliability design mainly covers redundancy of the microprocessor system, suppression of interference on the input and output channels, suppression of power system interference, and monitoring of the operating status of the control unit.

 

(1) I/O channel interference suppression

 

Normal-mode interference on an analog input channel usually has a higher frequency than the measured signal, so a filter network can be used to filter the analog input. Signal transmission lines should be shielded with suitable metal shielding so that they are effectively isolated from external electromagnetic fields. In a system that contains both analog and digital circuits, the digital ground and the analog ground should be kept separate and connected at only a single point to prevent mutual interference. I/O channels should generally be electrically isolated with optocouplers, which both avoids ground loops and effectively suppresses noise. In addition, some overvoltage protection circuitry should be provided on the input and output channels.
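
To make the filtering idea concrete, the sketch below computes the cutoff frequency of a first-order RC low-pass network and applies the discrete-time equivalent of that filter to sampled data. The article describes a hardware filter network; the software filter is a stand-in used only to show the effect, and the component values, sample rate, and noise pattern are assumptions.

```python
import math


def rc_cutoff_hz(r_ohm, c_farad):
    """-3 dB cutoff frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def low_pass(samples, cutoff_hz, sample_rate_hz):
    """Discrete first-order low-pass filter (exponential smoothing)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    filtered = []
    y = samples[0]
    for x in samples:
        y += alpha * (x - y)  # move a fraction alpha toward the new sample
        filtered.append(y)
    return filtered


# Hypothetical input: a steady 1.0 V signal with +/-0.5 V high-frequency noise.
noisy = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(200)]
clean = low_pass(noisy, cutoff_hz=10.0, sample_rate_hz=1000.0)
print(f"R = 10 kOhm, C = 1 uF gives a cutoff near {rc_cutoff_hz(10e3, 1e-6):.1f} Hz")
```

With the cutoff placed well below the interference frequency and above the bandwidth of the measured signal, the residual ripple in `clean` is a small fraction of the original noise.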

 

(2) Suppression of power system interference

 

When there are many high-power devices on the same power supply network, a three-phase isolation transformer can be added between the control unit and the power supply to keep grid interference out of the control system. Power filters can be added where the power line enters the equipment to prevent mutual interference between the system and other electronic equipment. Small power filters should also be installed on the individual printed circuit boards inside the equipment to prevent interference between boards.

 

Switching power supplies tolerate frequent voltage and frequency fluctuations well and also isolate conducted interference entering through the power line, so they can be used where appropriate. If necessary, the system's input and output channels and other equipment can be given independent power supplies, so that power is supplied in separate groups. In addition, the DC power and ground lines on the logic boards should be routed with care.

 

(3) Monitoring the running state of the control unit

 

A watchdog timer (WDT) can be used to monitor the running status of the control unit. The output of the WDT is connected directly to an interrupt request input of the CPU or to the reset input of the control unit, so that every overflow pulse produced when the WDT times out interrupts or resets the CPU. The WDT is under CPU control: the CPU can reload its time constant or refresh it.

 

As long as the program runs normally, it refreshes the timer before the timeout expires, so the timer restarts its count and no interrupt or reset occurs. If the program executes incorrectly, runs away, or crashes, the refresh stops, the watchdog timer overflows, and the resulting pulse triggers an interrupt or reset, so that the control unit restarts or enters an interrupt service routine for error handling.
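
The behavior described above can be modeled in a few lines. This is a software simulation only, with hypothetical class and method names; a real WDT is a hardware counter whose overflow output is wired to the CPU's interrupt or reset input.

```python
class WatchdogTimer:
    """Simulated watchdog: calls on_timeout if not refreshed in time."""

    def __init__(self, timeout_ticks, on_timeout):
        self.timeout_ticks = timeout_ticks
        self.on_timeout = on_timeout
        self.counter = 0

    def refresh(self):
        """Called by a healthy program to restart the count."""
        self.counter = 0

    def tick(self):
        """Advance the hardware counter by one clock period."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.counter = 0
            self.on_timeout()  # overflow pulse: interrupt or reset


resets = []
wdt = WatchdogTimer(timeout_ticks=3, on_timeout=lambda: resets.append("reset"))

# Healthy program: refreshes the watchdog on every loop iteration.
for _ in range(5):
    wdt.refresh()
    wdt.tick()
resets_while_healthy = len(resets)

# Hung program: the refresh calls stop, so the counter overflows.
for _ in range(3):
    wdt.tick()
resets_after_hang = len(resets)
```

The timeout has to be longer than the slowest legitimate path between two refreshes, otherwise a healthy program would be reset as well.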

 

(4) Power-down protection for control units

 

Power-down protection is an effective way to deal with momentary power failures or sudden voltage drops on the power grid. At the level of the computer measurement and control system it can take the form of an uninterruptible power supply (UPS); at the level of the control unit it takes the form of a carefully designed power-down protection circuit. A hardware circuit detects the power-down condition and applies the signal to an external interrupt input of the control unit's CPU. Software assigns the power-down interrupt a high priority so that the CPU can react to the power failure in time. The power-down interrupt routine first saves the context and the important status parameters at that moment. When the power supply returns to normal, the CPU resets, restores the saved context, and continues the unfinished work.
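
The save and restore sequence can be sketched as follows. A file stands in for battery-backed RAM or other non-volatile storage, and the class, method, and field names are hypothetical; on a real control unit the save would run inside the power-down interrupt routine and must finish before the supply voltage collapses.

```python
import json
import os
import tempfile


class ControlUnit:
    """Simulated control unit that survives a power failure."""

    def __init__(self, nv_path):
        self.nv_path = nv_path  # stand-in for non-volatile memory
        self.state = {"setpoint": 0.0, "step": 0}

    def on_power_fail_interrupt(self):
        """High-priority handler: save the important state parameters."""
        tmp_path = self.nv_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.nv_path)  # atomic swap avoids a half-written record

    def on_power_up(self):
        """After reset: restore the saved state if a record exists."""
        if not os.path.exists(self.nv_path):
            return False
        with open(self.nv_path) as f:
            self.state = json.load(f)
        return True


nv_file = os.path.join(tempfile.mkdtemp(), "powerfail_state.json")
unit = ControlUnit(nv_file)
unit.state = {"setpoint": 42.5, "step": 7}
unit.on_power_fail_interrupt()  # power-down detected by hardware

restarted = ControlUnit(nv_file)  # power has returned and the CPU has reset
resumed = restarted.on_power_up()
```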

 

(5) Control unit redundancy design

 

Common control unit redundancy designs include hot backup parallel redundancy and cold backup parallel redundancy. Both trade additional hardware investment for higher reliability of the system hardware.

 

1) Parallel redundancy means that several control units with the same function operate in parallel and execute the same processing program synchronously. As long as at least one control unit in the parallel system works normally, the whole system keeps operating normally.
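
Because a parallel system fails only when every unit has failed, its reliability can be estimated with a simple formula. The sketch below assumes the units fail independently and share the same reliability, which real systems only approximate (common-cause failures are ignored); the numbers are illustrative.

```python
def parallel_reliability(unit_reliability, n_units):
    """Probability that at least one of n independent units is working."""
    return 1.0 - (1.0 - unit_reliability) ** n_units


def units_needed(unit_reliability, target):
    """Smallest number of parallel units that reaches the target reliability."""
    if not 0.0 < unit_reliability <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < unit_reliability <= 1 and 0 <= target < 1")
    n = 1
    while parallel_reliability(unit_reliability, n) < target:
        n += 1
    return n


# Hypothetical control unit with a reliability of 0.9 over the mission time.
dual = parallel_reliability(0.9, 2)
needed = units_needed(0.9, 0.9995)
print(f"two units: {dual:.4f}; units needed for 0.9995: {needed}")
```

Each added unit brings a smaller gain than the previous one while the cost grows linearly, which is one reason dual-unit arrangements are so common.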

 

To balance reliability against cost, the dual-machine hot backup parallel mode is often used. From the point of view of the controlled system, only one control unit performs the measurement and control task, while the other runs in parallel in standby. Both control units nevertheless execute the same program synchronously. As soon as the self-checking system finds a fault in the main control unit, the standby control unit switches in automatically and replaces it, so that the system continues to run normally. The design of a dual-machine hot backup system has to solve the following two main problems:

 

A. Dual-machine synchronization. Dual-machine synchronization typically uses events as synchronization points, where the events are defined by the designer. Suppose, for example, that the system works as follows: the input interface collects the data sent by the sensor, the CPU compares the collected data with the set value and processes it, and the control output is then produced. This process can be divided into two events: data acquisition and data processing.

 

When the application system starts, both machines execute the first event, the acquisition of status data, at the same time. When the first event completes, the two results are compared. If they agree, both machines continue with the second event; if the main control unit is in error, the system switches automatically and the backup control unit replaces it. As long as the main control unit works properly, the standby unit remains on standby.

 

In the data processing event, if the two results differ by more than the allowed accuracy range, one of them is assumed to be wrong. Both machines can then roll back to the starting address of the event and execute it again. If the results still disagree, the system proceeds to troubleshooting. This software rollback method eliminates the influence of some transient, accidental factors.
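
The compare and rollback step can be sketched as below. The two units are simulated by ordinary functions, the processing (doubling the input) is a placeholder, and all names and the tolerance are assumptions made for the example.

```python
def make_unit(glitch_calls=()):
    """Simulated control unit; calls listed in glitch_calls return a corrupted result."""
    call_count = 0

    def unit(raw_value):
        nonlocal call_count
        corrupted = call_count in glitch_calls
        call_count += 1
        result = raw_value * 2.0  # placeholder for the real data processing
        return result + 100.0 if corrupted else result

    return unit


def run_event(raw_value, main_unit, standby_unit, tolerance=1e-6, max_rollbacks=1):
    """Run one event on both units; roll back and retry if the results disagree."""
    for rollbacks in range(max_rollbacks + 1):
        main_result = main_unit(raw_value)
        standby_result = standby_unit(raw_value)
        if abs(main_result - standby_result) <= tolerance:
            return main_result, rollbacks
    raise RuntimeError("results still disagree after rollback; start troubleshooting")


# The main unit suffers a transient glitch on its first execution only.
result, rollbacks = run_event(10.0, make_unit(glitch_calls={0}), make_unit())
print(f"accepted result {result} after {rollbacks} rollback(s)")
```

A transient disturbance disappears on the second pass, whereas a persistent fault keeps producing a mismatch and is handed over to fault detection.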

 

B. Fault detection. Each of the two machines can run its own self-check routine to detect faults in its control unit. If the faulty machine is the main control unit, control is switched automatically so that the program continues with the next event. To make switching timely, additional events can be defined according to the characteristics of the task, which increases the number of synchronization checks between the two machines.

 

Switching relies on the two machines exchanging their states through the input and output interfaces, so that when one control unit fails, the other learns of it in time. When the backup control unit finds that the main control unit is faulty, it sends a control signal that makes the main control unit withdraw from control automatically. The backup control unit then takes over and the system continues to operate normally.
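
A minimal model of the status exchange and switchover is shown below. The classes and attribute names are hypothetical, and the self-check is reduced to a flag; a real unit would run memory, I/O, and timing diagnostics and exchange the result over a dedicated interface.

```python
class Unit:
    """Simulated control unit with a trivial self-check."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.in_control = False

    def self_check(self):
        return self.healthy


class HotStandbyPair:
    """Main and standby units that exchange status every cycle."""

    def __init__(self, main, standby):
        self.main = main
        self.standby = standby
        self.main.in_control = True

    def exchange_status(self):
        """One status-exchange cycle; returns the name of the unit in control."""
        if not self.main.self_check() and self.standby.self_check():
            self.main.in_control = False    # faulty unit withdraws from control
            self.standby.in_control = True  # standby takes over
            self.main, self.standby = self.standby, self.main
        return self.main.name


unit_a, unit_b = Unit("A"), Unit("B")
pair = HotStandbyPair(unit_a, unit_b)
before_fault = pair.exchange_status()
unit_a.healthy = False  # fault appears in the main unit
after_fault = pair.exchange_status()
```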

 

2) In cold backup parallel redundancy, the backup control unit is normally not powered and is brought into service in place of the main control unit only when the main unit is found to be faulty. The cold backup control unit is identical to the main control unit in hardware structure and software, and all of its connected equipment is already installed, so that it can go into normal operation as soon as power is applied.

 

Switching from the main unit to the cold backup can be done manually or automatically. For automatic switching, the main control unit must provide alarm signals for all (or at least the key) monitored conditions. If an over-limit condition is detected, a switching signal is output in time to close the power contact of the cold backup system, so that the backup unit is put into normal operation.
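
The automatic switching logic can be sketched as a limit check that closes the backup's power contact. The monitored quantities, their limits, and the class and function names are all assumptions made for the example.

```python
class ColdBackup:
    """Simulated cold backup unit, unpowered until its contact is closed."""

    def __init__(self):
        self.powered = False

    def close_power_contact(self):
        self.powered = True


def check_alarms(readings, limits):
    """Return the names of all readings outside their (low, high) limits."""
    return [
        name
        for name, value in readings.items()
        if not (limits[name][0] <= value <= limits[name][1])
    ]


def supervise(readings, limits, backup):
    """Power up the cold backup as soon as any alarm is raised."""
    alarms = check_alarms(readings, limits)
    if alarms and not backup.powered:
        backup.close_power_contact()
    return alarms


# Hypothetical alarm limits for the main control unit.
limits = {"temperature_c": (0.0, 85.0), "supply_v": (4.75, 5.25)}
backup = ColdBackup()

normal_alarms = supervise({"temperature_c": 40.0, "supply_v": 5.0}, limits, backup)
powered_when_normal = backup.powered
fault_alarms = supervise({"temperature_c": 95.0, "supply_v": 5.0}, limits, backup)
```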