CQBIC "Machine Check" Bug

New Update (02/08/2006): this bug is solved. I ran into this bug again and was very frustrated. I worked hard with my logic analyzer, took pictures of the chronograms and searched for clues. After days of hard work, I concluded that my hardware and software were not causing the problem. Something had happened to the VAX3900. I took out the boards and did two things:

  • I exchanged the order of the two memory boards
  • After finding that the leaking Ni-Cd battery had corroded the front panel and the small PCB attached to it, I took out the battery, disassembled the board and panel, washed them with a brush, soap and baking soda, and put them back.
I did nothing else, and the machine check bug was gone.

Old update: Maybe it was my fault that I did not monitor the DMR line when performing DMA. I added the circuit and I haven't met the problem since.

System settings

  • KA655, 2x MS650-BA, MSCP SCSI board with CDROM and HD
  • NetBSD 1.5.2
Symptoms (Update 2006/02/08, the following diagnoses must be wrong)
  • If the MSCP SCSI board performs two burst mode DMA transactions in one DMA grant, the machine gets machine check 10 errors (a 10us timeout when the CPU accesses memory that is mapped to the QBUS) from time to time.
  • If the board runs one transaction per grant (single shot DMA), the problem occurs much less often, but it still happens.
  • Some of the DIN cycles are long. It is the CQBIC controller that prolongs the DMA (slow RPLY in response to DIN). I recorded a 6us single shot DMA and a 9us double shot burst mode DMA. See the "long DMA" diagram below.
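
The arithmetic behind these numbers can be laid out as a small Python sketch. It assumes, as this (possibly wrong) diagnosis does, that a CPU access to Qbus memory may be held off for the full length of a DMA grant. The function and constant names are made up for illustration.

```python
# Headroom between the measured DMA lengths and the 10 us timeout.
# Assumption (not verified on hardware): the CPU's Qbus memory access
# may have to wait for the whole DMA grant to finish.

TIMEOUT_US = 10.0  # machine check 10 timeout quoted in the text

# DMA lengths recorded with the logic analyzer (values from the text).
MEASURED_US = {
    "single shot": 6.0,
    "two-transaction burst": 9.0,
}


def headroom_us(dma_us, timeout_us=TIMEOUT_US):
    """Time left before the timeout if the CPU waits for the whole DMA."""
    return timeout_us - dma_us


if __name__ == "__main__":
    for name, dma_us in MEASURED_US.items():
        print(f"{name}: {dma_us} us DMA, {headroom_us(dma_us)} us headroom")
```

Under that assumption the two-transaction burst leaves about 1us of slack and the single shot about 4us, which would fit the observation that bursts fail more often than single shots.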
Solutions (Update 2006/02/08, the following diagnoses must be wrong)
  • The OS should retry the memory access after a machine check and keep going.
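
The retry idea can be illustrated with a user-level simulation in Python. This is only a sketch of the control flow, not NetBSD's machine check handler: `MachineCheck`, `access_with_retry` and `make_flaky_access` are hypothetical names, and the "memory access" is a plain function that fails a fixed number of times.

```python
class MachineCheck(Exception):
    """Stand-in for a machine check raised by a Qbus memory access."""


def access_with_retry(access, max_retries=3):
    """Call access(); on MachineCheck, retry up to max_retries times.

    Returns (value, number_of_retries_used). Re-raises the last
    MachineCheck if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            return access(), attempt
        except MachineCheck:
            if attempt == max_retries:
                raise


def make_flaky_access(failures, value):
    """Build a simulated access that fails `failures` times, then succeeds."""
    remaining = [failures]

    def access():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise MachineCheck("simulated 10 us timeout")
        return value

    return access


if __name__ == "__main__":
    value, retries = access_with_retry(make_flaky_access(2, 0x1234))
    print(f"read {value:#x} after {retries} retries")
```

The retry count is bounded so that a permanently failing access still surfaces as an error instead of looping forever.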
Detailed description and first hand materials (Update 02/08/2006: I later came to suspect a strange DMR: a lone DMR following my own DMR, with no DMG or SACK. I have photos.)
  • The NetBSD 1.5.2 installer reported "machine check 10" and "CDAL parity error", while the DMASER (DMA system error register) reported dmaser=0x88, which means a 10us timeout occurred when the CPU was accessing QBUS memory (page 3-97 of the KA655 CPU technical manual; google "ka655tm1" for an online version). A trap happened after the installer tried to recover. Note that in the first two screen images, the machine check happened when the CPU was executing the same instruction (same PC). The installer was busy unzipping the NetBSD .tgz files.
    • screen image One incident. dmaser bit 7: timeout when the CPU was accessing QBUS memory; bit 3: another machine check happened before this one, so the Qbus memory and DMA addresses were not captured.
    • screen image Another incident
    • screen image Yet another incident (when compiling the kernel). This one has a different PC. Could it be the same PC mapped to a different virtual address?
    • A long DMA diagram. It is negative logic. It is a burst mode DMA with two transactions and takes about 9us (see the diagram). The Q22 DMA protocol allows up to 4 transactions in one bus grant. The reason why this DMA is long (slow) is that the KA655 wasted a lot of time before responding to DIN with RPLY. Maybe the CPU was competing for Qbus memory access and the CQBIC was busy handling that.
  • Some Google results. Nobody provided an explanation. I once found a post saying that some NetBSD versions had this problem with some CMD MSCP SCSI controllers, but I failed to find it the second time.
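
As a quick check of the dmaser=0x88 reading above, here is a minimal Python sketch that picks out the set bits. It decodes only the two bits described in these notes (7 and 3); every other bit is left undecoded, and the names `DMASER_BITS` and `decode_dmaser` are made up for this example.

```python
# Only the two DMASER bits described in these notes are listed here.
# The meaning of the remaining bits is deliberately not guessed at.
DMASER_BITS = {
    7: "timeout when the CPU was accessing Qbus memory",
    3: "an earlier machine check happened; addresses not captured",
}


def decode_dmaser(value):
    """Return ([(bit, description), ...], leftover_undecoded_bits)."""
    known = [
        (bit, text)
        for bit, text in sorted(DMASER_BITS.items(), reverse=True)
        if value & (1 << bit)
    ]
    known_mask = sum(1 << bit for bit in DMASER_BITS)
    return known, value & ~known_mask


if __name__ == "__main__":
    flags, leftover = decode_dmaser(0x88)
    for bit, text in flags:
        print(f"bit {bit}: {text}")
    print(f"undecoded bits: {leftover:#x}")
```

0x88 is binary 1000 1000, so exactly bits 7 and 3 are set and nothing is left over, which matches the reading of the first screen image.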