What is the correct sequence of steps for replacing a disk in RAID1?

One disk has dropped out of a software RAID on a remote server.

cat /proc/mdstat
Personalities: [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2: active raid1 sdb5[2]
11716536 blocks super 1.2 [2/1] [_U]

md1: active raid1 sdb6[2]
466899832 blocks super 1.2 [2/1] [_U]

md0: active raid1 sdb1[2]
9763840 blocks super 1.2 [2/1] [_U]

unused devices: <none>


fdisk -l
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000142cd

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 1216 9764864 fd Linux raid autodetect
/dev/sdb2 1216 60802 478618625 5 Extended
/dev/sdb5 59343 60802 11717632 fd Linux raid autodetect
/dev/sdb6 1216 59343 466900992 fd Linux raid autodetect

Partition table entries are not in disk order

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000142cd

Device Boot Start End Blocks Id System
/dev/sda1 * 1 1216 9764864 fd Linux raid autodetect
/dev/sda2 1216 60802 478618625 5 Extended
/dev/sda5 59343 60802 11717632 fd Linux raid autodetect
/dev/sda6 1216 59343 466900992 fd Linux raid autodetect

Partition table entries are not in disk order

Disk /dev/md0: 9998 MB, 9998172160 bytes
2 heads, 4 sectors/track, 2440960 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 478.1 GB, 478105427968 bytes
2 heads, 4 sectors/track, 116724958 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 12.0 GB, 11997732864 bytes
2 heads, 4 sectors/track, 2929134 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/md2 doesn't contain a valid partition table


smartctl -a /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.11 family
Device Model: ST3500320AS
Serial Number: 5QM2ZF25
Firmware Version: SD15
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed May 15 13:43:42 2013 MSK

==> WARNING: There are known problems with these drives,
AND THIS FIRMWARE VERSION IS AFFECTED,
see the following Seagate web pages:
seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931
seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207951

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 642) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 119) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 202670728
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 23
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2045
7 Seek_Error_Rate 0x000f 069 058 030 Pre-fail Always - 142044796788
9 Power_On_Hours 0x0032 060 060 000 Old_age Always - 35870
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 23
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 099 097 000 Old_age Always - 106
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 075 067 045 Old_age Always - 25 (Lifetime Min/Max 23/25)
194 Temperature_Celsius 0x0022 025 040 000 Old_age Always - 25 (0 14 0 0)
195 Hardware_ECC_Recovered 0x001a 038 010 000 Old_age Always - 202670728
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 33897 hours (1412 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
a1 00 00 00 00 00 a0 00 00:04:06.301 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 00 00:04:06.300 IDENTIFY DEVICE
00 00 00 00 00 00 00 04 00:04:06.140 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 00:04:05.770 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 00 00:04:00.751 IDENTIFY PACKET DEVICE

Error 6 occurred at disk power-on lifetime: 33897 hours (1412 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 00 00:04:06.300 IDENTIFY DEVICE
00 00 00 00 00 00 00 04 00:04:06.140 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 00:04:05.770 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 00 00:04:00.751 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 00 00:04:00.750 IDENTIFY DEVICE

Error 5 occurred at disk power-on lifetime: 33897 hours (1412 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
a1 00 00 00 00 00 a0 00 00:04:00.751 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 00 00:04:00.750 IDENTIFY DEVICE
00 00 00 00 00 00 00 04 00:04:00.590 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 00:04:00.220 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 00 00:03:55.201 IDENTIFY PACKET DEVICE

Error 4 occurred at disk power-on lifetime: 33897 hours (1412 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 a0 00 00:04:00.750 IDENTIFY DEVICE
00 00 00 00 00 00 00 04 00:04:00.590 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 00:04:00.220 NOP [Abort queued commands]
a1 00 00 00 00 00 a0 00 00:03:55.201 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 00 00:03:55.200 IDENTIFY DEVICE

Error 3 occurred at disk power-on lifetime: 33897 hours (1412 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 71 04 9d 00 32 e0 Device Fault; Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
a1 00 00 00 00 00 a0 00 00:03:55.201 IDENTIFY PACKET DEVICE
ec 00 00 00 00 00 a0 00 00:03:55.200 IDENTIFY DEVICE
00 00 00 00 00 00 00 04 00:03:55.040 NOP [Abort queued commands]
00 00 00 00 00 00 00 ff 00:03:54.670 NOP [Abort queued commands]
60 00 00 50 3e 93 40 00 00:03:46.868 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Earlier, the second disk in the array failed; the hosting provider's admin simply swapped it for another one and the RAID came back up on its own. Now the old first disk has failed. It was replaced with a new one, but on reboot the server drops into the GRUB rescue prompt:

grub rescue> no such partition

In that state, ls shows
(hd0) (hd1) (hd1, msdos1) (hd1, msdos2) (hd1, msdos3)

but ls (hdX,Y)/ reports unknown filesystem for every partition.

I suspect GRUB is the culprit, but I am not sure. Links and advice are welcome.
Accepted solutions: 1
JetMaster
@JetMaster Question author
> Yes, it looks as though GRUB was not installed correctly on the disk when it was replaced.
Yes, which is why, before the disk was replaced, I first ran
grub-install /dev/sdb

Check /etc/initramfs-tools/conf.d/mdadm: it must contain BOOT_DEGRADED=true so that the system is allowed to boot from a degraded array.
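That check can be scripted. The sketch below assumes the Debian/Ubuntu file layout named above; the function name boot_degraded_enabled is made up for this example. If the value has to be changed, the initramfs usually needs to be rebuilt (update-initramfs -u) before the setting takes effect.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given config file sets BOOT_DEGRADED=true.
# The default path is the one mentioned in the answer; pass another path to override it.
boot_degraded_enabled() {
    conf=${1:-/etc/initramfs-tools/conf.d/mdadm}
    [ -r "$conf" ] &&
        grep -Eq '^[[:space:]]*BOOT_DEGRADED="?true"?[[:space:]]*$' "$conf"
}

# Possible use:
#   boot_degraded_enabled || echo "BOOT_DEGRADED is not enabled"
```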

Then replace the disk:

1. Check that the new disk is detected
cat /proc/scsi/scsi

2. Run fdisk -l; for the new disk it should report something like "Disk /dev/sda doesn't contain a valid partition table"

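3. Copy the partition layout from the surviving disk (/dev/sdb) to the new, empty one (/dev/sda)
sfdisk -d /dev/sdb | sfdisk --force /dev/sda
--force is needed, otherwise sfdisk complains that the new disk is not partitioned. Double-check the direction: the dump must come from the surviving disk, or its partition table will be overwritten.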

4. Check fdisk -l /dev/sda /dev/sdb: both disks must now have identical partition tables

5. Add the new disk's partitions to the arrays
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda6
mdadm --manage /dev/md2 --add /dev/sda5

6. Watch the resync progress with cat /proc/mdstat
md2: active raid1 sdb5[2] sda5[3]
11716536 blocks super 1.2 [2/2] [UU]

md1: active raid1 sdb6[2] sda6[3]
[>....................] recovery = 0.1% (97408/71585536) finish=73.3min speed=76234K/sec

md0: active raid1 sda1[3] sdb1[2]
9763840 blocks super 1.2 [2/2] [UU]

unused devices: <none>

7. Finally, run grub-install /dev/sda and update-grub
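For reference, steps 3 to 7 can be generated from /proc/mdstat instead of being typed by hand. This is only a dry-run sketch: it prints the commands and executes nothing. The function name plan_replacement and its two arguments (surviving disk, new disk) are invented for this example, and it assumes sdX-style device names. The partition table is dumped from the surviving disk and written to the new one.

```shell
#!/bin/sh
# Dry run: print (do not execute) the commands for replacing a RAID1 member.
# $1 = surviving disk (e.g. sdb), $2 = new empty disk (e.g. sda).
# Reads /proc/mdstat-style text on stdin.
plan_replacement() {
    good=$1
    new=$2
    echo "sfdisk -d /dev/$good | sfdisk --force /dev/$new"
    # For every array that has a member on the surviving disk,
    # add the partition with the same number from the new disk.
    awk -v good="$good" -v new="$new" '
        /^md[0-9]+ *:/ {
            name = $1
            sub(/:$/, "", name)
            for (i = 2; i <= NF; i++) {
                if (index($i, good) == 1) {
                    part = $i
                    sub(/\[.*$/, "", part)                 # sdb5[2] becomes sdb5
                    part = substr(part, length(good) + 1)  # sdb5 becomes 5
                    print "mdadm --manage /dev/" name " --add /dev/" new part
                }
            }
        }
    '
    echo "grub-install /dev/$new"
    echo "update-grub"
}

# Possible use on the degraded system:
#   plan_replacement sdb sda < /proc/mdstat
```

Review the printed commands before running any of them, especially the sfdisk line, since it overwrites the partition table of the target disk.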
Other answers: 2
merlin-vrn
@merlin-vrn
Yes, it looks as though GRUB was not installed correctly on the disk when it was replaced.

It is strange that it sees no filesystem on any partition. Judging by the partition names, this is grub2, right?

As an option, try connecting the second (working) disk in place of the first. When installing GRUB, it helps to tell it, for each disk, that this disk is the first and only one. I can't say offhand how to do that in grub2; in legacy GRUB it takes three commands in the grub shell: device (hd0) /dev/sdX, root (hd0,x), setup (hd0).
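Those three legacy GRUB commands can be fed to the grub shell in batch mode. The sketch below only prints them; the function name grub_legacy_commands is made up here, and the assumption is that /boot lives on the partition number passed as the second argument. It applies to GRUB legacy only, not to grub2.

```shell
#!/bin/sh
# Print GRUB legacy shell commands that install the boot code on one disk
# while presenting that disk to GRUB as (hd0), the first and only disk.
# $1 = disk device, $2 = partition number holding /boot (1-based, as in /dev/sdX1).
grub_legacy_commands() {
    disk=$1
    part_index=$(( $2 - 1 ))   # GRUB legacy counts partitions from 0
    printf 'device (hd0) %s\n' "$disk"
    printf 'root (hd0,%s)\n' "$part_index"
    printf 'setup (hd0)\n'
    printf 'quit\n'
}

# Possible use on a GRUB legacy system, repeated for each array member:
#   grub_legacy_commands /dev/sdb 1 | grub --batch
```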
IlyaEvseev
@IlyaEvseev
Opensource geek
After replacing a disk you need to run grub-install on the new disk so that it writes the stage1.5 code (core.img in grub2) into the first track; update-grub on its own only regenerates the configuration.
Was that done? If not, boot from a rescue system,
then do mdadm --assemble, mount, chroot and grub-install.

Second question: was GRUB installed to (md0) or to (hd0)?
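A rough sketch of that rescue sequence follows. The defaults are assumptions, not facts from the thread: root filesystem on /dev/md0, mount point /mnt, GRUB target /dev/sda. The helper names run and rescue_reinstall_grub are invented for this example. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = md device holding the root filesystem (assumed /dev/md0)
# $2 = disk to install GRUB on (assumed /dev/sda)
# $3 = mount point in the rescue system (assumed /mnt)
rescue_reinstall_grub() {
    root_md=${1:-/dev/md0}
    target=${2:-/dev/sda}
    mnt=${3:-/mnt}
    run mdadm --assemble --scan
    run mount "$root_md" "$mnt"
    # The chroot needs the device, process and sysfs trees of the rescue system.
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$mnt/$fs"
    done
    run chroot "$mnt" grub-install "$target"
    run chroot "$mnt" update-grub
}

# Possible use from a rescue shell, printing the plan first:
#   DRY_RUN=1; rescue_reinstall_grub /dev/md0 /dev/sda
```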